1
|
Mswahili ME, Hwang J, Rajapakse JC, Jo K, Jeong YS. Positional embeddings and zero-shot learning using BERT for molecular-property prediction. J Cheminform 2025; 17:17. [PMID: 39910649 PMCID: PMC11800558 DOI: 10.1186/s13321-025-00959-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2024] [Accepted: 01/18/2025] [Indexed: 02/07/2025] Open
Abstract
Recently, advancements in cheminformatics such as representation learning for chemical structures, deep learning (DL) for property prediction, data-driven discovery, and optimization of chemical data handling, have led to increased demands for handling chemical simplified molecular input line entry system (SMILES) data, particularly in text analysis tasks. These advancements have driven the need to optimize components like positional encoding and positional embeddings (PEs) in transformer model to better capture the sequential and contextual information embedded in molecular representations. SMILES data represent complex relationships among atoms or elements, rendering them critical for various learning tasks within the field of cheminformatics. This study addresses the critical challenge of encoding complex relationships among atoms in SMILES strings to explore various PEs within the transformer-based framework to increase the accuracy and generalization of molecular property predictions. The success of transformer-based models, such as the bidirectional encoder representations from transformer (BERT) models, in natural language processing tasks has sparked growing interest from the domain of cheminformatics. However, the performance of these models during pretraining and fine-tuning is significantly influenced by positional information such as PEs, which help in understanding the intricate relationships within sequences. Integrating position information within transformer architectures has emerged as a promising approach. This encoding mechanism provides essential supervision for modeling dependencies among elements situated at different positions within a given sequence. In this study, we first conduct pretraining experiments using various PEs to explore diverse methodologies for incorporating positional information into the BERT model for chemical text analysis using SMILES strings. Next, for each PE, we fine-tune the best-performing BERT (masked language modeling) model on downstream tasks for molecular-property prediction. Here, we use two molecular representations, SMILES and DeepSMILES, to comprehensively assess the potential and limitations of the PEs in zero-shot learning analysis, demonstrating the model's proficiency in predicting properties of unseen molecular representations in the context of newly proposed and existing datasets.Scientific contributionThis study explores the unexplored potential of PEs using BERT model for molecular property prediction. The study involved pretraining and fine-tuning the BERT model on various datasets related to COVID-19, bioassay data, and other molecular and biological properties using SMILES and DeepSMILES representations. The study details the pretraining architecture, fine-tuning datasets, and the performance of the BERT model with different PEs. It also explores zero-shot learning analysis and the model's performance on various classification and regression tasks. In this study, newly proposed datasets from different domains were introduced during fine-tuning in addition to the existing and commonly used datasets. The study highlights the robustness of the BERT model in predicting chemical properties and its potential applications in cheminformatics and bioinformatics.
Collapse
Affiliation(s)
- Medard Edmund Mswahili
- Department of Computer Engineering, Chungbuk National University, Cheongju, 28644, South Korea
| | - JunHa Hwang
- Department of Computer Engineering, Chungbuk National University, Cheongju, 28644, South Korea
| | - Jagath C Rajapakse
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
| | - Kyuri Jo
- Department of Computer Engineering, Chungbuk National University, Cheongju, 28644, South Korea.
| | - Young-Seob Jeong
- Department of Computer Engineering, Chungbuk National University, Cheongju, 28644, South Korea.
| |
Collapse
|
2
|
Mswahili ME, Jeong YS. Transformer-based models for chemical SMILES representation: A comprehensive literature review. Heliyon 2024; 10:e39038. [PMID: 39640612 PMCID: PMC11620068 DOI: 10.1016/j.heliyon.2024.e39038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Revised: 09/26/2024] [Accepted: 10/05/2024] [Indexed: 12/07/2024] Open
Abstract
Pre-trained chemical language models (CLMs) have attracted increasing attention within the domains of cheminformatics and bioinformatics, inspired by their remarkable success in the natural language processing (NLP) domain such as speech recognition, text analysis, translation, and other objectives associated with language. Furthermore, the vast amount of unlabeled data associated with chemical compounds or molecules has emerged as a crucial research focus, prompting the need for CLMs with reasoning capabilities over such data. Molecular graphs and molecular descriptors are the predominant approaches to representing molecules for property prediction in machine learning (ML). However, Transformer-based LMs have recently emerged as de-facto powerful tools in deep learning (DL), showcasing outstanding performance across various NLP downstream tasks, particularly in text analysis. Within the realm of pre-trained transformer-based LMs such as, BERT (and its variants) and GPT (and its variants) have been extensively explored in the chemical informatics domain. Various learning tasks in cheminformatics such as the text analysis that necessitate handling of chemical SMILES data which contains intricate relations among elements or atoms, have become increasingly prevalent. Whether the objective is predicting molecular reactions or molecular property prediction, there is a growing demand for LMs capable of learning molecular contextual information within SMILES sequences or strings from text inputs (i.e., SMILES). This review provides an overview of the current state-of-the-art of chemical language Transformer-based LMs in chemical informatics for de novo design, and analyses current limitations, challenges, and advantages. Finally, a perspective on future opportunities is provided in this evolving field.
Collapse
Affiliation(s)
- Medard Edmund Mswahili
- Chungbuk National University, Department of Computer Engineering, Cheongju, 28644, South Korea
| | - Young-Seob Jeong
- Chungbuk National University, Department of Computer Engineering, Cheongju, 28644, South Korea
| |
Collapse
|
3
|
Wei L, Li Q, Song Y, Stefanov S, Dong R, Fu N, Siriwardane EMD, Chen F, Hu J. Crystal Composition Transformer: Self-Learning Neural Language Model for Generative and Tinkering Design of Materials. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2304305. [PMID: 39101275 PMCID: PMC11423232 DOI: 10.1002/advs.202304305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Revised: 07/09/2024] [Indexed: 08/06/2024]
Abstract
Self-supervised neural language models have recently achieved unprecedented success from natural language processing to learning the languages of biological sequences and organic molecules. These models have demonstrated superior performance in the generation, structure classification, and functional predictions for proteins and molecules with learned representations. However, most of the masking-based pre-trained language models are not designed for generative design, and their black-box nature makes it difficult to interpret their design logic. Here a Blank-filling Language Model for Materials (BLMM) Crystal Transformer is proposed, a neural network-based probabilistic generative model for generative and tinkering design of inorganic materials. The model is built on the blank-filling language model for text generation and has demonstrated unique advantages in learning the "materials grammars" together with high-quality generation, interpretability, and data efficiency. It can generate chemically valid materials compositions with as high as 89.7% charge neutrality and 84.8% balanced electronegativity, which are more than four and eight times higher compared to a pseudo-random sampling baseline. The probabilistic generation process of BLMM allows it to recommend materials tinkering operations based on learned materials chemistry, which makes it useful for materials doping. The model is applied to discover a set of new materials as validated using the Density Functional Theory (DFT) calculations. This work thus brings the unsupervised transformer language models based generative artificial intelligence to inorganic materials. A user-friendly web app for tinkering materials design has been developed and can be accessed freely at www.materialsatlas.org/blmtinker.
Collapse
Affiliation(s)
- Lai Wei
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | - Qinyang Li
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | - Yuqi Song
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
- Department of Computer ScienceUniversity of Southern MainePortlandME04131USA
| | - Stanislav Stefanov
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | - Rongzhi Dong
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | - Nihang Fu
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | | | - Fanglin Chen
- Department of Mechanical EngineeringUniversity of South CarolinaColumbiaSC29201USA
| | - Jianjun Hu
- Department of Computer Science and EngineeringUniversity of South CarolinaColumbiaSC29201USA
| |
Collapse
|
4
|
Niu D, Zhang L, Zhang B, Zhang Q, Li Z. DAS-DDI: A dual-view framework with drug association and drug structure for drug-drug interaction prediction. J Biomed Inform 2024; 156:104672. [PMID: 38857738 DOI: 10.1016/j.jbi.2024.104672] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2024] [Revised: 05/09/2024] [Accepted: 06/06/2024] [Indexed: 06/12/2024]
Abstract
In drug development and clinical application, drug-drug interaction (DDI) prediction is crucial for patient safety and therapeutic efficacy. However, traditional methods for DDI prediction often overlook the structural features of drugs and the complex interrelationships between them, which affect the accuracy and interpretability of the model. In this paper, a novel dual-view DDI prediction framework, DAS-DDI is proposed. Firstly, a drug association network is constructed based on similarity information among drugs, which could provide rich context information for DDI prediction. Subsequently, a novel drug substructure extraction method is proposed, which could update the features of nodes and chemical bonds simultaneously, improving the comprehensiveness of the feature. Furthermore, an attention mechanism is employed to fuse multiple drug embeddings from different views dynamically, enhancing the discriminative ability of the model in handling multi-view data. Comparative experiments on three public datasets demonstrate the superiority of DAS-DDI compared with other state-of-the-art models under two scenarios.
Collapse
Affiliation(s)
- Dongjiang Niu
- College of Computer Science and Technology, Qingdao University, No. 308 Ningxia Road, Qingdao, 266071, Shandong, China
| | - Lianwei Zhang
- College of Computer Science and Technology, Qingdao University, No. 308 Ningxia Road, Qingdao, 266071, Shandong, China
| | - Beiyi Zhang
- College of Computer Science and Technology, Qingdao University, No. 308 Ningxia Road, Qingdao, 266071, Shandong, China
| | - Qiang Zhang
- College of Computer Science and Technology, Qingdao University, No. 308 Ningxia Road, Qingdao, 266071, Shandong, China
| | - Zhen Li
- College of Computer Science and Technology, Qingdao University, No. 308 Ningxia Road, Qingdao, 266071, Shandong, China.
| |
Collapse
|
5
|
Son YH, Shin DH, Kam TE. FTMMR: Fusion Transformer for Integrating Multiple Molecular Representations. IEEE J Biomed Health Inform 2024; 28:4361-4372. [PMID: 38551824 DOI: 10.1109/jbhi.2024.3383221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/04/2024]
Abstract
Molecular property prediction has gained substantial attention due to its potential for various bio-chemical applications. Numerous attempts have been made to enhance the performance by combining multiple molecular representations (1D, 2D, and 3D). However, most prior works only merged a limited number of representations or tried to embed multiple representations through a single network without using representation-specific networks. Furthermore, the heterogeneous characteristics of each representation made the fusion more challenging. Addressing these challenges, we introduce the Fusion Transformer for Multiple Molecular Representations (FTMMR) framework. Our strategy employs three distinct representation-specific networks and integrates information from each network using a fusion transformer architecture to generate fused representations. Additionally, we use self-supervised learning methods to align heterogeneous representations and to effectively utilize the limited chemical data available. In particular, we adopt a combinatorial loss function to leverage the contrastive loss for all three representations. We evaluate the performance of FTMMR using seven benchmark datasets, demonstrating that our framework outperforms existing fusion and self-supervised methods.
Collapse
|
6
|
Tang X, Tran A, Tan J, Gerstein MB. MolLM: a unified language model for integrating biomedical text with 2D and 3D molecular representations. Bioinformatics 2024; 40:i357-i368. [PMID: 38940177 PMCID: PMC11256921 DOI: 10.1093/bioinformatics/btae260] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models' versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. RESULTS We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM's self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for learning, MolLM demonstrates robust molecular representation capabilities across four downstream tasks, including cross-modal molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks. AVAILABILITY AND IMPLEMENTATION Our code, data, pre-trained model weights, and examples of using our model are all available at https://github.com/gersteinlab/MolLM. In particular, we provide Jupyter Notebooks offering step-by-step guidance on how to use MolLM to extract embeddings for both molecules and text.
Collapse
Affiliation(s)
- Xiangru Tang
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Andrew Tran
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Jeffrey Tan
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
| | - Mark B Gerstein
- Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
- Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
- Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
| |
Collapse
|
7
|
van Tilborg D, Brinkmann H, Criscuolo E, Rossen L, Özçelik R, Grisoni F. Deep learning for low-data drug discovery: Hurdles and opportunities. Curr Opin Struct Biol 2024; 86:102818. [PMID: 38669740 DOI: 10.1016/j.sbi.2024.102818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 03/27/2024] [Accepted: 03/29/2024] [Indexed: 04/28/2024]
Abstract
Deep learning is becoming increasingly relevant in drug discovery, from de novo design to protein structure prediction and synthesis planning. However, it is often challenged by the small data regimes typical of certain drug discovery tasks. In such scenarios, deep learning approaches-which are notoriously 'data-hungry'-might fail to live up to their promise. Developing novel approaches to leverage the power of deep learning in low-data scenarios is sparking great attention, and future developments are expected to propel the field further. This mini-review provides an overview of recent low-data-learning approaches in drug discovery, analyzing their hurdles and advantages. Finally, we venture to provide a forecast of future research directions in low-data learning for drug discovery.
Collapse
Affiliation(s)
- Derek van Tilborg
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands; Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Princetonlaan 6, 3584 CB, Utrecht, the Netherlands. https://twitter.com/DerekvTilborg
| | - Helena Brinkmann
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands. https://twitter.com/hlnbrkmnn
| | - Emanuele Criscuolo
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands. https://twitter.com/emanuelecriscu9
| | - Luke Rossen
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands. https://twitter.com/molecular_ml
| | - Rıza Özçelik
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands; Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Princetonlaan 6, 3584 CB, Utrecht, the Netherlands. https://twitter.com/Rza_ozcelik
| | - Francesca Grisoni
- Institute for Complex Molecular Systems (ICMS), Department of Biomedical Engineering, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, the Netherlands; Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Princetonlaan 6, 3584 CB, Utrecht, the Netherlands.
| |
Collapse
|
8
|
Shen A, Yuan M, Ma Y, Du J, Wang M. Complementary multi-modality molecular self-supervised learning via non-overlapping masking for property prediction. Brief Bioinform 2024; 25:bbae256. [PMID: 38801702 PMCID: PMC11129775 DOI: 10.1093/bib/bbae256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2023] [Revised: 04/25/2024] [Accepted: 05/15/2024] [Indexed: 05/29/2024] Open
Abstract
Self-supervised learning plays an important role in molecular representation learning because labeled molecular data are usually limited in many tasks, such as chemical property prediction and virtual screening. However, most existing molecular pre-training methods focus on one modality of molecular data, and the complementary information of two important modalities, SMILES and graph, is not fully explored. In this study, we propose an effective multi-modality self-supervised learning framework for molecular SMILES and graph. Specifically, SMILES data and graph data are first tokenized so that they can be processed by a unified Transformer-based backbone network, which is trained by a masked reconstruction strategy. In addition, we introduce a specialized non-overlapping masking strategy to encourage fine-grained interaction between these two modalities. Experimental results show that our framework achieves state-of-the-art performance in a series of molecular property prediction tasks, and a detailed ablation study demonstrates efficacy of the multi-modality framework and the masking strategy.
Collapse
Affiliation(s)
- Ao Shen
- Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
| | - Mingzhi Yuan
- Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
| | - Yingfan Ma
- Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
| | - Jie Du
- Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
| | - Manning Wang
- Digital Medical Research Center, School of Basic Medical Sciences, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, 131 Dong’an Road, 200032, Shanghai, China
| |
Collapse
|
9
|
Zhao X, Xu J, Shui Y, Xu M, Hu J, Liu X, Che K, Wang J, Liu Y. PermuteDDS: a permutable feature fusion network for drug-drug synergy prediction. J Cheminform 2024; 16:41. [PMID: 38622663 PMCID: PMC11017561 DOI: 10.1186/s13321-024-00839-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Accepted: 04/03/2024] [Indexed: 04/17/2024] Open
Abstract
MOTIVATION Drug combination therapies have shown promise in clinical cancer treatments. However, it is hard to experimentally identify all drug combinations for synergistic interaction even with high-throughput screening due to the vast space of potential combinations. Although a number of computational methods for drug synergy prediction have proven successful in narrowing down this space, fusing drug pairs and cell line features effectively still lacks study, hindering current algorithms from understanding the complex interaction between drugs and cell lines. RESULTS In this paper, we proposed a Permutable feature fusion network for Drug-Drug Synergy prediction, named PermuteDDS. PermuteDDS takes multiple representations of drugs and cell lines as input and employs a permutable fusion mechanism to combine drug and cell line features. In experiments, PermuteDDS exhibits state-of-the-art performance on two benchmark data sets. Additionally, the results on independent test set grouped by different tissues reveal that PermuteDDS has good generalization performance. We believed that PermuteDDS is an effective and valuable tool for identifying synergistic drug combinations. It is publicly available at https://github.com/littlewei-lazy/PermuteDDS . SCIENTIFIC CONTRIBUTION First, this paper proposes a permutable feature fusion network for predicting drug synergy termed PermuteDDS, which extract diverse information from multiple drug representations and cell line representations. Second, the permutable fusion mechanism combine the drug and cell line features by integrating information of different channels, enabling the utilization of complex relationships between drugs and cell lines. Third, comparative and ablation experiments provide evidence of the efficacy of PermuteDDS in predicting drug-drug synergy.
Collapse
Affiliation(s)
- Xinwei Zhao
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China
| | - Junqing Xu
- The Second Clinical Medical School, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China
| | - Youyuan Shui
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China
| | - Mengdie Xu
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China
| | - Jie Hu
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China
- Institute of Medical Informatics and Management, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 210029, Jiangsu, China
| | - Xiaoyan Liu
- Faculty of Computing, Harbin Institute of Technology, No. 92 West Da Zhi St, Harbin, 150001, Heilongjiang, China
| | - Kai Che
- Xi'an Aeronautics Computing Technique Research Institute, AVIC, No. 156, TaiBai Nroth Road, Xi'an, 710068, Shanxi, China
- Aviation Key Laboratory of Science and Technology on Airborne and Missleborne Computer, Xi'an, 710065, Shanxi, China
| | - Junjie Wang
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China.
- Institute of Medical Informatics and Management, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 210029, Jiangsu, China.
| | - Yun Liu
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 211166, Jiangsu, China.
- Institute of Medical Informatics and Management, Nanjing Medical University, 101 Longmian Avenue, Nanjing, 210029, Jiangsu, China.
- Department of Information, the First Affiliated Hospital, Nanjing Medical University, No. 300 Guang Zhou Road, Nanjing, 210029, Jiangsu, China.
| |
Collapse
|
10
|
Ma M, Lei X. A deep learning framework for predicting molecular property based on multi-type features fusion. Comput Biol Med 2024; 169:107911. [PMID: 38160501 DOI: 10.1016/j.compbiomed.2023.107911] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Revised: 12/18/2023] [Accepted: 12/24/2023] [Indexed: 01/03/2024]
Abstract
Extracting expressive molecular features is essential for molecular property prediction. Sequence-based representation is a common representation of molecules, which ignores the structure information of molecules. While molecular graph representation has a weak ability in expressing the 3D structure. In this article, we try to make use of the advantages of different type representations simultaneously for molecular property prediction. Thus, we propose a fusion model named DLF-MFF, which integrates the multi-type molecular features. Specifically, we first extract four different types of features from molecular fingerprints, 2D molecular graph, 3D molecular graph and molecular image. Then, in order to learn molecular features individually, we use four essential deep learning frameworks, which correspond to four distinct molecular representations. The final molecular representation is created by integrating the four feature vectors and feeding them into prediction layer to predict molecular property. We compare DLF-MFF with 7 state-of-the-art methods on 6 benchmark datasets consisting of multiple molecular properties, the experimental results show that DLF-MFF achieves state-of-the-art performance on 6 benchmark datasets. Moreover, DLF-MFF is applied to identify potential anti-SARS-CoV-2 inhibitor from 2500 drugs. We predict probability of each drug being inferred as a 3CL protease inhibitor and also calculate the binding affinity scores between each drug and 3CL protease. The results show that DLF-MFF product better performance in the identification of anti-SARS-CoV-2 inhibitor. This work is expected to offer novel research perspectives for accurate prediction of molecular properties and provide valuable insights into drug repurposing for COVID-19.
Collapse
Affiliation(s)
- Mei Ma
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China; School of Mathematics and Statistics, Qinghai Normal University, Qinghai, 810000, China
| | - Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi'an, 710119, China.
| |
Collapse
|
11
|
Wang X, Patil N, Li F, Wang Z, Zhan H, Schmidt D, Thompson P, Guo Y, Landersdorfer CB, Shen HH, Peleg AY, Li J, Song J. PmxPred: A data-driven approach for the identification of active polymyxin analogues against gram-negative bacteria. Comput Biol Med 2024; 168:107681. [PMID: 37992470 DOI: 10.1016/j.compbiomed.2023.107681] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Revised: 10/07/2023] [Accepted: 11/06/2023] [Indexed: 11/24/2023]
Abstract
The multidrug-resistant Gram-negative bacteria has evolved into a worldwide threat to human health; over recent decades, polymyxins have re-emerged in clinical practice due to their high activity against multidrug-resistant bacteria. Nevertheless, the nephrotoxicity and neurotoxicity of polymyxins seriously hinder their practical use in the clinic. Based on the quantitative structure-activity relationship (QSAR), analogue design is an efficient strategy for discovering biologically active compounds with fewer adverse effects. To accelerate the polymyxin analogues discovery process and find the polymyxin analogues with high antimicrobial activity against Gram-negative bacteria, here we developed PmxPred, a GCN and catBoost-based machine learning framework. The RDKit descriptors were used for the molecule and residues representation, and the ensemble learning model was utilized for the antimicrobial activity prediction. This framework was trained and evaluated on multiple Gram-negative bacteria datasets, including Acinetobacter baumannii, Escherichia coli, Klebsiella pneumoniae, Pseudomonas aeruginosa and a general Gram-negative bacteria dataset achieving an AUROC of 0.857, 0.880, 0.756, 0.895 and 0.865 on the independent test, respectively. PmxPred outperformed the transfer learning method that trained on 10 million molecules. We interpreted our model well-trained model by analysing the importance of global and residue features. Overall, PmxPred provides a powerful additional tool for predicting active polymyxin analogues, and holds the potential elucidate the mechanisms underlying the antimicrobial activity of polymyxins. The source code is publicly available on GitHub (https://github.com/yanwu20/PmxPred).
Collapse
Affiliation(s)
- Xiaoyu Wang
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia; Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia
| | - Nitin Patil
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia
| | - Fuyi Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China; Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Melbourne, Victoria, Australia
| | - Zhikang Wang
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia; Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia
| | - Haolan Zhan
- Faculty of Information Technology, Monash University, Melbourne, VIC, 3800, Australia
| | - Daniel Schmidt
- Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia; Faculty of Information Technology, Monash University, Melbourne, VIC, 3800, Australia
| | - Philip Thompson
- Drug Delivery, Disposition and Dynamics, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
| | - Yuming Guo
- Department of Epidemiology and Preventive Medicine, School of Public Health and Preventive Medicine, Monash University, Melbourne, VIC, 3004, Australia
| | - Cornelia B Landersdorfer
- Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia; Drug Delivery, Disposition and Dynamics, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, Australia
| | - Hsin-Hui Shen
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Department of Materials Science and Engineering, Faculty of Engineering, Monash University, Clayton, VIC, 3800, Australia
| | - Anton Y Peleg
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia; Department of Infectious Diseases, Alfred Hospital, Alfred Health, Melbourne, Victoria, Australia
| | - Jian Li
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia; Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia.
| | - Jiangning Song
- Monash Biomedicine Discovery Institute, Monash University, Melbourne, VIC, 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, VIC, 3800, Australia; Centre to Impact AMR, Monash University, Melbourne, VIC, 3800, Australia.
| |
Collapse
|
12
|
Gao J, Shen Z, Xie Y, Lu J, Lu Y, Chen S, Bian Q, Guo Y, Shen L, Wu J, Zhou B, Hou T, He Q, Che J, Dong X. TransFoxMol: predicting molecular property with focused attention. Brief Bioinform 2023; 24:bbad306. [PMID: 37605947 DOI: 10.1093/bib/bbad306] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Revised: 07/17/2023] [Accepted: 08/04/2023] [Indexed: 08/23/2023] Open
Abstract
Predicting the biological properties of molecules is crucial in computer-aided drug development, yet it's often impeded by data scarcity and imbalance in many practical applications. Existing approaches are based on self-supervised learning or 3D data and using an increasing number of parameters to improve performance. These approaches may not take full advantage of established chemical knowledge and could inadvertently introduce noise into the respective model. In this study, we introduce a more elegant transformer-based framework with focused attention for molecular representation (TransFoxMol) to improve the understanding of artificial intelligence (AI) of molecular structure property relationships. TransFoxMol incorporates a multi-scale 2D molecular environment into a graph neural network + Transformer module and uses prior chemical maps to obtain a more focused attention landscape compared to that obtained using existing approaches. Experimental results show that TransFoxMol achieves state-of-the-art performance on MoleculeNet benchmarks and surpasses the performance of baselines that use self-supervised learning or geometry-enhanced strategies on small-scale datasets. Subsequent analyses indicate that TransFoxMol's predictions are highly interpretable and the clever use of chemical knowledge enables AI to perceive molecules in a simple but rational way, enhancing performance.
Collapse
Affiliation(s)
- Jian Gao
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Zheyuan Shen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yufeng Xie
- School of Software Technology, Zhejiang University, Hangzhou, China
| | - Jialiang Lu
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yang Lu
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Sikang Chen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Qingyu Bian
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Yue Guo
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
| | - Liteng Shen
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Jian Wu
- School of Software Technology, Zhejiang University, Hangzhou, China
| | - Binbin Zhou
- Department of Computer Science and Computing, Zhejiang University City College, Hangzhou, China
| | - Tingjun Hou
- State Key Lab of CAD&CG, College of Pharmaceutical Sciences, Zhejiang University, Zhejiang, China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
| | - Qiaojun He
- Institute of Pharmacology & Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, PR China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
- Centre for Drug Safety Evaluation and Research of ZJU, Hangzhou, 310058, PR China
- Cancer Center of Zhejiang University, Hangzhou, China
| | - Jinxin Che
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
| | - Xiaowu Dong
- Hangzhou Institute of Innovative Medicine, Institute of Drug Discovery and Design, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou, China
- Cancer Center of Zhejiang University, Hangzhou, China
- Department of Pharmacy, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| |
Collapse
|
13
|
Yang X, Niu Z, Liu Y, Song B, Lu W, Zeng L, Zeng X. Modality-DTA: Multimodality Fusion Strategy for Drug-Target Affinity Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1200-1210. [PMID: 36083952 DOI: 10.1109/tcbb.2022.3205282] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Prediction of the drug-target affinity (DTA) plays an important role in drug discovery. Existing deep learning methods for DTA prediction typically leverage a single modality, namely simplified molecular input line entry specification (SMILES) or amino acid sequence to learn representations. SMILES or amino acid sequences can be encoded into different modalities. Multimodality data provide different kinds of information, with complementary roles for DTA prediction. We propose Modality-DTA, a novel deep learning method for DTA prediction that leverages the multimodality of drugs and targets. A group of backward propagation neural networks is applied to ensure the completeness of the reconstruction process from the latent feature representation to original multimodality data. The tag between the drug and target is used to reduce the noise information in the latent representation from multimodality data. Experiments on three benchmark datasets show that our Modality-DTA outperforms existing methods in all metrics. Modality-DTA reduces the mean square error by 15.7% and improves the area under the precisionrecall curve by 12.74% in the Davis dataset. We further find that the drug modality Morgan fingerprint and the target modality generated by one-hot-encoding play the most significant roles. To the best of our knowledge, Modality-DTA is the first method to explore multimodality for DTA prediction.
Collapse
|
14
|
Zhang XC, Wu CK, Yi JC, Zeng XX, Yang CQ, Lu AP, Hou TJ, Cao DS. Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration. RESEARCH (WASHINGTON, D.C.) 2022; 2022:0004. [PMID: 39285949 PMCID: PMC11404312 DOI: 10.34133/research.0004] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Accepted: 10/19/2022] [Indexed: 09/19/2024]
Abstract
Accurate prediction of pharmacological properties of small molecules is becoming increasingly important in drug discovery. Traditional feature-engineering approaches heavily rely on handcrafted descriptors and/or fingerprints, which need extensive human expert knowledge. With the rapid progress of artificial intelligence technology, data-driven deep learning methods have shown unparalleled advantages over feature-engineering-based methods. However, existing deep learning methods usually suffer from the scarcity of labeled data and the inability to share information between different tasks when applied to predicting molecular properties, thus resulting in poor generalization capability. Here, we proposed a novel multitask learning BERT (Bidirectional Encoder Representations from Transformer) framework, named MTL-BERT, which leverages large-scale pre-training, multitask learning, and SMILES (simplified molecular input line entry specification) enumeration to alleviate the data scarcity problem. MTL-BERT first exploits a large amount of unlabeled data through self-supervised pretraining to mine the rich contextual information in SMILES strings and then fine-tunes the pretrained model for multiple downstream tasks simultaneously by leveraging their shared information. Meanwhile, SMILES enumeration is used as a data enhancement strategy during the pretraining, fine-tuning, and test phases to substantially increase data diversity and help to learn the key relevant patterns from complex SMILES strings. The experimental results showed that the pretrained MTL-BERT model with few additional fine-tuning can achieve much better performance than the state-of-the-art methods on most of the 60 practical molecular datasets. Additionally, the MTL-BERT model leverages attention mechanisms to focus on SMILES character features essential to target properties for model interpretability.
Collapse
Affiliation(s)
- Xiao-Chen Zhang
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
- Shangqiu Normal University, School of Information Technology, Shangqiu 476000, Henan, P. R. China
- College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China
| | - Cheng-Kun Wu
- College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China
| | - Jia-Cai Yi
- College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China
| | - Xiang-Xiang Zeng
- Department of Computer Science, Hunan University, Changsha 410082, Hunan, P. R. China
| | - Can-Qun Yang
- College of Computer, National University of Defense Technology, Changsha 410005, Hunan, P. R. China
| | - Ai-Ping Lu
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, P. R. China
| | - Ting-Jun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, P. R. China
| | - Dong-Sheng Cao
- Xiangya School of Pharmaceutical Sciences, Central South University, Changsha 410013, Hunan, P. R. China
- Institute for Advancing Translational Medicine in Bone and Joint Diseases, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong SAR 999077, P. R. China
| |
Collapse
|
15
|
Su Y, Wang M, Wang P, Zheng C, Liu Y, Zeng X. Deep learning joint models for extracting entities and relations in biomedical: a survey and comparison. Brief Bioinform 2022; 23:6686739. [PMID: 36125190 DOI: 10.1093/bib/bbac342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 07/20/2022] [Accepted: 07/25/2022] [Indexed: 12/14/2022] Open
Abstract
The rapid development of biomedicine has produced a large number of biomedical written materials. These unstructured text data create serious challenges for biomedical researchers to find information. Biomedical named entity recognition (BioNER) and biomedical relation extraction (BioRE) are the two most fundamental tasks of biomedical text mining. Accurately and efficiently identifying entities and extracting relations have become very important. Methods that perform two tasks separately are called pipeline models, and they have shortcomings such as insufficient interaction, low extraction quality and easy redundancy. To overcome the above shortcomings, many deep learning-based joint name entity recognition and relation extraction models have been proposed, and they have achieved advanced performance. This paper comprehensively summarize deep learning models for joint name entity recognition and relation extraction for biomedicine. The joint BioNER and BioRE models are discussed in the light of the challenges existing in the BioNER and BioRE tasks. Five joint BioNER and BioRE models and one pipeline model are selected for comparative experiments on four biomedical public datasets, and the experimental results are analyzed. Finally, we discuss the opportunities for future development of deep learning-based joint BioNER and BioRE models.
Collapse
Affiliation(s)
- Yansen Su
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Minglu Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Pengpeng Wang
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Computer Science and Technology, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Chunhou Zheng
- Information Materials and Intelligent Sensing Laboratory of Anhui Province, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, Economic and Technological Development Zone, 230601, Hefei, China
| | - Yuansheng Liu
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
| | - Xiangxiang Zeng
- College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
| |
Collapse
|
16
|
Zeng L, Liu Y, Yu ZG, Liu Y. iEnhancer-DLRA: identification of enhancers and their strengths by a self-attention fusion strategy for local and global features. Brief Funct Genomics 2022; 21:399-407. [PMID: 35942693 DOI: 10.1093/bfgp/elac023] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 06/30/2022] [Accepted: 07/12/2022] [Indexed: 11/14/2022] Open
Abstract
Identification and classification of enhancers are highly significant because they play crucial roles in controlling gene transcription. Recently, several deep learning-based methods for identifying enhancers and their strengths have been developed. However, existing methods are usually limited because they use only local or only global features. The combination of local and global features is critical to further improve the prediction performance. In this work, we propose a novel deep learning-based method, called iEnhancer-DLRA, to identify enhancers and their strengths. iEnhancer-DLRA extracts local and multi-scale global features of sequences by using a residual convolutional network and two bidirectional long short-term memory networks. Then, a self-attention fusion strategy is proposed to deeply integrate these local and global features. The experimental results on the independent test dataset indicate that iEnhancer-DLRA performs better than nine existing state-of-the-art methods in both identification and classification of enhancers in almost all metrics. iEnhancer-DLRA achieves 13.8% (for identifying enhancers) and 12.6% (for classifying strengths) improvement in accuracy compared with the best existing state-of-the-art method. This is the first time that the accuracy of an enhancer identifier exceeds 0.9 and the accuracy of the enhancer classifier exceeds 0.8 on the independent test set. Moreover, iEnhancer-DLRA achieves superior predictive performance on the rice dataset compared with the state-of-the-art method RiceENN.
Collapse
Affiliation(s)
- Li Zeng
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Yang Liu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
| |
Collapse
|
17
|
Zhang W, Hou J, Liu B. iPiDA-LTR: Identifying piwi-interacting RNA-disease associations based on Learning to Rank. PLoS Comput Biol 2022; 18:e1010404. [PMID: 35969645 PMCID: PMC9410559 DOI: 10.1371/journal.pcbi.1010404] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2022] [Revised: 08/25/2022] [Accepted: 07/18/2022] [Indexed: 12/01/2022] Open
Abstract
Piwi-interacting RNAs (piRNAs) are regarded as drug targets and biomarkers for the diagnosis and therapy of diseases. However, biological experiments cost substantial time and resources, and the existing computational methods only focus on identifying missing associations between known piRNAs and diseases. With the fast development of biological experiments, more and more piRNAs are detected. Therefore, the identification of piRNA-disease associations of newly detected piRNAs has significant theoretical value and practical significance on pathogenesis of diseases. In this study, the iPiDA-LTR predictor is proposed to identify associations between piRNAs and diseases based on Learning to Rank. The iPiDA-LTR predictor not only identifies the missing associations between known piRNAs and diseases, but also detects diseases associated with newly detected piRNAs. Experimental results demonstrate that iPiDA-LTR effectively predicts piRNA-disease associations outperforming the other related methods.
Collapse
Affiliation(s)
- Wenxiang Zhang
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Jialu Hou
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
18
|
Zhao L, Zhu Y, Wang J, Wen N, Wang C, Cheng L. A brief review of protein-ligand interaction prediction. Comput Struct Biotechnol J 2022; 20:2831-2838. [PMID: 35765652 PMCID: PMC9189993 DOI: 10.1016/j.csbj.2022.06.004] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2022] [Revised: 05/30/2022] [Accepted: 06/01/2022] [Indexed: 01/21/2023] Open
Abstract
The task of identifying protein–ligand interactions (PLIs) plays a prominent role in the field of drug discovery. However, it is infeasible to identify potential PLIs via costly and laborious in vitro experiments. There is a need to develop PLI computational prediction approaches to speed up the drug discovery process. In this review, we summarize a brief introduction to various computation-based PLIs. We discuss these approaches, in particular, machine learning-based methods, with illustrations of different emphases based on mainstream trends. Moreover, we analyzed three research dynamics that can be further explored in future studies.
Collapse
Affiliation(s)
- Lingling Zhao
- Faculty of Computing, Harbin Institute of Technology, Harbin, China
| | - Yan Zhu
- Faculty of Computing, Harbin Institute of Technology, Harbin, China
| | - Junjie Wang
- Department of Medical Informatics, School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing, China
| | - Naifeng Wen
- School of Mechanical and Electrical Engineering, Dalian Minzu University, Dalian, China
| | - Chunyu Wang
- Faculty of Computing, Harbin Institute of Technology, Harbin, China
- Corresponding authors.
| | - Liang Cheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- NHC and CAMS Key Laboratory of Molecular Probe and Targeted Theranostics, Harbin Medical University, Harbin, China
- Corresponding authors.
| |
Collapse
|
19
|
Jiao S, Chen Z, Zhang L, Zhou X, Shi L. ATGPred-FL: sequence-based prediction of autophagy proteins with feature representation learning. Amino Acids 2022; 54:799-809. [PMID: 35286461 DOI: 10.1007/s00726-022-03145-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2021] [Accepted: 01/28/2022] [Indexed: 11/26/2022]
Abstract
Autophagy plays an important role in biological evolution and is regulated by many autophagy proteins. Accurate identification of autophagy proteins is crucially important to reveal their biological functions. Due to the expense and labor cost of experimental methods, it is urgent to develop automated, accurate and reliable sequence-based computational tools to enable the identification of novel autophagy proteins among numerous proteins and peptides. For this purpose, a new predictor named ATGPred-FL was proposed for the efficient identification of autophagy proteins. We investigated various sequence-based feature descriptors and adopted the feature learning method to generate corresponding, more informative probability features. Then, a two-step feature selection strategy based on accuracy was utilized to remove irrelevant and redundant features, leading to the most discriminative 14-dimensional feature set. The final predictor was built using a support vector machine classifier, which performed favorably on both the training and testing sets with accuracy values of 94.40% and 90.50%, respectively. ATGPred-FL is the first ATG machine learning predictor based on protein primary sequences. We envision that ATGPred-FL will be an effective and useful tool for autophagy protein identification, and it is available for free at http://lab.malab.cn/~acy/ATGPred-FL , the source code and datasets are accessible at https://github.com/jiaoshihu/ATGPred .
Collapse
Affiliation(s)
- Shihu Jiao
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Zheng Chen
- School of Applied Chemistry and Biological Technology, Shenzhen Polytechnic, 7098 Liuxian Street, Shenzhen, 518055, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, No.4 Block 2 North Jianshe Road, Chengdu, 61005, China
| | - Lichao Zhang
- School of Intelligent Manufacturing and Equipment, Shenzhen Institute of Information Technology, Shenzhen, 518172, China
| | - Xun Zhou
- Beidahuang Industry Group General Hospital, Harbin, 150001, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, No 415, Fengyang Road, Huangpu District, Shanghai, 210000, China.
| |
Collapse
|
20
|
Wan H, Zhang J, Ding Y, Wang H, Tian G. Immunoglobulin Classification Based on FC* and GC* Features. Front Genet 2022; 12:827161. [PMID: 35140745 PMCID: PMC8819591 DOI: 10.3389/fgene.2021.827161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2021] [Accepted: 12/22/2021] [Indexed: 11/13/2022] Open
Abstract
Immunoglobulins have a pivotal role in disease regulation. Therefore, it is vital to accurately identify immunoglobulins to develop new drugs and research related diseases. Compared with utilizing high-dimension features to identify immunoglobulins, this research aimed to examine a method to classify immunoglobulins and non-immunoglobulins using two features, FC* and GC*. Classification of 228 samples (109 immunoglobulin samples and 119 non-immunoglobulin samples) revealed that the overall accuracy was 80.7% in 10-fold cross-validation using the J48 classifier implemented in Weka software. The FC* feature identified in this study was found in the immunoglobulin subtype domain, which demonstrated that this extracted feature could represent functional and structural properties of immunoglobulins for forecasting.
Collapse
Affiliation(s)
- Hao Wan
- Institute of Advanced Cross-field Science, College of Life Science, Qingdao University, Qingdao, China
| | - Jina Zhang
- Geneis (Beijing) Co., Ltd., Beijing, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| | - Hetian Wang
- Beidahuang Industry Group General Hospital, Harbin, China
- *Correspondence: Hetian Wang, ; Geng Tian,
| | - Geng Tian
- Geneis (Beijing) Co., Ltd., Beijing, China
- *Correspondence: Hetian Wang, ; Geng Tian,
| |
Collapse
|
21
|
Lin C, Wang L, Shi L. AAPred-CNN: accurate predictor based on deep convolution neural network for identification of anti-angiogenic peptides. Methods 2022; 204:442-448. [PMID: 35031486 DOI: 10.1016/j.ymeth.2022.01.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2021] [Revised: 12/28/2021] [Accepted: 01/09/2022] [Indexed: 12/13/2022] Open
Abstract
Recently, deep learning techniques have been developed for various bioactive peptide prediction tasks. However, there are only conventional machine learning-based methods for the prediction of anti-angiogenic peptides (AAP), which play an important role in cancer treatment. The main reason why no deep learning method has been involved in this field is that there are too few experimentally validated AAPs to support the training of deep models but researchers have believed that deep learning seriously depends on the amounts of labeled data. In this paper, as a tentative work, we try to predict AAP by constructing different classical deep learning models and propose the first deep convolution neural network-based predictor (AAPred-CNN) for AAP. Contrary to intuition, the experimental results show that deep learning models can achieve superior or comparable performance to the state-of-the-art model, although they are given a few labeled sequences to train. We also decipher the influence of hyper-parameters and training samples on the performance of deep learning models to help understand how the model work. Furthermore, we also visualize the learned embeddings by dimension reduction to increase the model interpretability and reveal the residue propensity of AAP through the statistics of convolutional features for different residues. In summary, this work demonstrates the powerful representation ability of AAPred-CNNfor AAP prediction, further improving the prediction accuracy of AAP.
Collapse
Affiliation(s)
- Changhang Lin
- School of Big Data and Artificial Intelligence, Fujian Polytechnic Normal University, Fuzhou, China
| | - Lei Wang
- Beidahuang Industry Group General Hospital, Harbin, China.
| | - Lei Shi
- Department of Spine Surgery, Changzheng Hospital, Naval Medical University, Shanghai, China.
| |
Collapse
|
22
|
Song B, Luo X, Luo X, Liu Y, Niu Z, Zeng X. Learning spatial structures of proteins improves protein-protein interaction prediction. Brief Bioinform 2022; 23:6501351. [PMID: 35018418 DOI: 10.1093/bib/bbab558] [Citation(s) in RCA: 56] [Impact Index Per Article: 18.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2021] [Revised: 12/07/2021] [Accepted: 12/07/2021] [Indexed: 01/09/2023] Open
Abstract
Spatial structures of proteins are closely related to protein functions. Integrating protein structures improves the performance of protein-protein interaction (PPI) prediction. However, the limited quantity of known protein structures restricts the application of structure-based prediction methods. Utilizing the predicted protein structure information is a promising method to improve the performance of sequence-based prediction methods. We propose a novel end-to-end framework, TAGPPI, to predict PPIs using protein sequence alone. TAGPPI extracts multi-dimensional features by employing 1D convolution operation on protein sequences and graph learning method on contact maps constructed from AlphaFold. A contact map contains abundant spatial structure information, which is difficult to obtain from 1D sequence data directly. We further demonstrate that the spatial information learned from contact maps improves the ability of TAGPPI in PPI prediction tasks. We compare the performance of TAGPPI with those of nine state-of-the-art sequence-based methods, and TAGPPI outperforms such methods in all metrics. To the best of our knowledge, this is the first method to use the predicted protein topology structure graph for sequence-based PPI prediction. More importantly, our proposed architecture could be extended to other prediction tasks related to proteins.
Collapse
Affiliation(s)
- Bosheng Song
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
| | - Xiaoyan Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China.,MindRank AI ltd., Hangzhou, 311113, Zhejiang, China
| | - Xiaoli Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China.,BioMap, Haidian, 100089, Beijing, China
| | - Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
| | | | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410012, Hunan, China
| |
Collapse
|