1
Dražić E, Jelušić D, Janković Bevandić P, Mauša G, Kalafatovic D. Using Machine Learning to Fast-Track Peptide Nanomaterial Discovery. ACS Nano 2025; 19:20295-20320. [PMID: 40440125 DOI: 10.1021/acsnano.5c00670] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2025]
Abstract
Peptides can serve as building blocks for supramolecular materials because of their unique ability to self-assemble, offering potential applications in drug delivery, tissue engineering, and nanotechnology. In this review, we describe peptide self-assembly as a sequence- and context-dependent process and its resulting complexity due to the heterogeneity of the sequences and experimental conditions, which makes cross-laboratory reproducibility a serious challenge and standardized reporting a necessity. Given the large number of possible peptide permutations, machine learning (ML) is suitable for navigating the peptide search space with the aim of reducing trial-and-error experimentation and speeding up the discovery of self-assembling peptides. However, we point out that ML is not a point-and-shoot tool that can be applied directly to any problem and requires careful consideration, domain knowledge, and proper data preparation to achieve meaningful results. In addition, we discuss the lack of negative data reported to be the main limiting factor in the effective application of ML. Considering the transformative potential of artificial intelligence, we conclude that grasping the power of large language models and generative approaches, coupled with explainability techniques, will expedite peptide nanomaterials discovery.
Affiliation(s)
- Ena Dražić
- University of Rijeka, Center for Artificial Intelligence and Cybersecurity, 51000 Rijeka, Croatia
- Darijan Jelušić
- University of Rijeka, Center for Artificial Intelligence and Cybersecurity, 51000 Rijeka, Croatia
- University of Rijeka, Faculty of Engineering, 51000 Rijeka, Croatia
- Goran Mauša
- University of Rijeka, Center for Artificial Intelligence and Cybersecurity, 51000 Rijeka, Croatia
- University of Rijeka, Faculty of Engineering, 51000 Rijeka, Croatia
- Daniela Kalafatovic
- University of Rijeka, Center for Artificial Intelligence and Cybersecurity, 51000 Rijeka, Croatia
- University of Rijeka, Faculty of Engineering, 51000 Rijeka, Croatia
2
Wei Y, Tan Z, Liu L. CR-deal: Explainable Neural Network for circRNA-RBP Binding Site Recognition and Interpretation. Interdiscip Sci 2025; 17:463-476. [PMID: 40146403 DOI: 10.1007/s12539-025-00694-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 02/01/2025] [Accepted: 02/06/2025] [Indexed: 03/28/2025]
Abstract
circRNAs are single-stranded non-coding RNA molecules distinguished by their closed circular structure. The interaction between circRNAs and RNA-binding proteins (RBPs) plays a key role in biological functions and is crucial for studying post-transcriptional regulatory mechanisms. The genome-wide circRNA binding event data obtained by cross-linking immunoprecipitation sequencing technology provide a foundation for constructing efficient computational prediction methods. Although machine learning techniques have been applied to predict circRNA-RBP interaction sites, existing methods still have room for improvement in accuracy and lack interpretability. We propose CR-deal, an interpretable joint deep learning network that predicts circRNA-RBP binding sites from genome-wide circRNA data. CR-deal utilizes a graph attention network to unify sequence and structural features into the same view, making more effective use of structural features to improve accuracy. It can infer marker genes in the binding site through integrated gradient feature interpretation, thereby inferring functional structural regions in the binding site. We benchmarked CR-deal on 37 circRNA datasets and 7 lncRNA datasets, and used 5 circRNA datasets to assess its interpretability and discover functional structural regions. We believe that CR-deal can help researchers gain a deeper understanding of the functions and mechanisms of circRNA in living organisms and its critical role in the occurrence and development of diseases. The source code of CR-deal is freely available at https://github.com/liuliwei1980/CR.
Affiliation(s)
- Yuxiao Wei
- College of Software, Dalian Jiaotong University, Dalian, 116028, China
- Zhebin Tan
- College of Software, Dalian Jiaotong University, Dalian, 116028, China
- Liwei Liu
- College of Science, Dalian Jiaotong University, Dalian, 116028, China.
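The CR-deal abstract above mentions integrated gradient feature interpretation. As a rough illustration of that general attribution technique (not CR-deal's own code), the sketch below approximates integrated gradients with a midpoint Riemann sum. The toy model `toy_model`, its gradient `toy_grad`, the all-zero baseline, and the step count are all assumptions made only for this example.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate integrated gradients with a midpoint Riemann sum."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            totals[i] += g
    # Average gradient along the path, scaled by (input - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


# Hypothetical toy model, unrelated to CR-deal: F(x) = x0 * x1 + 3 * x2.
def toy_model(x: Sequence[float]) -> float:
    return x[0] * x[1] + 3.0 * x[2]


def toy_grad(x: Sequence[float]) -> List[float]:
    # Analytic gradient of toy_model.
    return [x[1], x[0], 3.0]


x = [2.0, 4.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

A useful sanity check for any such implementation is the completeness property: the attributions should sum to approximately `toy_model(x) - toy_model(baseline)`.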
3
Chatterjee A, Ravandi B, Haddadi P, Philip NH, Abdelmessih M, Mowrey WR, Ricchiuto P, Liang Y, Ding W, Mobarec JC, Eliassi-Rad T. Topology-driven negative sampling enhances generalizability in protein-protein interaction prediction. Bioinformatics 2025; 41:btaf148. [PMID: 40193392 PMCID: PMC12080959 DOI: 10.1093/bioinformatics/btaf148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Revised: 03/03/2025] [Accepted: 04/04/2025] [Indexed: 04/09/2025] Open
Abstract
MOTIVATION Unraveling the human interactome to uncover disease-specific patterns and discover drug targets hinges on accurate protein-protein interaction (PPI) predictions. However, challenges persist in machine learning (ML) models due to a scarcity of quality hard negative samples, shortcut learning, and limited generalizability to novel proteins. RESULTS In this study, we introduce a novel approach for strategic sampling of protein-protein noninteractions (PPNIs) by leveraging higher-order network characteristics that capture the inherent complementarity-driven mechanisms of PPIs. Next, we introduce Unsupervised Pre-training of Node Attributes tuned for PPI (UPNA-PPI), a high throughput sequence-to-function ML pipeline, integrating unsupervised pre-training in protein representation learning with Topological PPNI (TPPNI) samples, capable of efficiently screening billions of interactions. By using our TPPNI in training the UPNA-PPI model, we improve PPI prediction generalizability and interpretability, particularly in identifying potential binding sites locations on amino acid sequences, strengthening the prioritization of screening assays and facilitating the transferability of ML predictions across protein families and homodimers. UPNA-PPI establishes the foundation for a fundamental negative sampling methodology in graph machine learning by integrating insights from network topology. AVAILABILITY AND IMPLEMENTATION Code and UPNA-PPI predictions are freely available at https://github.com/alxndgb/UPNA-PPI.
Affiliation(s)
- Ayan Chatterjee
- BioClarity AI, Boston, MA 02130, United States
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Babak Ravandi
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Department of Physics, Northeastern University, Boston, MA 02115, United States
- Parham Haddadi
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Naomi H Philip
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Mario Abdelmessih
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- William R Mowrey
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Piero Ricchiuto
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Yupu Liang
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Wei Ding
- Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Juan Carlos Mobarec
- Protein Structure and Biophysics, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Tina Eliassi-Rad
- Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, United States
- Santa Fe Institute, Santa Fe, NM 87501, United States
4
Feng M, Liu L, Xian ZN, Wei X, Li K, Yan W, Lu Q, Shi Y, He G. PSTP: accurate residue-level phase separation prediction using protein conformational and language model embeddings. Brief Bioinform 2025; 26:bbaf171. [PMID: 40315433 PMCID: PMC12047702 DOI: 10.1093/bib/bbaf171] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2025] [Revised: 03/07/2025] [Accepted: 03/19/2025] [Indexed: 05/04/2025] Open
Abstract
Phase separation (PS) is essential in cellular processes and disease mechanisms, highlighting the need for predictive algorithms to analyze uncharacterized sequences and accelerate experimental validation. Current high-accuracy methods often rely on extensive annotations or handcrafted features, limiting their generalizability to sequences lacking such annotations and making it difficult to identify key protein regions involved in PS. We introduce Phase Separation's Transfer-learning Prediction (PSTP), which combines conformational embeddings with large language model embeddings, enabling state-of-the-art PS predictions from protein sequences alone. PSTP performs well across various prediction scenarios and shows potential for predicting novel-designed artificial proteins. Additionally, PSTP provides residue-level predictions that are highly correlated with experimentally validated PS regions. By analyzing 160 000+ variants, PSTP characterizes the strong link between the incidence of pathogenic variants and residue-level PS propensities in unconserved intrinsically disordered regions, offering insights into underexplored mutation effects. PSTP's sliding-window optimization reduces its memory usage to a few hundred megabytes, facilitating rapid execution on typical CPUs and GPUs. Offered via both a web server and an installable Python package, PSTP provides a versatile tool for decoding protein PS behavior and supporting disease-focused research.
Affiliation(s)
- Mofan Feng
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
- Liangjie Liu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
- Zhuo-Ning Xian
- School of Environmental Science & Engineering, Shanghai Jiao Tong University, No. 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Xiaoxi Wei
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Keyi Li
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
- Wenqian Yan
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
- Qing Lu
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Yi Shi
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
- Guang He
- Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, No. 1954 Huashan Road, Xuhui District, Shanghai 200030, China
- Shanghai Institute of Medical Genetics, Shanghai Children’s Hospital, Shanghai Jiao Tong University School of Medicine, No. 24 Lane 1400 West Beijing Road, Jing’an District, Shanghai 200040, China
5
Kim S, Kim MA, Kim B, Lee J, Jung SK, Kim J, Chung HY, Lee CY, Jeong S. Machine learning assessment of zoonotic potential in avian influenza viruses using PB2 segment. BMC Genomics 2025; 26:395. [PMID: 40269678 PMCID: PMC12020041 DOI: 10.1186/s12864-025-11589-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2025] [Accepted: 04/09/2025] [Indexed: 04/25/2025] Open
Abstract
BACKGROUND Influenza A virus (IAV) is a major global health threat, causing seasonal epidemics and occasional pandemics. Particularly, Influenza A viruses from avian species pose significant zoonotic threats, with PB2 adaptation serving as a critical first step in cross-species transmission. A comprehensive risk assessment framework based on PB2 sequences is necessary, which should encompass detailed analyses of specific residues and mutations while maintaining sufficient generality for application to non-PB2 segments. RESULTS In this study, we developed two complementary approaches: a regression-based model for accurately distinguishing among risk groups, and a SHAP-based risk assessment model for more meaningful risk analyses. For the regression-based risk models, we compared various methodologies, including tree ensemble methods, conventional regression models, and deep learning architectures. The optimized regression model, combined with SHAP value analysis, identified and ranked individual residues contributing to zoonotic potential. The SHAP-based risk model enabled intra-class analyses within the zoonotic risk assessment framework and quantified risk yields from specific mutations. CONCLUSION Experimental analyses demonstrated that the Random Forest regression model outperformed other models in most cases, and we validated the target value settings for risk regression through ablation studies. Our SHAP-based analysis identified key residues (271A, 627K, 591R, 588A, 292I, 684S, 684A, 81M, 199S, and 368Q) and mutations (T271A, Q368R/K, E627K, Q591R, A588T/I/V, and I292V/T) critical for zoonotic risk assessment. Using the SHAP-based risk assessment model, we found that influenza A viruses from Phasianidae showed elevated zoonotic risk scores compared to those from other avian species. Additionally, mutations I292V/T, Q368R, A588T/I, V598A/I/T, and E/V627K were identified as significant mutations in the Phasianidae. 
These PB2-focused quantitative methods provide a robust and generalizable framework for both rapid screening of avians' zoonotic potential and analytical quantification of risks associated with specific residues or mutations.
Affiliation(s)
- Sangwook Kim
- Bio-medical Research Institute, Kyungpook National University Hospital, Daegu, South Korea
- Min-Ah Kim
- Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Bitgoeul Kim
- Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Jisu Lee
- Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Se-Kyung Jung
- Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea
- Jonghong Kim
- Department of Neurology, Keimyung University Dongsan Medical Center, Daegu, South Korea
- Ho-Young Chung
- Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, South Korea
- Chung-Young Lee
- Department of Microbiology, School of Medicine, Kyungpook National University, Daegu, South Korea.
- Untreatable Infectious Disease Institute, Kyungpook National University, Daegu, South Korea.
- Sungmoon Jeong
- Department of Medical Informatics, School of Medicine, Kyungpook National University, Daegu, South Korea.
- Research Center for Artificial Intelligence in Medicine, Kyungpook National University Hospital, Daegu, South Korea.
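The PB2 abstract above relies on SHAP values to rank residues. SHAP builds on Shapley values, which for a handful of features can be computed exactly by averaging each feature's marginal contribution over all orderings. The sketch below shows that exact computation, swapped in for the approximations a SHAP library would normally use. The feature names (`res_A`, `res_B`, `res_C`) and the scores in `toy_risk` are hypothetical and are not taken from the paper.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    features: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating every feature ordering."""
    totals = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition: FrozenSet[str] = frozenset()
        previous = value_fn(coalition)
        for f in order:
            coalition = coalition | {f}
            current = value_fn(coalition)
            totals[f] += current - previous  # marginal contribution of f
            previous = current
    n_orders = factorial(len(features))
    return {f: total / n_orders for f, total in totals.items()}


# Hypothetical toy risk score: two additive effects plus one interaction.
def toy_risk(present: FrozenSet[str]) -> float:
    score = 0.0
    if "res_A" in present:
        score += 1.0
    if "res_B" in present:
        score += 2.0
    if "res_A" in present and "res_C" in present:
        score += 4.0  # interaction term shared between res_A and res_C
    return score


phi = shapley_values(["res_A", "res_B", "res_C"], toy_risk)
```

Enumerating orderings scales factorially with the number of features, so this is only practical for toy cases; the values always sum to the difference between the full-coalition score and the empty-coalition score.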
6
Kallergis G, Asgari E, Empting M, Hirsch AKH, Klawonn F, McHardy AC. Domain adaptable language modeling of chemical compounds identifies potent pathoblockers for Pseudomonas aeruginosa. Commun Chem 2025; 8:114. [PMID: 40216964 PMCID: PMC11992043 DOI: 10.1038/s42004-025-01484-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 03/05/2025] [Indexed: 04/14/2025] Open
Abstract
Computational techniques for predicting molecular properties are emerging as key components for streamlining drug development, optimizing time and financial investments. Here, we introduce ChemLM, a transformer language model for this task. ChemLM leverages self-supervised domain adaptation on chemical molecules to enhance its predictive performance. Within the framework of ChemLM, chemical compounds are conceptualized as sentences composed of distinct chemical 'words', which are employed for training a specialized chemical language model. On the standard benchmark datasets, ChemLM either matched or surpassed the performance of current state-of-the-art methods. Furthermore, we evaluated the effectiveness of ChemLM in identifying highly potent pathoblockers targeting Pseudomonas aeruginosa (PA), a pathogen that has shown an increased prevalence of multidrug-resistant strains and has been identified as a critical priority for the development of new medications. ChemLM demonstrated substantially higher accuracy in identifying highly potent pathoblockers against PA when compared to state-of-the-art approaches. An intrinsic evaluation demonstrated the consistency of the chemical language model's representation concerning chemical properties. The results from benchmarking, experimental data and intrinsic analysis of the ChemLM space confirm the wide applicability of ChemLM for enhancing molecular property prediction within the chemical domain.
Affiliation(s)
- Georgios Kallergis
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Ehsannedin Asgari
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Qatar Computing Research Institute (QCRI), Doha, Qatar
- Martin Empting
- Antiviral & Antivirulence Drugs (AVID), Helmholtz-Institute for Pharmaceutical Research Saarland (HIPS)-Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Deutsches Zentrum für Infektionsforschung (DZIF), Hannover-Braunschweig, Germany
- Department of Pharmacy, Saarland University, Campus E8.1, 66123, Saarbrücken, Germany
- Anna K H Hirsch
- Deutsches Zentrum für Infektionsforschung (DZIF), Hannover-Braunschweig, Germany
- Department of Pharmacy, Saarland University, Campus E8.1, 66123, Saarbrücken, Germany
- Department of Drug Design and Optimization (DDOP), Helmholtz-Institute for Pharmaceutical Research Saarland (HIPS)-Helmholtz Centre for Infection Research (HZI), Saarbrücken, Germany
- Frank Klawonn
- Biostatistics Research Group, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Institute for Information Engineering, Ostfalia University of Applied Sciences, 38302, Wolfenbüttel, Germany
- Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.
- Deutsches Zentrum für Infektionsforschung (DZIF), Hannover-Braunschweig, Germany.
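The ChemLM abstract above describes compounds as sentences made of chemical "words". The abstract does not say how ChemLM segments molecules, so the following is only a common, simple way to split a SMILES string into tokens with a regular expression. The pattern and the function name `tokenize_smiles` are assumptions made for this sketch.

```python
import re
from typing import List

# Order matters: bracket atoms first, then two-letter halogens,
# then two-digit ring-closure labels (%NN), then any single character.
_TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level and symbol-level tokens."""
    return _TOKEN_PATTERN.findall(smiles)


acetic_acid_tokens = tokenize_smiles("CC(=O)O")
halide_tokens = tokenize_smiles("ClCC[NH3+]")
```

Because every character is covered by exactly one alternative, joining the tokens reproduces the input string, which makes a convenient round-trip check before feeding tokens to a language model vocabulary.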
7
Wang Y, Wang C. PLM-ATG: Identification of Autophagy Proteins by Integrating Protein Language Model Embeddings with PSSM-Based Features. Molecules 2025; 30:1704. [PMID: 40333592 PMCID: PMC12029579 DOI: 10.3390/molecules30081704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2025] [Revised: 04/05/2025] [Accepted: 04/06/2025] [Indexed: 05/09/2025] Open
Abstract
Autophagy critically regulates cellular development while maintaining pathophysiological homeostasis. Since the autophagic process is tightly regulated by the coordination of autophagy-related proteins (ATGs), precise identification of these proteins is essential. Although current computational approaches have addressed experimental recognition's costly and time-consuming challenges, they still have room for improvement since handcrafted features inadequately capture the intricate patterns and relationships hidden in sequences. In this study, we propose PLM-ATG, a novel computational model that integrates support vector machines with the fusion of protein language model (PLM) embeddings and position-specific scoring matrix (PSSM)-based features for the ATG identification. First, we extracted sequence-based features and PSSM-based features as the inputs of six classifiers to establish baseline models. Among these, the combination of the SVM classifier and the AADP-PSSM feature set achieved the best prediction accuracy. Second, two popular PLM embeddings, i.e., ESM-2 and ProtT5, were fused with the AADP-PSSM features to further improve the prediction of ATGs. Third, we selected the optimal feature subset from the combination of the ESM-2 embeddings and AADP-PSSM features to train the final SVM model. The proposed PLM-ATG achieved an accuracy of 99.5% and an MCC of 0.990, which are nearly 5% and 0.1 higher than those of the state-of-the-art model EnsembleDL-ATG, respectively.
Affiliation(s)
- Chunhua Wang
- College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
8
D'Hondt S, Oramas J, De Winter H. A beginner's approach to deep learning applied to VS and MD techniques. J Cheminform 2025; 17:47. [PMID: 40200329 PMCID: PMC11980327 DOI: 10.1186/s13321-025-00985-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2024] [Accepted: 03/12/2025] [Indexed: 04/10/2025] Open
Abstract
It has become impossible to imagine the fields of biochemistry and medicinal chemistry without computational chemistry and molecular modelling techniques. In many steps of the drug development process in silico methods have become indispensable. Virtual screening (VS) can tremendously expedite the early discovery phase, whilst the use of molecular dynamics (MD) simulations forms a powerful additional tool to in vitro methods throughout the entire drug discovery process. In the field of biochemistry, MD has also become a compelling method for studying biophysical systems (e.g., protein folding) complementary to experimental techniques. However, both VS and MD come with their own limitations and methodological difficulties, from hardware limitations to restrictions in algorithmic capabilities. One solution to overcoming these difficulties lies in the field of machine learning (ML), and more specifically deep learning (DL). There are many ways in which DL can be applied to these molecular modelling techniques to achieve more accurate results in a more efficient manner or expedite the data analysis of the acquired results. Despite steadily increasing interest in DL amidst computational chemists, knowledge is still limited and scattered over different resources. This review is aimed at computational chemists with knowledge of molecular modelling, who wish to possibly integrate DL approaches in their research and already have a basic understanding of the fundamentals of DL. This review focusses on a survey of recent applications of DL in molecular modelling techniques. The different sections are logically subdivided, based on where DL is integrated in the research: (1) for the improvement of VS workflows, (2) for the improvement of certain workflows in MD simulations, (3) for aiding in the calculations of interatomic forces, or (4) for data analysis of MD trajectories. 
It will become clear that DL has the capacity to completely transform the way molecular modelling is carried out.
Affiliation(s)
- Stijn D'Hondt
- Laboratory of Medicinal Chemistry, Department of Pharmaceutical Sciences, IDLab, University of Antwerp, Universiteitsplein 1, 2610, Wilrijk, Belgium
- José Oramas
- Department of Computer Science, Sint-Pietersvliet 7, 2000, Antwerp, Belgium
- Hans De Winter
- Laboratory of Medicinal Chemistry, Department of Pharmaceutical Sciences, IDLab, University of Antwerp, Universiteitsplein 1, 2610, Wilrijk, Belgium.
9
Heinzinger M, Rost B. Teaching AI to speak protein. Curr Opin Struct Biol 2025; 91:102986. [PMID: 39985945 DOI: 10.1016/j.sbi.2025.102986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2024] [Revised: 12/30/2024] [Accepted: 01/02/2025] [Indexed: 02/24/2025]
Abstract
Large Language Models for proteins, namely protein Language Models (pLMs), have begun to provide an important alternative to capturing the information encoded in a protein sequence in computers. Arguably, pLMs have advanced importantly to understanding aspects of the language of life as written in proteins, and through this understanding, they are becoming an increasingly powerful means of advancing protein prediction, e.g., in the prediction of molecular function as expressed by identifying binding residues or variant effects. While benefitting from the same technology, protein structure prediction remains one of the few applications for which only using pLM embeddings from single sequences appears not to improve over or match the state-of-the-art. Fine-tuning foundation pLMs enhances efficiency and accuracy of solutions, in particular in cases with few experimental annotations. pLMs facilitate the integration of computational and experimental biology, of AI and wet-lab, in particular toward a new era of protein design.
Affiliation(s)
- Michael Heinzinger
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching, Munich, Germany.
- Burkhard Rost
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, 85748 Garching, Munich, Germany; Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748 Garching, Munich, Germany; TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
10
Chaturvedi M, Rashid MA, Paliwal KK. RNA structure prediction using deep learning - A comprehensive review. Comput Biol Med 2025; 188:109845. [PMID: 39983363 DOI: 10.1016/j.compbiomed.2025.109845] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/25/2024] [Revised: 02/09/2025] [Accepted: 02/10/2025] [Indexed: 02/23/2025]
Abstract
In computational biology, accurate RNA structure prediction offers several benefits, including facilitating a better understanding of RNA functions and RNA-based drug design. Implementing deep learning techniques for RNA structure prediction has led to tremendous progress in this field, resulting in significant improvements in prediction accuracy. This comprehensive review aims to provide an overview of the diverse strategies employed in predicting RNA secondary structures, emphasizing deep learning methods. The article categorizes the discussion into three main dimensions: feature extraction methods, existing state-of-the-art learning model architectures, and prediction approaches. We present a comparative analysis of various techniques and models highlighting their strengths and weaknesses. Finally, we identify gaps in the literature, discuss current challenges, and suggest future approaches to enhance model performance and applicability in RNA structure prediction tasks. This review provides a deeper insight into the subject and paves the way for further progress in this dynamic intersection of life sciences and artificial intelligence.
Affiliation(s)
- Mayank Chaturvedi
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.
- Mahmood A Rashid
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.
- Kuldip K Paliwal
- Signal Processing Laboratory, School of Engineering and Built Environment, Griffith University, Brisbane, QLD, 4111, Australia.
11
Bjerregaard A, Groth PM, Hauberg S, Krogh A, Boomsma W. Foundation models of protein sequences: A brief overview. Curr Opin Struct Biol 2025; 91:103004. [PMID: 39983412 DOI: 10.1016/j.sbi.2025.103004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Revised: 01/24/2025] [Accepted: 01/26/2025] [Indexed: 02/23/2025]
Abstract
Protein sequence models have evolved from simple statistics of aligned families to versatile foundation models of evolutionary scale. Enabled by self-supervised learning and an abundance of protein sequence data, such foundation models now play a central role in protein science. They facilitate rich representations, powerful generative design, and fine-tuning across diverse domains. In this review, we trace modeling developments and categorize them into methodological trends over the modalities they describe and the contexts they condition upon. Following a brief historical overview, we focus our attention on the most recent trends and outline future perspectives.
Affiliation(s)
- Andreas Bjerregaard
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Novonesis, Kgs. Lyngby, Denmark
- Søren Hauberg
- Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Anders Krogh
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark.
12
Abbas Z, Kim S, Lee N, Kazmi SAW, Lee SW. A robust ensemble framework for anticancer peptide classification using multi-model voting approach. Comput Biol Med 2025; 188:109750. [PMID: 40032410 DOI: 10.1016/j.compbiomed.2025.109750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Revised: 01/14/2025] [Accepted: 01/22/2025] [Indexed: 03/05/2025]
Abstract
Anticancer peptides (ACPs) hold great potential for cancer therapeutics, yet accurately identifying them remains a challenging task due to the complexity of peptide sequences and their interactions with biological systems. In this study, we propose a novel machine learning-based framework for ACP classification, integrating multiple feature sets, including sequence composition, physicochemical properties, and embedding features derived from pre-trained language models. We evaluate the performance of various classifiers on benchmark datasets and compare our model against state-of-the-art methods. The results demonstrate that our model outperforms existing methods such as UniDL4BioPep, ACPred-Fuse, and iACP with an accuracy of 75.58%, an AUC of 0.8272, and an MCC of 0.5119. Our approach provides a more balanced sensitivity of 0.7384 and specificity of 0.773, ensuring robust identification of both ACPs and non-ACPs. These findings suggest that incorporating diverse feature sets can significantly enhance ACP classification, potentially facilitating the discovery of novel anticancer peptides for therapeutic applications.
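The accuracy, sensitivity, specificity, and MCC quoted in this abstract are all functions of the same four confusion-matrix counts. The sketch below shows how they relate; the function name `binary_metrics` and the counts are hypothetical illustrations, not the authors' code or data.

```python
import math


def _ratio(num: float, den: float) -> float:
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    accuracy = _ratio(tp + tn, tp + fp + tn + fn)
    sensitivity = _ratio(tp, tp + fn)  # fraction of true ACPs recovered
    specificity = _ratio(tn, tn + fp)  # fraction of non-ACPs correctly rejected
    # Matthews correlation coefficient: ranges from -1 to 1, 0 means chance level.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Hypothetical counts for a test set of 50 positives and 50 negatives.
metrics = binary_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MCC uses all four cells of the confusion matrix, it penalizes a classifier that trades sensitivity for specificity (or the reverse), which is why it is often reported next to accuracy and AUC.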
Affiliation(s)
- Zeeshan Abbas
- Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon, Republic of Korea; Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, Republic of Korea
- Sunyeup Kim
- Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon, Republic of Korea
- Nangkyeong Lee
- Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon, Republic of Korea
- Seung Won Lee
- Department of Precision Medicine, Sungkyunkwan University School of Medicine, Suwon, Republic of Korea; Department of Artificial Intelligence, Sungkyunkwan University, Suwon 16419, Republic of Korea; Department of Metabiohealth, Sungkyunkwan University, Suwon 16419, Republic of Korea; Personalized Cancer Immunotherapy Research Center, Sungkyunkwan University School of Medicine, Suwon 16419, Republic of Korea.
13
Refahi M, Sokhansanj BA, Mell JC, Brown JR, Yoo H, Hearne G, Rosen GL. Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. Commun Biol 2025; 8:517. [PMID: 40155693 PMCID: PMC11953366 DOI: 10.1038/s42003-025-07902-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 03/07/2025] [Indexed: 04/01/2025] Open
Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
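Contrastive learning, as mentioned in this abstract, trains embeddings so that related sequences sit closer together than unrelated ones. The abstract does not say which objective Scorpio optimizes, so the sketch below uses a triplet margin loss purely as one common example of a contrastive objective; the function names and the toy two-dimensional embeddings are assumptions made for illustration.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Zero once the negative is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy embeddings (hypothetical values, not produced by any real model).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # e.g. a sequence from the same gene or taxon
easy_negative = [3.0, 4.0]  # already far from the anchor, so it adds no loss
hard_negative = [0.0, 1.5]  # nearly as close as the positive, so it is penalized

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
print(easy_loss, hard_loss)
```

In an actual training loop the embeddings would come from a neural encoder and the loss would be minimized by gradient descent; only the loss arithmetic is shown here.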
Affiliation(s)
- Bahrad A Sokhansanj
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Joshua C Mell
- College of Medicine, Drexel University, Philadelphia, PA, USA
- James R Brown
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Hyunwoo Yoo
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gavin Hearne
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gail L Rosen
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA.
14
Mao Y, Xu W, Shun Y, Chai L, Xue L, Yang Y, Li M. A multimodal model for protein function prediction. Sci Rep 2025; 15:10465. [PMID: 40140535 PMCID: PMC11947276 DOI: 10.1038/s41598-025-94612-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2025] [Accepted: 03/14/2025] [Indexed: 03/28/2025] Open
Abstract
Protein function, which is determined by sequence, structure, and other characteristics, plays a crucial role in an organism's performance. Existing protein function prediction methods mainly rely on sequence data and often ignore structural properties that are crucial for accurate prediction. Protein structure provides richer spatial and functional insights, which can significantly improve prediction accuracy. In this work, we propose a multi-modal protein function prediction model (MMPFP) that integrates protein sequence and structure information through the use of GCN, CNN, and Transformer models. We validate the model using the PDBest dataset, demonstrating that MMPFP outperforms traditional single-modal models in the molecular function (MF), biological process (BP), and cellular component (CC) prediction tasks. Specifically, MMPFP achieved AUPR scores of 0.693, 0.355, and 0.478; [Formula: see text] scores of 0.752, 0.629, and 0.691; and [Formula: see text] scores of 0.336, 0.488, and 0.459, showing a 3-5% improvement over single-modal models. Additionally, ablation studies confirm the effectiveness of the Transformer module within the GCN branch, further validating MMPFP's superior performance over existing methods. This multi-modal approach offers a more accurate and comprehensive framework for protein function prediction, addressing key limitations of current models.
Affiliation(s)
- Yu Mao
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
- WenHui Xu
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
- Yue Shun
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
- LongXin Chai
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
- Lei Xue
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
- Yong Yang
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
- Mei Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
15
Tyagi N, Vahab N, Tyagi S. Genome language modeling (GLM): a beginner's cheat sheet. Biol Methods Protoc 2025; 10:bpaf022. [PMID: 40370585 PMCID: PMC12077296 DOI: 10.1093/biomethods/bpaf022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/17/2025] [Accepted: 03/23/2025] [Indexed: 05/16/2025] Open
Abstract
Integrating genomics with diverse data modalities has the potential to revolutionize personalized medicine. However, this integration poses significant challenges due to the fundamental differences in data types and structures. The vast size of the genome necessitates transformation into a condensed representation containing key biomarkers and relevant features to ensure interoperability with other modalities. This commentary explores both conventional and state-of-the-art approaches to genome language modeling (GLM), with a focus on representing and extracting meaningful features from genomic sequences. We focus on the latest trends in applying language modeling techniques to genomic sequence data, treating it as a text modality. Effective feature extraction is essential for enabling machine learning models to analyze large genomic datasets, particularly within multimodal frameworks. We first provide a step-by-step guide to various genomic sequence preprocessing and tokenization techniques. Then we explore feature extraction methods for the transformation of tokens using frequency, embedding, and neural network-based approaches. Finally, we discuss machine learning (ML) applications in genomics, focusing on classification, regression, language processing algorithms, and multimodal integration. Additionally, we explore the role of GLM in functional annotation, emphasizing how advanced ML models, such as Bidirectional Encoder Representations from Transformers (BERT), enhance the interpretation of genomic data. To the best of our knowledge, we compile the first end-to-end analytic guide to convert complex genomic data into biologically interpretable information using GLM, thereby facilitating the development of novel data-driven hypotheses.
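The tokenization and frequency-based feature extraction steps named in this abstract can be illustrated with k-mers, a common way to split DNA into tokens. This is a minimal sketch under that assumption; the function names, the choice of k = 3, and the example sequence are illustrative and not taken from the article.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a sequence into k-mers; stride=1 overlaps, stride=k does not."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_frequency_vector(seq: str, k: int = 3) -> List[float]:
    """Relative frequency of every possible A/C/G/T k-mer, in a fixed order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    known = set(vocab)
    # k-mers containing ambiguous bases such as N are skipped in this sketch.
    counts = Counter(t for t in kmer_tokens(seq, k) if t in known)
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in vocab]


tokens = kmer_tokens("ACGTAC", k=3)
codon_like = kmer_tokens("ACGTAC", k=3, stride=3)
features = kmer_frequency_vector("ACGTAC", k=3)
print(tokens)         # overlapping 3-mers
print(codon_like)     # non-overlapping 3-mers
print(len(features))  # one entry per possible 3-mer
```

The fixed-length frequency vector can be passed to a conventional classifier, whereas the token list is the kind of input an embedding layer or transformer model would consume.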
Affiliation(s)
- Navya Tyagi
- AI and Data Science, Indian Institute of Technology, Madras, Chennai 600036, Tamil Nadu, India
- Amity Institute of Integrative Health Sciences, Amity University, Gurugram 122412, Haryana, India
- Naima Vahab
- School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
- Sonika Tyagi
- School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
16
Zheng S. Navigating the unstructured by evaluating alphafold's efficacy in predicting missing residues and structural disorder in proteins. PLoS One 2025; 20:e0313812. [PMID: 40131945 PMCID: PMC11936262 DOI: 10.1371/journal.pone.0313812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Accepted: 02/18/2025] [Indexed: 03/27/2025] Open
Abstract
The study investigated regions with undefined structures, known as "missing" segments in X-ray crystallography and cryo-electron microscopy (Cryo-EM) data, by assessing their predicted structural confidence and disorder scores. Utilizing a comprehensive dataset from the Protein Data Bank (PDB), residues were categorized as "modeled", "hard missing" and "soft missing" based on their visibility in structural datasets. Key features were determined, including the predicted local distance difference test (pLDDT) confidence score from AlphaFold2, an advanced structure prediction tool, and a disorder score from IUPred, a traditional disorder prediction method. To enhance prediction performance for unstructured residues, we employed a Long Short-Term Memory (LSTM) model, integrating both scores with amino acid sequences. Notable patterns such as composition, region lengths and prediction scores were observed in unstructured residues and regions identified through structural experiments over our studied period. Our findings also indicate that "hard missing" residues often align with low confidence scores, whereas "soft missing" residues exhibit dynamic behavior that can complicate predictions. The incorporation of pLDDT, IUPred scores, and sequence data into the LSTM model has improved the differentiation between structured and unstructured residues, particularly for shorter unstructured regions. This research elucidates the relationship between established computational predictions and experimental structural data, enhancing our ability to target structurally significant areas for research and guiding experimental designs toward functionally relevant regions.
Affiliation(s)
- Sen Zheng
- Bio-Electron Microscopy Facility, iHuman Institution, ShanghaiTech University, Shanghai, China
17
Michels J, Bandarupalli R, Ahangar Akbari A, Le T, Xiao H, Li J, Hom EFY. Natural Language Processing Methods for the Study of Protein-Ligand Interactions. J Chem Inf Model 2025; 65:2191-2213. [PMID: 39993834 PMCID: PMC11898065 DOI: 10.1021/acs.jcim.4c01907] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2024] [Revised: 02/05/2025] [Accepted: 02/06/2025] [Indexed: 02/26/2025]
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases in existing data sets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
Affiliation(s)
- James Michels
- Department of Computer and Information Science, University of Mississippi, University, Mississippi 38677, United States
- Ramya Bandarupalli
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Amin Ahangar Akbari
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Thai Le
- Department of Computer Science, Indiana University, Bloomington, Indiana 47408, United States
- Hong Xiao
- Department of Computer and Information Science and Institute for Data Science, University of Mississippi, University, Mississippi 38677, United States
- Jing Li
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, Mississippi 38677, United States
- Erik F. Y. Hom
- Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, Mississippi 38677, United States
18
Wang J, Liu Z, Zhang C, Cao Y, Liu B, Shu Y, Thum Y, Zhang J. A deep learning approach to understanding controlled ovarian stimulation and in vitro fertilization dynamics. Sci Rep 2025; 15:7821. [PMID: 40050418 PMCID: PMC11885538 DOI: 10.1038/s41598-025-92186-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2024] [Accepted: 02/25/2025] [Indexed: 03/09/2025] Open
Abstract
Infertility, recognized by the World Health Organization (WHO) as a disease affecting the male or female reproductive system, presents a global challenge due to its impact on one in six individuals worldwide. Given the high prevalence of infertility and the limited available resources in fertility care, infertility creates substantial obstacles to reproductive autonomy and places a considerable burden on fertility care providers. While existing research is exploring the use of artificial intelligence (AI) methods to assist fertility care providers in managing in vitro fertilization (IVF) cycles, these attempts fail to accurately predict specific aspects such as medication dosage and intermediate ovarian responses during controlled ovarian stimulation (COS) within IVF cycles. Our current work developed Edwards, a deep learning model based on the Transformer-Encoder architecture, to improve the prediction outcomes. Edwards is designed to capture the temporal features within the sequential process of IVF cycles. It can provide treatment plan options and predict hormone profiles and ovarian responses at any stage based on both current and historical data. By considering the full context of the process, Edwards demonstrates improved accuracy in predicting the final outcomes of the IVF process compared to previous approaches based on traditional machine learning. The strength of our current deep learning model stems from its ability to learn the intricate endocrinological mechanisms of the female reproductive system, especially in the context of COS in IVF cycles.
Affiliation(s)
- Jia Wang
- New Hope Fertility Center, New York, 10019, US.
- Zitao Liu
- New Hope Fertility Center, New York, 10019, US
- Yu Cao
- Department of Computer Science, University of Massachusetts Lowell, Lowell, 01854, US
- Benyuan Liu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, 01854, US
- Yimin Shu
- University of Kansas Medical Center, Overland Park, 66211, US
- Yau Thum
- Lister Fertility Clinic, London, SW1W8RH, UK
- John Zhang
- New Hope Fertility Center, New York, 10019, US
19
Mao Y, Wu J, Weng J, Li M, Xiong Y, Gu W, Jiang R, Pang R, Lin X, Tang D. Inter-view contrastive learning and miRNA fusion for lncRNA-protein interaction prediction in heterogeneous graphs. Brief Bioinform 2025; 26:bbaf148. [PMID: 40194558 PMCID: PMC11975365 DOI: 10.1093/bib/bbaf148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2024] [Revised: 02/19/2025] [Accepted: 03/16/2025] [Indexed: 04/09/2025] Open
Abstract
Predicting long non-coding RNA (lncRNA)-protein interactions is essential for understanding biological processes and discovering new therapeutic targets. In this study, we propose a novel model based on inter-view contrastive learning and miRNA fusion for lncRNA-protein interaction (LPI) prediction, called ICMF-LPI, which utilizes a heterogeneous information network to enhance LPI prediction. The model integrates miRNA as a mediator, constructing an lncRNA-miRNA-protein network, and employs metapath to extract diverse relationships from heterogeneous graphs. By fusing miRNA-related information and leveraging contrastive learning across inter-views, ICMF-LPI effectively captures potential interactions. Experimental results, including five-fold cross-validation, demonstrate the model's superior performance compared to several state-of-the-art methods, with significant improvements in the area under the receiver operating characteristic curve and the area under the precision-recall curve metrics. Notably, even when direct LPI connections are excluded, ICMF-LPI still achieves competitive predictive accuracy, performing comparably or better than some existing models. This demonstrates that the proposed model is effective in scenarios where direct interaction data are unavailable. This approach offers a promising direction for developing predictive models in bioinformatics, particularly in challenging conditions.
Affiliation(s)
- Yijun Mao
- College of Mathematics and Informatics, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
- National Key Laboratory of Data Space Technology and System, 3 Minzhuang Road, Haidian District, Beijing 100195, China
- Jiale Wu
- College of Mathematics and Informatics, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
- Jian Weng
- College of Cyber Security, Jinan University, 601 West Huangpu Avenue, Tianhe District, GuangZhou 510632, China
- Ming Li
- College of Cyber Security, Jinan University, 601 West Huangpu Avenue, Tianhe District, GuangZhou 510632, China
- Yunyan Xiong
- School of Computer and Information Engineering, Guangdong Polytechnic of Industry and Commerce, 1098 North Guangzhou Avenue, Tianhe District, GuangZhou 510510, China
- Wanrong Gu
- College of Mathematics and Informatics, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
- Rongjin Jiang
- Department of Digital process, Wens Foodstuff Group Co., Ltd, 9 Dongdi North Road, Xinxing County, YunFu 527400, China
- Rui Pang
- State Key Laboratory of Green Pesticide, College of Plant Protection, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
- Xudong Lin
- College of Mathematics and Informatics, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
- Deyu Tang
- College of Mathematics and Informatics, South China Agricultural University, 483 Wushan Road, Tianhe District, GuangZhou 510642, China
20
Kalemati M, Zamani Emani M, Koohi S. InceptionDTA: Predicting drug-target binding affinity with biological context features and inception networks. Heliyon 2025; 11:e42476. [PMID: 40007773 PMCID: PMC11850134 DOI: 10.1016/j.heliyon.2025.e42476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 01/23/2025] [Accepted: 02/04/2025] [Indexed: 02/27/2025] Open
Abstract
Predicting drug-target binding affinity via in silico methods is crucial in drug discovery. Traditional machine learning relies on manually engineered features from limited data, leading to suboptimal performance. In contrast, deep learning excels at extracting features from raw sequences but often overlooks essential biological context features, hindering effective binding prediction. Additionally, these models struggle to capture global and local feature distributions efficiently in protein sequences and drug SMILES. Previous state-of-the-art models, like transformers and graph-based approaches, face scalability and resource efficiency challenges. Transformers struggle with scalability, while graph-based methods have difficulty handling large datasets and complex molecular structures. In this paper, we introduce InceptionDTA, a novel drug-target binding affinity prediction model that leverages CharVec, an enhanced variant of Prot2Vec, to incorporate both biological context and categorical features into protein sequence encoding. InceptionDTA utilizes a multi-scale convolutional architecture based on the Inception network to capture features at various spatial resolutions, enabling the extraction of both local and global features from protein sequences and drug SMILES. We evaluate InceptionDTA across a range of benchmark datasets commonly used in drug-target binding affinity prediction. Our results demonstrate that InceptionDTA outperforms various sequence-based, transformer-based, and graph-based deep learning approaches across warm-start, refined, and cold-start splitting settings. In addition to using CharVec, which demonstrates greater accuracy in absolute predictions, InceptionDTA also includes a version that employs simple label encoding and excels in ranking and predicting relative binding affinities. This versatility highlights how InceptionDTA can effectively adapt to various predictive requirements. 
These results emphasize the promise of our approach in expediting drug repurposing initiatives, enabling the discovery of new drugs, and contributing to advancements in disease treatment.
Affiliation(s)
- Mahmood Kalemati
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Mojtaba Zamani Emani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
21
Vural O, Jololian L. Machine learning approaches for predicting protein-ligand binding sites from sequence data. FRONTIERS IN BIOINFORMATICS 2025; 5:1520382. [PMID: 39963299 PMCID: PMC11830693 DOI: 10.3389/fbinf.2025.1520382] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025] Open
Abstract
Proteins, composed of amino acids, are crucial for a wide range of biological functions. Proteins have various interaction sites, one of which is the protein-ligand binding site, essential for molecular interactions and biochemical reactions. These sites enable proteins to bind with other molecules, facilitating key biological functions. Accurate prediction of these binding sites is pivotal in computational drug discovery, helping to identify therapeutic targets and facilitate treatment development. Machine learning has made significant contributions to this field by improving the prediction of protein-ligand interactions. This paper reviews studies that use machine learning to predict protein-ligand binding sites from sequence data, focusing on recent advancements. The review examines various embedding methods and machine learning architectures, addressing current challenges and the ongoing debates in the field. Additionally, research gaps in the existing literature are highlighted, and potential future directions for advancing the field are discussed. This study provides a thorough overview of sequence-based approaches for predicting protein-ligand binding sites, offering insights into the current state of research and future possibilities.
Affiliation(s)
- Orhun Vural
- Department of Electrical and Computer Engineering, The University of Alabama at Birmingham, Birmingham, AL, United States
22
Feng T, Chen X, Wu S, Tang W, Zhou H, Fang Z. Predicting the bacterial host range of plasmid genomes using the language model-based one-class support vector machine algorithm. Microb Genom 2025; 11. [PMID: 39932495 DOI: 10.1099/mgen.0.001355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2025] Open
Abstract
The prediction of the plasmid host range is crucial for investigating the dissemination of plasmids and the transfer of resistance and virulence genes mediated by plasmids. Several machine learning-based tools have been developed to predict plasmid host ranges. These tools have been trained and tested based on the bacterial host records of plasmids in related databases. Typically, a plasmid genome in databases such as the National Center for Biotechnology Information is annotated with only one or a few bacterial hosts, which does not encompass all possible hosts. Consequently, existing methods may significantly underestimate the host ranges of mobile plasmids. In this work, we propose a novel method named HRPredict, which employs a word vector model to digitally represent the encoded proteins on plasmid genomes. Since it is difficult to confirm which host a particular plasmid definitely cannot enter, we developed a machine learning approach for predicting whether a plasmid can enter a specific bacterium as a no-negative samples learning task. Using multiple one-class support vector machine (SVM) models that do not require negative samples for training, HRPredict predicts the host range of plasmids across 45 families, 56 genera and 56 species. In the benchmark test set, we constructed reliable negative samples for each host taxonomic unit via two indirect methods, and we found that the area under the curve (AUC), F1-score, recall, precision and accuracy of most taxonomic unit prediction models exceeded 0.9. Among the 13 broad-host-range plasmid types, HRPredict demonstrated greater coverage than HOTSPOT and PlasmidHostFinder, thus successfully predicting the majority of hosts previously reported. Through feature importance calculation for each SVM model, we found that genes closely related to the plasmid host range are involved in functions such as bacterial adaptability, pathogenicity and survival. 
These findings provide significant insight into the mechanisms through which bacteria adjust to diverse environments through plasmids. The HRPredict algorithm is expected to facilitate in-depth research on the spread of broad-host-range plasmids and enable host-range predictions for novel plasmids reconstructed from microbiome sequencing data.
Affiliation(s)
- Tao Feng
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Guangzhou Chest Hospital, Hengzhigang Road 1066, Guangzhou, 510095, PR China
- Xirao Chen
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Shufang Wu
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Waijiao Tang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Hongwei Zhou
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
- Zhencheng Fang
- Microbiome Medicine Center, Department of Laboratory Medicine, Zhujiang Hospital, Southern Medical University, Guangzhou, 510280, PR China
23
Liu J, Yang M, Yu Y, Xu H, Wang T, Li K, Zhou X. Advancing bioinformatics with large language models: components, applications and perspectives. arXiv 2025:arXiv:2401.04155v2. [PMID: 38259343 PMCID: PMC10802675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will provide a comprehensive overview of the essential components of large language models (LLMs) in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we will introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we will offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
Affiliation(s)
- Jiajia Liu
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Mengyuan Yang
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, China
- Yankai Yu
  - School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
- Haixia Xu
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Tiangang Wang
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Kang Li
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Xiaobo Zhou
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
  - McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
|
24
|
Bhargav P, Mukherjee A. AlphaMut: A Deep Reinforcement Learning Model to Suggest Helix-Disrupting Mutations. J Chem Theory Comput 2025; 21:463-473. [PMID: 39702999 DOI: 10.1021/acs.jctc.4c01387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2024]
Abstract
Helices are important secondary structural motifs within proteins and are pivotal in numerous physiological processes. While amino acids (AA) such as alanine and leucine are known to promote helix formation, proline and glycine disfavor it. Helical structure formation, however, also depends on its environment, and hence, prior prediction of a mutational effect on a helical structure is difficult. Here, we employ a reinforcement learning algorithm to develop a predictive model for helix-disrupting mutations. We start with a model to disrupt helices independent of their protein environment. Our results show that only a few mutations lead to a drastic disruption of the target helix. We further extend our approach to helices in proteins and validate the results using rigorous free energy calculations. Our strategy identifies amino acids crucial for maintaining structural integrity and predicts key mutations that could alter protein structure. Through our work, we present a new use case for reinforcement learning in protein structure disruption.
Affiliation(s)
- Prathith Bhargav
  - Department of Chemistry, Indian Institute of Science Education and Research Pune, Dr Homi Bhabha Road, Pashan, Pune, Maharashtra 411008, India
- Arnab Mukherjee
  - Department of Chemistry, Indian Institute of Science Education and Research Pune, Dr Homi Bhabha Road, Pashan, Pune, Maharashtra 411008, India
  - Department of Data Science, Indian Institute of Science Education and Research Pune, Dr Homi Bhabha Road, Pashan, Pune, Maharashtra 411008, India
|
25
|
Hu J, Hu S, Xia M, Zheng K, Zhang X. Drug-target binding affinity prediction based on power graph and word2vec. BMC Med Genomics 2025; 18:9. [PMID: 39806396 PMCID: PMC11730168 DOI: 10.1186/s12920-024-02073-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Accepted: 12/13/2024] [Indexed: 01/16/2025] Open
Abstract
BACKGROUND: Drug and protein targets affect the physiological functions and metabolic effects of the body through bonding reactions, and accurate prediction of drug-protein target interactions is crucial for drug development. In order to shorten the drug development cycle and reduce costs, machine learning methods are gradually playing an important role in the field of drug-target interactions. RESULTS: Compared with other methods, regression-based drug target affinity is more representative of the binding ability. Accurate prediction of drug target affinity can effectively reduce the time and cost of drug retargeting and new drug development. In this paper, a drug target affinity prediction model (WPGraphDTA) based on power graph and word2vec is proposed. CONCLUSIONS: In this model, the drug molecular features in the power graph module are extracted by a graph neural network, and then the protein features are obtained by the word2vec method. After feature fusion, they are input into three fully connected layers to obtain the drug target affinity prediction value. We conducted experiments on the Davis and KIBA datasets, and the experimental results showed that WPGraphDTA exhibited good prediction performance.
Affiliation(s)
- Jing Hu
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
  - Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
- Shuo Hu
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
- Minghao Xia
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
- Kangxing Zheng
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
- Xiaolong Zhang
  - School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, 430065, Hubei, China
  - Hubei Province Key Laboratory of Intelligent Information Processing and Real-Time Industrial System, Wuhan, China
  - Institute of Big Data Science and Engineering, Wuhan University of Science and Technology, Wuhan, Hubei, China
|
26
|
Zeng R, Li Z, Li J, Zhang Q. DNA promoter task-oriented dictionary mining and prediction model based on natural language technology. Sci Rep 2025; 15:153. [PMID: 39747934 PMCID: PMC11697570 DOI: 10.1038/s41598-024-84105-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/07/2024] [Accepted: 12/19/2024] [Indexed: 01/04/2025] Open
Abstract
Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT.
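For orientation, the fixed-length segmentation that the abstract calls arbitrary is commonly done as k-mer tokenization. The sketch below illustrates only that baseline; it is not the dictionary-mining method of the paper, and the function name `kmer_tokens` and its parameters are hypothetical.

```python
def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> list[str]:
    """Split a DNA string into fixed-length k-mers ("words").

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Both are assumed conventions for illustration, not the paper's dictionary.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = seq.upper()
    # Slide a window of width k over the sequence; a short tail is dropped.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


overlapping = kmer_tokens("TATAAAGGC", k=3)            # 7 tokens
disjoint = kmer_tokens("TATAAAGGC", k=3, stride=3)     # 3 tokens
```

The choice of `k` and `stride` fixes the vocabulary before any training takes place, which is the kind of arbitrariness a learned dictionary is meant to avoid.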
Affiliation(s)
- Ruolei Zeng
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA
- Zihan Li
  - National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China
- Jialu Li
  - National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China
- Qingchuan Zhang
  - National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China
|
27
|
Bowyer S, Allen DJ, Furnham N. Unveiling the ghost: machine learning's impact on the landscape of virology. J Gen Virol 2025; 106. [PMID: 39804261 DOI: 10.1099/jgv.0.002067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2025] Open
Abstract
The complexity and speed of evolution in viruses with RNA genomes makes predictive identification of variants with epidemic or pandemic potential challenging. In recent years, machine learning has become an increasingly capable technology for addressing this challenge, as advances in methods and computational power have dramatically improved the performance of models and led to their widespread adoption across industries and disciplines. Nascent applications of machine learning technology to virus research have now expanded, providing new tools for handling large-scale datasets and leading to a reshaping of existing workflows for phenotype prediction, phylogenetic analysis, drug discovery and more. This review explores how machine learning has been applied to and has impacted the study of viruses, before addressing the strengths and limitations of its techniques and finally highlighting the next steps that are needed for the technology to reach its full potential in this challenging and ever-relevant research area.
Affiliation(s)
- Sebastian Bowyer
  - Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
- David J Allen
  - Department of Comparative Biomedical Sciences, Section Infection and Immunity, School of Veterinary Medicine, Faculty of Health and Medical Sciences, University of Surrey, Guildford, UK
- Nicholas Furnham
  - Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
|
28
|
Santa S, Kwofie SK, Agyenkwa-Mawuli K, Quaye O, Brown CA, Tagoe EA. Prediction of Human Papillomavirus-Host Oncoprotein Interactions Using Deep Learning. Bioinform Biol Insights 2024; 18:11779322241304666. [PMID: 39664297 PMCID: PMC11632871 DOI: 10.1177/11779322241304666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 11/16/2024] [Indexed: 12/13/2024] Open
Abstract
Background: Human papillomavirus (HPV) causes disease through complex interactions between viral and host proteins, with the PI3K signaling pathway playing a key role. Proteins like AKT, IQGAP1, and MMP16 are involved in HPV-related cancer development. Traditional methods for studying protein-protein interactions (PPIs) are labor-intensive and time-consuming. Computational models are becoming more popular as they are less labor-intensive and often more efficient. This study aimed to develop a deep learning model to predict interactions between HPV and host proteins. Method: Available HPV and host protein interaction data was retrieved from the protocol of Eckhardt et al. and used to train a Recurrent Neural Network algorithm. Training of the model was performed on the Spyder (Scientific Python Development Environment) platform using the Python libraries scikit-learn, pandas, NumPy, and TensorFlow. The data was split into training, validation, and testing sets in the ratio 7:1:2, respectively. After training and validation, the model was used to predict the possible interactions between the E6 and E7 proteins of HPV 31 and HPV 18 and the host oncoproteins AKT, IQGAP1 and MMP16. Results: The model showed good performance, with an MCC score of 0.7937 and all other metrics above 88%. The model predicted an interaction between E6 and E7 of both HPV types with AKT, while only HPV31 E7 was shown to interact with IQGAP1 and MMP16, with confidence scores of 0.9638 and 0.5793, respectively. Conclusion: The current model strongly predicted interactions of HPV E6 and E7 with the PI3K pathway, and the viral proteins may be involved in AKT activation, driving HPV-associated cancers. This model supports the robust prediction of interactomes for experimental validation.
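The 7:1:2 split and the MCC metric named in this abstract can be sketched in plain Python. This is a minimal illustration under assumed conventions (binary labels coded 0/1, a seeded shuffle before splitting); the helper names `split_7_1_2` and `mcc` are hypothetical and the code is not taken from the study.

```python
import math
import random


def split_7_1_2(items, seed=0):
    """Shuffle and split into train/validation/test parts in a 7:1:2 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)       # seeded for reproducibility
    n_train = len(items) * 7 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]            # remainder, about 20%
    return train, val, test


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


train, val, test = split_7_1_2(range(100))
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, which is why it is often reported alongside accuracy when classes are imbalanced.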
Affiliation(s)
- Sheila Santa
  - Department of Biochemistry, Cell & Molecular Biology/West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), College of Basic and Applied Sciences, University of Ghana, Accra, Ghana
  - Department of Medical Laboratory Sciences, School of Biomedical and Allied Health Sciences, College of Health Sciences, University of Ghana, Accra, Ghana
- Samuel Kojo Kwofie
  - Department of Biomedical Engineering, School of Engineering Sciences, College of Basic and Applied Sciences, University of Ghana, Accra, Ghana
- Kwasi Agyenkwa-Mawuli
  - Noguchi Memorial Institute for Medical Research, College of Health Sciences, University of Ghana, Accra, Ghana
- Osbourne Quaye
  - Department of Biochemistry, Cell & Molecular Biology/West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), College of Basic and Applied Sciences, University of Ghana, Accra, Ghana
- Charles A Brown
  - Department of Medical Laboratory Sciences, School of Biomedical and Allied Health Sciences, College of Health Sciences, University of Ghana, Accra, Ghana
- Emmanuel A Tagoe
  - Department of Medical Laboratory Sciences, School of Biomedical and Allied Health Sciences, College of Health Sciences, University of Ghana, Accra, Ghana
|
29
|
Liang L, Duan Y, Zeng C, Wan B, Yao H, Liu H, Lu T, Zhang Y, Chen Y, Shen J. CPIScore: A Deep Learning Approach for Rapid Scoring and Interpretation of Protein-Ligand Binding Interactions. J Chem Inf Model 2024; 64:8809-8823. [PMID: 39563077 DOI: 10.1021/acs.jcim.4c01175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2024]
Abstract
Protein-ligand binding affinity prediction is a crucial and challenging task in the field of drug discovery. However, traditional simulation-based computational approaches are often prohibitively time-consuming, limiting their practical utility. In this study, we introduce a novel deep learning method, CPIScore, which leverages the capabilities of Transformer and Graph Convolutional Networks (GCN) to enhance the prediction of protein-ligand binding affinity. CPIScore utilizes the Transformer architecture to capture comprehensive global contexts of protein and ligand sequences, while the GCN component effectively extracts local features from small molecular graphs. Our results demonstrate that CPIScore surpasses both traditional machine learning and other deep learning models in accuracy, achieving a Pearson's r of 0.74 on our test set. Furthermore, CPIScore has been validated across multiple targets, proving its ability to discern inhibitors from a diverse compound library with high enrichment rates. Notably, when applied to a generated focused library of compounds, CPIScore successfully identified six potent small-molecule inhibitors of ATR, which were tested experimentally and four small molecules exhibited inhibitory activity below ten nanomoles. These results highlight CPIScore's potential to significantly streamline and enhance the efficiency of drug discovery processes.
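Pearson's r, the headline metric in this abstract, measures linear agreement between predicted and measured affinities. A minimal sketch, assuming two equal-length lists of numbers; the function name `pearson_r` and the sample values are illustrative only and unrelated to the CPIScore data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up affinities (e.g. pKd values), purely for demonstration.
measured = [5.1, 6.3, 7.0, 8.2]
predicted = [5.4, 6.0, 7.3, 7.9]
r = pearson_r(measured, predicted)
```

Because r is invariant to scale and offset, it is usually reported together with an error measure such as RMSE.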
Affiliation(s)
- Li Liang
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Yunxin Duan
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Chen Zeng
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Boheng Wan
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Huifeng Yao
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Haichun Liu
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Tao Lu
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
  - State Key Laboratory of Natural Medicines, China Pharmaceutical University, 24 Tongjiaxiang, Nanjing 210009, China
- Yanmin Zhang
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
- Yadong Chen
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
  - Hebei API Crystallization Technology Innovation Center, Shimen Building, No. 8 Xingye Street, Shijiazhuang 052165, China
- Jun Shen
  - Laboratory of Molecular Design and Drug Discovery, School of Science, China Pharmaceutical University, 639 Longmian Avenue, Nanjing 211198, China
|
30
|
Ghazikhani H, Butler G. Ion channel classification through machine learning and protein language model embeddings. J Integr Bioinform 2024; 21:jib-2023-0047. [PMID: 39572876 PMCID: PMC11698620 DOI: 10.1515/jib-2023-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 09/04/2024] [Indexed: 01/06/2025] Open
Abstract
Ion channels are critical membrane proteins that regulate ion flux across cellular membranes, influencing numerous biological functions. The resource-intensive nature of traditional wet lab experiments for ion channel identification has led to an increasing emphasis on computational techniques. This study extends our previous work on protein language models for ion channel prediction, significantly advancing the methodology and performance. We employ a comprehensive array of machine learning algorithms, including k-Nearest Neighbors, Random Forest, Support Vector Machines, and Feed-Forward Neural Networks, alongside a novel Convolutional Neural Network (CNN) approach. These methods leverage fine-tuned embeddings from ProtBERT, ProtBERT-BFD, and MembraneBERT to differentiate ion channels from non-ion channels. Our empirical findings demonstrate that TooT-BERT-CNN-C, which combines features from ProtBERT-BFD and a CNN, substantially surpasses existing benchmarks. On our original dataset, it achieves a Matthews Correlation Coefficient (MCC) of 0.8584 and an accuracy of 98.35 %. More impressively, on a newly curated, larger dataset (DS-Cv2), it attains an MCC of 0.9492 and an ROC AUC of 0.9968 on the independent test set. These results not only highlight the power of integrating protein language models with deep learning for ion channel classification but also underscore the importance of using up-to-date, comprehensive datasets in bioinformatics tasks. Our approach represents a significant advancement in computational methods for ion channel identification, with potential implications for accelerating research in ion channel biology and aiding drug discovery efforts.
Affiliation(s)
- Hamed Ghazikhani
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
- Gregory Butler
  - Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
|
31
|
Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2024; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Affiliation(s)
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Jing Yu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Daisuke Kihara
  - Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
  - Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
|
32
|
Chang Y, Wu L. CapHLA: a comprehensive tool to predict peptide presentation and binding to HLA class I and class II. Brief Bioinform 2024; 26:bbae595. [PMID: 39688477 DOI: 10.1093/bib/bbae595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2024] [Revised: 09/13/2024] [Accepted: 12/14/2024] [Indexed: 12/18/2024] Open
Abstract
Human leukocyte antigen class I (HLA-I) and class II (HLA-II) proteins play an essential role in epitope binding and presentation to initiate an immune response. Accurate prediction of peptide-HLA (pHLA) binding and presentation is critical for developing effective immunotherapies. However, current tools can predict antigens exclusively for pHLA-I or pHLA-II, but not both; have constraints on peptide length; and commonly show unsatisfactory predictive accuracy. Here, we developed a convolution and attention-based model, CapHLA, trained with eluted ligand and binding affinity mass spectrometry data, to predict peptide presentation probability (PB) and binding affinities (BA) for HLA-I and HLA-II. In comparison with 11 other methods, CapHLA consistently showed improved performance in predicting pHLA BA and PB, particularly in HLA-II and non-classical peptide length datasets. Using CapHLA PB and BA predictions in combination with antigen expression level (EP) from transcriptomic data, we developed a neoantigen quality model for predicting immunotherapy response. In analyses of clinical response among 276 cancer patients given immunotherapy and overall survival in 7228 cancer patients, our neoantigen quality model outperformed other genetics-based models in predicting response to checkpoint inhibitors and patient prognosis. This study provides a versatile neoantigen screening tool, illustrating the prognostic value of neoantigen quality.
Affiliation(s)
- Yunjian Chang
  - Key Laboratory of RNA Science and Engineering, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
- Ligang Wu
  - Key Laboratory of RNA Science and Engineering, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, 320 Yueyang Road, Shanghai 200031, China
|
33
|
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] Open
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we provide fundamental concepts in graph embedding algorithms. This study described graph representation learning methods for protein function prediction based on four principal data categories, namely PPI network, protein structure, Gene Ontology graph, and integrated graph. The commonly used approaches for each category were summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions were discussed, and directions for future research within the protein research community were suggested.
Affiliation(s)
- Thi Thuy Duong Vu
  - Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
- Jeongho Kim
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
- Jaehee Jung
  - Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
|
34
|
Abdollahi S, Schaub DP, Barroso M, Laubach NC, Hutwelker W, Panzer U, Gersting SW, Bonn S. A comprehensive comparison of deep learning-based compound-target interaction prediction models to unveil guiding design principles. J Cheminform 2024; 16:118. [PMID: 39468635 PMCID: PMC11520803 DOI: 10.1186/s13321-024-00913-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Accepted: 10/10/2024] [Indexed: 10/30/2024] Open
Abstract
The evaluation of compound-target interactions (CTIs) is at the heart of drug discovery efforts. Given the substantial time and monetary costs of classical experimental screening, significant efforts have been dedicated to developing deep learning-based models that can accurately predict CTIs. A comprehensive comparison of these models on a large, curated CTI dataset is, however, still lacking. Here, we perform an in-depth comparison of 12 state-of-the-art deep learning architectures that use different protein and compound representations. The models were selected for their reported performance and architectures. To reliably compare model performance, we curated over 300 thousand binding and non-binding CTIs and established several gold-standard datasets of varying size and information. Based on our findings, DeepConv-DTI consistently outperforms other models in CTI prediction performance across the majority of datasets. It achieves an MCC of 0.6 or higher for most of the datasets and is one of the fastest models in training and inference. These results indicate that utilizing convolutional-based windows as in DeepConv-DTI to traverse trainable embeddings is a highly effective approach for capturing informative protein features. We also observed that physicochemical embeddings of targets increased model performance. We therefore modified DeepConv-DTI to include normalized physicochemical properties, which resulted in the overall best performing model Phys-DeepConv-DTI. This work highlights how the systematic evaluation of input features of compounds and targets, as well as their corresponding neural network architectures, can serve as a roadmap for the future development of improved CTI models. Scientific contribution: This work features comprehensive CTI datasets to allow for the objective comparison and benchmarking of CTI prediction algorithms. Based on this dataset, we gained insights into which embeddings of compounds and targets and which deep learning-based algorithms perform best, providing a blueprint for the future development of CTI algorithms. Using the insights gained from this screen, we provide a novel CTI algorithm with state-of-the-art performance.
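The abstract above reports performance as the Matthews correlation coefficient (MCC). As a reminder of what that number measures, here is a minimal sketch of the standard binary MCC formula in plain Python. The function name and the toy labels are illustrative assumptions and are not taken from the paper or its datasets.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Binary MCC; labels are 1 (binding) or 0 (non-binding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Common convention: MCC is reported as 0 when a margin is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy labels, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when binding and non-binding examples are imbalanced.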
Affiliation(s)
- Sina Abdollahi
- Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Darius P Schaub
- Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Madalena Barroso
- University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Nora C Laubach
- University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Wiebke Hutwelker
- University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Ulf Panzer
- III. Department of Medicine, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
- Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany
| | - Søren W Gersting
- University Children's Research, UCR@Kinder-UKE, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany.
| | - Stefan Bonn
- Institute of Medical Systems Biology, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany.
- Hamburg Center for Translational Immunology (HCTI), University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany.
- Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, Hamburg, 20251, Germany.
| |
|
35
|
Michels J, Bandarupalli R, Akbari AA, Le T, Xiao H, Li J, Hom EFY. Natural Language Processing Methods for the Study of Protein-Ligand Interactions. ARXIV 2024:arXiv:2409.13057v2. [PMID: 39483353 PMCID: PMC11527106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2024]
Abstract
Natural Language Processing (NLP) has revolutionized the way computers are used to study and interact with human languages and is increasingly influential in the study of protein and ligand binding, which is critical for drug discovery and development. This review examines how NLP techniques have been adapted to decode the "language" of proteins and small molecule ligands to predict protein-ligand interactions (PLIs). We discuss how methods such as long short-term memory (LSTM) networks, transformers, and attention mechanisms can leverage different protein and ligand data types to identify potential interaction patterns. Significant challenges are highlighted, including the scarcity of high-quality negative data, difficulties in interpreting model decisions, and sampling biases of existing datasets. We argue that focusing on improving data quality, enhancing model robustness, and fostering both collaboration and competition could catalyze future advances in machine-learning-based predictions of PLIs.
Affiliation(s)
- James Michels
- Department of Computer Science, University of Mississippi, University, MS
| | - Ramya Bandarupalli
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
| | - Amin Ahangar Akbari
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
| | - Thai Le
- Department of Computer Science, Indiana University, Bloomington, IN
| | - Hong Xiao
- Department of Computer Science, University of Mississippi, University, MS
| | - Jing Li
- Department of BioMolecular Sciences, School of Pharmacy, University of Mississippi, University, MS
| | - Erik F Y Hom
- Department of Biology and Center for Biodiversity and Conservation Research, University of Mississippi, University, MS
| |
|
36
|
Flamholz ZN, Li C, Kelly L. Improving viral annotation with artificial intelligence. mBio 2024; 15:e0320623. [PMID: 39230289 PMCID: PMC11481560 DOI: 10.1128/mbio.03206-23] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
Viruses of bacteria, "phages," are fundamental, poorly understood components of microbial community structure and function. Additionally, their dependence on hosts for replication positions phages as unique sensors of ecosystem features and environmental pressures. High-throughput sequencing approaches have begun to give us access to the diversity and range of phage populations in complex microbial community samples, and metagenomics is currently the primary tool with which we study phage populations. The study of phages by metagenomic sequencing, however, is fundamentally limited by viral diversity, which results in the vast majority of viral genomes and metagenome-annotated genomes lacking annotation. To harness bacteriophages for applications in human and environmental health and disease, we need new methods to organize and annotate viral sequence diversity. We recently demonstrated that methods that leverage self-supervised representation learning can supplement statistical sequence representations for remote viral protein homology detection in the ocean virome and propose that consideration of the functional content of viral sequences allows for the identification of similarity in otherwise sequence-diverse viruses and viral-like elements for biological discovery. In this review, we describe the potential and pitfalls of large language models for viral annotation. We describe the need for new approaches to annotate viral sequences in metagenomes, the fundamentals of what protein language models are and how one can use them for sequence annotation, the strengths and weaknesses of these models, and future directions toward developing better models for viral annotation more broadly.
Affiliation(s)
- Zachary N. Flamholz
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
| | - Charlotte Li
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
| | - Libusha Kelly
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, New York, USA
- Department of Microbiology and Immunology, Albert Einstein College of Medicine, Bronx, New York, USA
| |
|
37
|
Mera-Banguero C, Orduz S, Cardona P, Orrego A, Muñoz-Pérez J, Branch-Bedoya JW. AmpClass: an Antimicrobial Peptide Predictor Based on Supervised Machine Learning. AN ACAD BRAS CIENC 2024; 96:e20230756. [PMID: 39383429 DOI: 10.1590/0001-3765202420230756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 04/07/2024] [Indexed: 10/11/2024] Open
Abstract
In the last decades, antibiotic resistance has been considered a severe problem worldwide. Antimicrobial peptides (AMPs) are molecules that have shown potential for the development of new drugs against antibiotic-resistant bacteria. Nowadays, medicinal drug researchers use supervised learning methods to screen new peptides with antimicrobial potency to save time and resources. In this work, we consolidate a database with 15945 AMPs and 12535 non-AMPs taken as the base to train a pool of supervised learning models to recognize peptides with antimicrobial activity. Results show that the proposed tool (AmpClass) outperforms classical state-of-the-art prediction models and achieves similar results compared with deep learning models.
Affiliation(s)
- Carlos Mera-Banguero
- Instituto Tecnológico Metropolitano, Departamento de Sistemas de Información, Facultad de Ingeniería, Calle 54A # 30-01, 050013, Medellín, Antioquia, Colombia
- Universidad de Antioquia, Departamento de Ingeniería de Sistemas, Facultad de Ingenierías, Calle 67 # 53 - 108, 050010, Medellín, Antioquia, Colombia
| | - Sergio Orduz
- Universidad Nacional de Colombia, sede Medellín, Departamento de Biociencias, Facultad de Ciencias, Carrera 65 # 59A - 110, 050034, Medellín, Antioquia, Colombia
| | - Pablo Cardona
- Universidad Nacional de Colombia, sede Medellín, Departamento de Biociencias, Facultad de Ciencias, Carrera 65 # 59A - 110, 050034, Medellín, Antioquia, Colombia
| | - Andrés Orrego
- Universidad Nacional de Colombia, sede Medellín, Departamento de Ciencias de la Computación y de la Decisión, Facultad de Minas, Av. 80 # 65 - 223, 050041, Medellín, Antioquia, Colombia
| | - Jorge Muñoz-Pérez
- Universidad Nacional de Colombia, sede Medellín, Departamento de Biociencias, Facultad de Ciencias, Carrera 65 # 59A - 110, 050034, Medellín, Antioquia, Colombia
| | - John W Branch-Bedoya
- Universidad Nacional de Colombia, sede Medellín, Departamento de Ciencias de la Computación y de la Decisión, Facultad de Minas, Av. 80 # 65 - 223, 050041, Medellín, Antioquia, Colombia
| |
|
38
|
Susanty M, Mursalim MKN, Hertadi R, Purwarianti A, LE Rajab T. Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification. Comput Biol Chem 2024; 112:108163. [PMID: 39098138 DOI: 10.1016/j.compbiolchem.2024.108163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 07/02/2024] [Accepted: 07/24/2024] [Indexed: 08/06/2024]
Abstract
The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, functioning optimally in highly acidic environments, hold immense promise for various applications, including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap by employing in silico methods utilizing computational tools and machine learning. We propose a novel approach to predict acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with accuracy (0.91), F1-score (0.93), and Matthews correlation coefficient (0.76). To our knowledge, this is the first application of pre-trained PLM embeddings for acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
Affiliation(s)
- Meredita Susanty
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl Teuku Nyak Arief Jakarta Selatan DKI, Jakarta, Indonesia
| | - Muhammad Khaerul Naim Mursalim
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya Bukit Beruntung, Sei Panas, Batam, Kepulauan Riau 29456, Indonesia
| | - Rukman Hertadi
- Institut Teknologi Bandung Faculty of Math and Natural Sciences, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
| | - Ayu Purwarianti
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia
| | - Tati LE Rajab
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia.
| |
|
39
|
Yang S, Cheng P, Liu Y, Feng D, Wang S. Exploring the Knowledge of an Outstanding Protein to Protein Interaction Transformer. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1287-1298. [PMID: 38536676 DOI: 10.1109/tcbb.2024.3381825] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024]
Abstract
Protein-to-protein interaction (PPI) prediction aims to predict whether two given proteins interact or not. Compared with traditional experimental methods of high cost and low efficiency, the current deep learning based approach makes it possible to discover massive potential PPIs from large-scale databases. However, deep PPI prediction models perform poorly on unseen species, as their proteins are not in the training set. To address this issue, the paper first proposes PPITrans, a Transformer based PPI prediction model that exploits a language model pre-trained on proteins to conduct binary PPI prediction. To validate the effectiveness on unseen species, PPITrans is trained with Human PPIs and tested on PPIs of other species. Experimental results show that PPITrans significantly outperforms the previous state-of-the-art on various metrics, especially on PPIs of unseen species. For example, the AUPR improves by 0.339 in absolute terms on Fly PPIs. Aiming to explore the knowledge learned by PPITrans from PPI data, this paper also designs a series of probes belonging to three categories. Their results reveal several interesting findings: for example, although PPITrans cannot capture the spatial structure of proteins, it can obtain knowledge of PPI type and binding affinity, learning more than binary PPI.
|
40
|
Cai C, Li J, Xia Y, Li W. FluPMT: Prediction of Predominant Strains of Influenza A Viruses via Multi-Task Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1254-1263. [PMID: 38498763 DOI: 10.1109/tcbb.2024.3378468] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2024]
Abstract
Seasonal influenza vaccines play a crucial role in saving numerous lives annually. However, the constant evolution of the influenza A virus necessitates frequent vaccine updates to ensure its ongoing effectiveness. The decision to develop a new vaccine strain is generally based on the assessment of the current predominant strains. Nevertheless, the process of vaccine production and distribution is very time-consuming, leaving a window for the emergence of new variants that could decrease vaccine effectiveness, so predictions of influenza A virus evolution can inform vaccine evaluation and selection. Hence, we present FluPMT, a novel sequence prediction model that applies an encoder-decoder architecture to predict the hemagglutinin (HA) protein sequence of the upcoming season's predominant strain by capturing the patterns of evolution of influenza A viruses. Specifically, we employ time series to model the evolution of influenza A viruses, and utilize attention mechanisms to explore dependencies among residues of sequences. Additionally, antigenic distance prediction based on graph network representation learning is incorporated into the sequence prediction as an auxiliary task through a multi-task learning framework. Experimental results on two influenza datasets highlight the exceptional predictive performance of FluPMT, offering valuable insights into virus evolutionary dynamics, as well as vaccine evaluation and production.
|
41
|
Ji M, Kan Y, Kim D, Lee S, Yi G. DeepPI: Alignment-Free Analysis of Flexible Length Proteins Based on Deep Learning and Image Generator. Interdiscip Sci 2024; 16:1-12. [PMID: 38568406 DOI: 10.1007/s12539-024-00618-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 02/01/2024] [Accepted: 02/03/2024] [Indexed: 09/19/2024]
Abstract
With the rapid development of NGS technology, the number of protein sequences has increased exponentially. Computational methods have been introduced in protein functional studies because the analysis of large numbers of proteins through biological experiments is costly and time-consuming. In recent years, new approaches based on deep learning have been proposed to overcome the limitations of conventional methods. Although deep learning-based methods effectively utilize features of protein function, they are limited to sequences of fixed length and consider only information from adjacent amino acids. Therefore, new protein analysis tools that extract functional features from proteins of flexible length and train models are required. We introduce DeepPI, a deep learning-based tool for analyzing proteins in large-scale databases. The proposed model, which utilizes Global Average Pooling, is applied to proteins of flexible length and leads to reduced information loss compared to existing algorithms that use fixed sizes. The image generator converts a one-dimensional sequence into a distinct two-dimensional structure, which can extract common parts of various shapes. Finally, filtering techniques automatically detect representative data from the entire database and ensure coverage of large protein databases. We demonstrate that DeepPI has been successfully applied to large databases such as the Pfam-A database. Comparative experiments on four types of image generators illustrated the impact of structure on feature extraction. The filtering performance was verified by varying the parameter values and proved to be applicable to large databases. Compared to existing methods, DeepPI achieves higher family classification accuracy for protein function inference.
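The abstract above credits Global Average Pooling with letting one model handle proteins of different lengths. The sketch below illustrates only that general idea, not DeepPI's actual architecture: it pools a one-dimensional feature map (positions by channels) for simplicity, whereas the paper works on two-dimensional generated images. The function name and the toy feature maps are assumptions made for illustration.

```python
def global_average_pool(feature_map):
    """Average a (positions x channels) feature map over the position axis.

    The output length equals the number of channels, regardless of how
    many positions the input has.
    """
    if not feature_map:
        raise ValueError("feature map must contain at least one position")
    length = len(feature_map)
    n_channels = len(feature_map[0])
    return [sum(row[c] for row in feature_map) / length for c in range(n_channels)]


# Two hypothetical feature maps of different lengths, both with 2 channels.
short_map = [[1.0, 2.0], [3.0, 4.0]]
long_map = [[1.0, 0.0], [2.0, 0.0], [3.0, 3.0], [6.0, 1.0]]

print(global_average_pool(short_map))  # [2.0, 3.0]
print(global_average_pool(long_map))   # [3.0, 1.0]
```

Because both inputs reduce to a vector of the same size, a downstream classifier layer can accept them without padding or truncation.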
Affiliation(s)
- Mingeun Ji
- Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
| | - Yejin Kan
- Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
| | - Dongyeon Kim
- Department of Artificial Intelligence, Dongguk University, Seoul, 04620, Korea
| | - Seungmin Lee
- Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea
| | - Gangman Yi
- Department of Multimedia Engineering, Dongguk University, Seoul, 04620, Korea.
- Department of Artificial Intelligence, Dongguk University, Seoul, 04620, Korea.
- Division of AI Software Convergence, Dongguk University, Seoul, 04620, Korea.
| |
|
42
|
Zhou J, Huang M. Navigating the landscape of enzyme design: from molecular simulations to machine learning. Chem Soc Rev 2024; 53:8202-8239. [PMID: 38990263 DOI: 10.1039/d4cs00196f] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/12/2024]
Abstract
Global environmental issues and sustainable development call for new technologies for fine chemical synthesis and waste valorization. Biocatalysis has attracted great attention as the alternative to the traditional organic synthesis. However, it is challenging to navigate the vast sequence space to identify those proteins with admirable biocatalytic functions. The recent development of deep-learning based structure prediction methods such as AlphaFold2 reinforced by different computational simulations or multiscale calculations has largely expanded the 3D structure databases and enabled structure-based design. While structure-based approaches shed light on site-specific enzyme engineering, they are not suitable for large-scale screening of potential biocatalysts. Effective utilization of big data using machine learning techniques opens up a new era for accelerated predictions. Here, we review the approaches and applications of structure-based and machine-learning guided enzyme design. We also provide our view on the challenges and perspectives on effectively employing enzyme design approaches integrating traditional molecular simulations and machine learning, and the importance of database construction and algorithm development in attaining predictive ML models to explore the sequence fitness landscape for the design of admirable biocatalysts.
Affiliation(s)
- Jiahui Zhou
- School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK.
| | - Meilan Huang
- School of Chemistry and Chemical Engineering, Queen's University, David Keir Building, Stranmillis Road, Belfast BT9 5AG, Northern Ireland, UK.
| |
|
43
|
Nambiar A, Forsyth JM, Liu S, Maslov S. DR-BERT: A protein language model to annotate disordered regions. Structure 2024; 32:1260-1268.e3. [PMID: 38701796 DOI: 10.1016/j.str.2024.04.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 06/16/2023] [Accepted: 04/08/2024] [Indexed: 05/05/2024]
Abstract
Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT's ability to use contextual information.
Affiliation(s)
- Ananthan Nambiar
- Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA.
| | - John Malcolm Forsyth
- Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Simon Liu
- Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
| | - Sergei Maslov
- Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, Urbana, IL 61801, USA; Department of Physics, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, IL 60439, USA.
| |
|
44
|
Ghazikhani H, Butler G. Exploiting protein language models for the precise classification of ion channels and ion transporters. Proteins 2024; 92:998-1055. [PMID: 38656743 DOI: 10.1002/prot.26694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2023] [Revised: 03/26/2024] [Accepted: 04/08/2024] [Indexed: 04/26/2024]
Abstract
This study introduces TooT-PLM-ionCT, a comprehensive framework that consolidates three distinct systems, each meticulously tailored for one of the following tasks: distinguishing ion channels (ICs) from membrane proteins (MPs), segregating ion transporters (ITs) from MPs, and differentiating ICs from ITs. Drawing upon the strengths of six Protein Language Models (PLMs)-ProtBERT, ProtBERT-BFD, ESM-1b, ESM-2 (650M parameters), and ESM-2 (15B parameters), TooT-PLM-ionCT employs a combination of traditional classifiers and deep learning models for nuanced protein classification. Originally validated on an existing dataset by previous researchers, our systems demonstrated superior performance in identifying ITs from MPs and distinguishing ICs from ITs, with the IC-MP discrimination achieving state-of-the-art results. In light of recommendations for additional validation, we introduced a new dataset, significantly enhancing the robustness and generalization of our models across bioinformatics challenges. This new evaluation underscored the effectiveness of TooT-PLM-ionCT in adapting to novel data while maintaining high classification accuracy. Furthermore, this study explores critical factors affecting classification accuracy, such as dataset balancing, the impact of using frozen versus fine-tuned PLM representations, and the variance between half and full precision in floating-point computations. To facilitate broader application and accessibility, a web server (https://tootsuite.encs.concordia.ca/service/TooT-PLM-ionCT) has been developed, allowing users to evaluate unknown protein sequences through our specialized systems for IC-MP, IT-MP, and IC-IT classification tasks.
Affiliation(s)
- Hamed Ghazikhani
- Department of Computer Science and Software Engineering, Concordia University, Montréal, Québec, Canada
| | - Gregory Butler
- Centre for Structural and Functional Genomics, Concordia University, Montréal, Québec, Canada
| |
|
45
|
Ryan V WG, Imami AS, Ali Sajid H, Vergis J, Zhang X, Meller J, Shukla R, McCullumsmith R. Interpreting and visualizing pathway analyses using embedding representations with PAVER. Bioinformation 2024; 20:700-704. [PMID: 39309552 PMCID: PMC11414338 DOI: 10.6026/973206300200700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2024] [Revised: 07/31/2024] [Accepted: 07/31/2024] [Indexed: 09/25/2024] Open
Abstract
Omics studies use large-scale high-throughput data to explain changes underlying different traits or conditions. However, omics analysis often results in long lists of pathways that are difficult to interpret. Therefore, it is of interest to describe a tool named PAVER (Pathway Analysis Visualization with Embedding Representations) for large scale genomic analysis. PAVER curates similar pathways into groups, identifies the pathway most representative of each group, and provides publication-ready intuitive visualizations. PAVER clusters pathways defined by their vector embedding representations and then identifies the term most cosine similar to its respective cluster's average embedding. PAVER can integrate multiple pathway analyses, highlight relevant biological insights, and work with any pathway database.
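The abstract above describes picking, for each cluster, the term whose embedding is most cosine-similar to the cluster's average embedding. The following is a minimal sketch of that selection step only, assuming the clustering has already been done. It is not PAVER's code: the function names, the pathway names, and the two-dimensional vectors are hypothetical placeholders.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def representative_term(cluster):
    """Return the term whose embedding is closest to the cluster mean.

    `cluster` maps each term name to its embedding vector.
    """
    vectors = list(cluster.values())
    dim = len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]
    return max(cluster, key=lambda term: cosine_similarity(cluster[term], mean))


# Hypothetical cluster with made-up 2-D embeddings, for illustration only.
cluster = {
    "glycolysis": [1.0, 0.0],
    "gluconeogenesis": [0.8, 0.6],
    "pentose phosphate pathway": [0.0, 1.0],
}
print(representative_term(cluster))  # gluconeogenesis
```

Using the mean embedding as the reference point means the chosen label is an existing pathway name rather than a synthetic summary, which keeps the resulting figure interpretable.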
Affiliation(s)
- William G Ryan V
- Department of Neurosciences, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
| | - Ali Sajid Imami
- Department of Neurosciences, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
| | - Hunter Ali Sajid
- Department of Neurosciences, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
| | - John Vergis
- Department of Neurosciences, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
| | - Xiaolu Zhang
- Department of Microbiology and Immunology, Louisiana State University Health Sciences Center, Shreveport, LA, USA
| | - Jarek Meller
- Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, OH, USA
- Department of Computer Science, University of Cincinnati, Cincinnati, OH, USA
- Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- Department of Informatics, Nicolaus Copernicus University, Torun, Poland
| | - Rammohan Shukla
- Department of Zoology & Physiology, College of Agriculture, Life Sciences and Natural Resources, University of Wyoming, Laramie, WY, USA
| | - Robert McCullumsmith
- Department of Neurosciences, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
- Neurosciences Institute, ProMedica, Toledo, OH, USA
- Department of Psychiatry, College of Medicine and Life Sciences, University of Toledo, Toledo, OH, USA
| |
|
46
|
Cosentino S, Sriswasdi S, Iwasaki W. SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models. Genome Biol 2024; 25:195. [PMID: 39054525 PMCID: PMC11270883 DOI: 10.1186/s13059-024-03298-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2023] [Accepted: 06/04/2024] [Indexed: 07/27/2024] Open
Abstract
Accurate inference of orthologous genes constitutes a prerequisite for comparative and evolutionary genomics. SonicParanoid is one of the fastest tools for orthology inference; however, its scalability and accuracy have been hampered by time-consuming all-versus-all alignments and the existence of proteins with complex domain architectures. Here, we present a substantial update of SonicParanoid, where a gradient boosting predictor halves the execution time and a language model doubles the recall. Application to empirical large-scale and standardized benchmark datasets shows that SonicParanoid2 is much faster than comparable methods and also the most accurate. SonicParanoid2 is available at https://gitlab.com/salvo981/sonicparanoid2 and https://zenodo.org/doi/10.5281/zenodo.11371108.
Affiliation(s)
- Salvatore Cosentino
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan
| | - Sira Sriswasdi
- Center of Excellence in Computational Molecular Biology, Faculty of Medicine, Chulalongkorn University, Bangkok, Thailand
| | - Wataru Iwasaki
- Department of Integrated Biosciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan.
- Department of Biological Sciences, Graduate School of Science, the University of Tokyo, Bunkyo-ku, Japan.
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, the University of Tokyo, Kashiwa, Japan.
- Atmosphere and Ocean Research Institute, the University of Tokyo, Kashiwa, Japan.
- Institute for Quantitative Biosciences, the University of Tokyo, Bunkyo-ku, Japan.
- Collaborative Research Institute for Innovative Microbiology, the University of Tokyo, Bunkyo-ku, Japan.
| |
|
47
|
Jia Q, Xia Y, Dong F, Li W. MetaFluAD: meta-learning for predicting antigenic distances among influenza viruses. Brief Bioinform 2024; 25:bbae395. [PMID: 39129362 PMCID: PMC11317534 DOI: 10.1093/bib/bbae395] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/11/2024] [Revised: 06/24/2024] [Accepted: 07/27/2024] [Indexed: 08/13/2024] Open
Abstract
Influenza viruses rapidly evolve to evade previously acquired human immunity. Maintaining vaccine efficacy necessitates continuous monitoring of antigenic differences among strains. Traditional serological methods for assessing these differences are labor-intensive and time-consuming, highlighting the need for efficient computational approaches. This paper proposes MetaFluAD, a meta-learning-based method designed to predict quantitative antigenic distances among strains. This method models antigenic relationships between strains, represented by their hemagglutinin (HA) sequences, as a weighted attributed network. Employing a graph neural network (GNN)-based encoder combined with a robust meta-learning framework, MetaFluAD learns comprehensive strain representations within a unified space encompassing both antigenic and genetic features. Furthermore, the meta-learning framework enables knowledge transfer across different influenza subtypes, allowing MetaFluAD to achieve remarkable performance with limited data. MetaFluAD demonstrates excellent performance and overall robustness across various influenza subtypes, including A/H3N2, A/H1N1, A/H5N1, B/Victoria, and B/Yamagata. MetaFluAD synthesizes the strengths of GNN-based encoding and meta-learning to offer a promising approach for accurate antigenic distance prediction. Additionally, MetaFluAD can effectively identify dominant antigenic clusters within seasonal influenza viruses, aiding in the development of effective vaccines and efficient monitoring of viral evolution.
Affiliation(s)
- Qitao Jia
  - School of Information Science and Engineering, Yunnan University, Kunming 650500, China
- Yuanling Xia
  - State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University, Kunming 650500, China
- Fanglin Dong
  - School of Information Science and Engineering, Yunnan University, Kunming 650500, China
- Weihua Li
  - School of Information Science and Engineering, Yunnan University, Kunming 650500, China
48
Ranjan A, Bess A, Alvin C, Mukhopadhyay S. MDF-DTA: A Multi-Dimensional Fusion Approach for Drug-Target Binding Affinity Prediction. J Chem Inf Model 2024; 64:4980-4990. [PMID: 38888163 PMCID: PMC11234358 DOI: 10.1021/acs.jcim.4c00310] [Received: 02/22/2024] [Revised: 05/15/2024] [Accepted: 05/29/2024] [Indexed: 06/20/2024]
Abstract
Drug-target affinity (DTA) prediction is an important task in the early stages of drug discovery. Traditional biological approaches are time-consuming, effort-consuming, and resource-consuming due to the large size of genomic and chemical spaces. Computational approaches using machine learning have emerged to narrow down the drug candidate search space. However, most of these prediction models focus on single feature encoding of drugs and targets, ignoring the importance of integrating different dimensions of these features. We propose a deep learning-based approach called Multi-Dimensional Fusion for Drug Target Affinity Prediction (MDF-DTA) incorporating different dimensional features. Our model fuses 1D, 2D, and 3D representations obtained from different pretrained models for both drugs and targets. We evaluated MDF-DTA on two standard benchmark data sets: DAVIS and KIBA. Experimental results show that MDF-DTA outperforms many state-of-the-art techniques in the DTA task across both data sets. Through ablation studies and performance evaluation metrics, we evaluate the importance of individual representations and the impact of each representation on MDF-DTA.
Affiliation(s)
- Amit Ranjan
  - Department of Environmental Sciences, Louisiana State University, Baton Rouge, Louisiana 70803, United States
- Adam Bess
  - Department of Environmental Sciences, Louisiana State University, Baton Rouge, Louisiana 70803, United States
- Chris Alvin
  - Department of Computer Science, Furman University, Greenville, South Carolina 29613, United States
- Supratik Mukhopadhyay
  - Department of Environmental Sciences, Louisiana State University, Baton Rouge, Louisiana 70803, United States
49
Banerjee P, Eulenstein O, Friedberg I. Discovering genomic islands in unannotated bacterial genomes using sequence embedding. Bioinformatics Advances 2024; 4:vbae089. [PMID: 38911822 PMCID: PMC11193100 DOI: 10.1093/bioadv/vbae089] [Received: 04/19/2024] [Revised: 05/26/2024] [Accepted: 06/11/2024] [Indexed: 06/25/2024]
Abstract
Motivation: Genomic islands (GEIs) are clusters of genes in bacterial genomes that are typically acquired by horizontal gene transfer. GEIs play a crucial role in the evolution of bacteria by rapidly introducing genetic diversity and thus helping them adapt to changing environments. Specifically of interest to human health, many GEIs contain pathogenicity and antimicrobial resistance genes. Detecting GEIs is, therefore, an important problem in biomedical and environmental research. There have been many previous studies for computationally identifying GEIs. Still, most of these studies rely on detecting anomalies in the unannotated nucleotide sequences or on a fixed set of known features on annotated nucleotide sequences.
Results: Here, we present TreasureIsland, which uses a new unsupervised representation of DNA sequences to predict GEIs. We developed a high-precision boundary detection method featuring an incremental fine-tuning of GEI borders, and we evaluated the accuracy of this framework using a new comprehensive reference dataset, Benbow. We show that TreasureIsland's accuracy rivals other GEI predictors, enabling efficient and faster identification of GEIs in unannotated bacterial genomes.
Availability and implementation: TreasureIsland is available under an MIT license at: https://github.com/FriedbergLab/GenomicIslandPrediction.
Affiliation(s)
- Priyanka Banerjee
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Oliver Eulenstein
  - Department of Computer Science, Iowa State University, Ames, IA 50011, United States
- Iddo Friedberg
  - Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, United States
50
Zhang B, Hou Z, Yang Y, Wong KC, Zhu H, Li X. SOFB is a comprehensive ensemble deep learning approach for elucidating and characterizing protein-nucleic-acid-binding residues. Commun Biol 2024; 7:679. [PMID: 38830995 PMCID: PMC11148103 DOI: 10.1038/s42003-024-06332-0] [Received: 05/23/2023] [Accepted: 05/15/2024] [Indexed: 06/05/2024] Open
Abstract
Proteins and nucleic-acids are essential components of living organisms that interact in critical cellular processes. Accurate prediction of nucleic acid-binding residues in proteins can contribute to a better understanding of protein function. However, the discrepancy between protein sequence information and obtained structural and functional data renders most current computational models ineffective. Therefore, it is vital to design computational models based on protein sequence information to identify nucleic acid binding sites in proteins. Here, we implement an ensemble deep learning model-based nucleic-acid-binding residues on proteins identification method, called SOFB, which characterizes protein sequences by learning the semantics of biological dynamics contexts, and then develop an ensemble deep learning-based sequence network to learn feature representation and classification by explicitly modeling dynamic semantic information. Among them, the language learning model, which is constructed from natural language to biological language, captures the underlying relationships of protein sequences, and the ensemble deep learning-based sequence network consisting of different convolutional layers together with Bi-LSTM refines various features for optimal performance. Meanwhile, to address the imbalanced issue, we adopt ensemble learning to train multiple models and then incorporate them. Our experimental results on several DNA/RNA nucleic-acid-binding residue datasets demonstrate that our proposed model outperforms other state-of-the-art methods. In addition, we conduct an interpretability analysis of the identified nucleic acid binding residue sequences based on the attention weights of the language learning model, revealing novel insights into the dynamic semantic information that supports the identified nucleic acid binding residues. 
SOFB is available at https://github.com/Encryptional/SOFB and https://figshare.com/articles/online_resource/SOFB_figshare_rar/25499452.
Affiliation(s)
- Bin Zhang
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Zilong Hou
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Yuning Yang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Canada
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Hong Kong, Hong Kong SAR
- Haoran Zhu
  - School of Artificial Intelligence, Jilin University, Changchun, China
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, China