1
Liu J, Li K, Tang X, Zhang Y, Guan X. Grain protein function prediction based on improved FCN and bidirectional LSTM. Food Chem 2025; 482:143955. [PMID: 40209386 DOI: 10.1016/j.foodchem.2025.143955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 03/10/2025] [Accepted: 03/17/2025] [Indexed: 04/12/2025]
Abstract
With the development of high-throughput sequencing technologies, predicting grain protein function from amino acid sequences with intelligent models has become an important task in bioinformatics. Soybean, maize, indica, and japonica are selected from UniProtKB as the grain dataset. To address the neglect of amino acid order and of long-term dependencies between amino acids, this paper proposes the PBiLSTM-FCN model for predicting grain protein function. The order of amino acids in a sequence is captured by the Fully Convolutional Network (FCN), and long-term dependencies between amino acids are handled by the bidirectional Long Short-Term Memory network (BiLSTM). Experimental results show that the PBiLSTM-FCN model outperforms existing models and predicts more accurately because it captures both long-range dependencies and amino acid order. Finally, interpretability analyses that compare actual with predicted protein functions demonstrate the effectiveness of the PBiLSTM-FCN model in predicting grain protein function.
Affiliation(s)
- Jing Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Kun Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xinghua Tang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Yu Zhang
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; National Grain Industry (Urban Grain and Oil Security) Technology Innovation Center, Shanghai 200093, China
- Xiao Guan
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; National Grain Industry (Urban Grain and Oil Security) Technology Innovation Center, Shanghai 200093, China
2
Le VT, Yuune JPT, Vu TTP, Malik MS, Ou YY. DeepCR: predicting cytokine receptor proteins through pretrained language models and deep learning networks. J Biomol Struct Dyn 2025:1-18. [PMID: 40448687 DOI: 10.1080/07391102.2025.2512448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Accepted: 05/21/2025] [Indexed: 06/02/2025]
Abstract
Cytokine receptors play a pivotal role in mediating the immune response and are critical in cytokine storms, which underlie the pathogenesis of conditions such as acute respiratory distress syndrome (ARDS) and autoimmune disorders. Identifying cytokine receptors is essential for understanding their biological functions, exploring therapeutic targets, and guiding clinical interventions. Traditional biochemical methods to identify cytokine receptors are labor-intensive, costly, and time-consuming, prompting the need for more efficient alternatives. Recent advances in computational biology have enabled the use of machine learning to classify cytokine receptor proteins. Most existing approaches focused on homologous features and protein composition to classify cytokine families, but no dedicated studies have been conducted on cytokine receptor proteins. This gap presents an opportunity to develop a method specifically for classifying cytokine receptors among other membrane proteins. In this study, we present a novel classification framework combining pre-trained language models (PLMs) with a multi-window convolutional neural network (mCNN) architecture for the fast and accurate identification of cytokine receptor proteins. PLMs, such as ProtTrans and ESM variants, capture biochemical context directly from raw protein sequences, while mCNN efficiently extracts local and global sequence patterns using convolutional layers with varying window sizes. Our model achieved an AUC of 0.96 in the training as well as 0.97 and 0.93 in two independent tests, demonstrating its effectiveness in distinguishing cytokine receptors from non-cytokine receptor proteins. By eliminating the need for manual feature extraction, this approach offers a robust and scalable solution for protein classification, paving the way for its application in drug discovery and understanding cytokine-mediated diseases.
Affiliation(s)
- Van The Le
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Thi Thu Phuong Vu
- Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li, Taiwan
- Muhammad Shahid Malik
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Department of Computer Sciences, Karakoram International University, Gilgit-Baltistan, Pakistan
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, Taiwan
- Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li, Taiwan
3
Kong D, Qian J, Gao C, Wang Y, Shi T, Ye C. Machine Learning Empowering Microbial Cell Factory: A Comprehensive Review. Appl Biochem Biotechnol 2025:10.1007/s12010-025-05260-x. [PMID: 40397295 DOI: 10.1007/s12010-025-05260-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/02/2025] [Indexed: 05/22/2025]
Abstract
The wide application of machine learning has opened new possibilities for biological manufacturing, and its combination with synthetic biology has created substantial, hard-to-predict value for upgrading microbial cell factories. This review examines the synergies between machine learning and synthetic biology and identifies research directions worth investigating in biotechnology. We explore relevant databases, toolboxes, and machine learning-derived models. Furthermore, we examine specific applications of this combined approach in chemical production, human health, and environmental remediation. By elucidating these successful integrations, this review aims to provide valuable guidance for future research at the intersection of biomanufacturing and artificial intelligence.
Affiliation(s)
- Dechun Kong
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Jinyi Qian
- Ministry of Education Key Laboratory of NSLSCS, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Cong Gao
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, People's Republic of China
- Yuetong Wang
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Tianqiong Shi
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- State Key Laboratory of Microbial Technology, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Chao Ye
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Ministry of Education Key Laboratory of NSLSCS, Nanjing Normal University, Nanjing, 210023, People's Republic of China
4
Cui XC, Zheng Y, Liu Y, Yuchi Z, Yuan YJ. AI-driven de novo enzyme design: Strategies, applications, and future prospects. Biotechnol Adv 2025; 82:108603. [PMID: 40368118 DOI: 10.1016/j.biotechadv.2025.108603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2025] [Revised: 04/22/2025] [Accepted: 05/10/2025] [Indexed: 05/16/2025]
Abstract
Enzymes are indispensable for biological processes and diverse applications across industries. While top-down modification strategies, such as directed evolution, have achieved remarkable success in optimizing existing enzymes, bottom-up de novo enzyme design has emerged as a transformative approach for engineering novel enzymes with customized catalytic functions, independent of natural templates. Recent advancements in artificial intelligence (AI) and computational power have significantly accelerated this field, enabling breakthroughs in enzyme engineering. These technologies facilitate the rapid generation of enzyme structures and amino acid sequences optimized for specific functions, thereby enhancing design efficiency. They also support functional validation and activity optimization, improving the catalytic performance, stability, and robustness of de novo designed enzymes. This review highlights recent advancements in AI-driven de novo enzyme design, discusses strategies for validation and optimization, and examines the challenges and future prospects of integrating these technologies into enzyme development.
Affiliation(s)
- Xi-Chen Cui
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
- Yan Zheng
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
- Ye Liu
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China; School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, PR China
- Zhiguang Yuchi
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China; School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, PR China
- Ying-Jin Yuan
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
5
Shao J, Chen J, Liu B. ProFun-SOM: Protein Function Prediction for Specific Ontology Based on Multiple Sequence Alignment Reconstruction. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:8060-8071. [PMID: 38980781 DOI: 10.1109/tnnls.2024.3419250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2024]
Abstract
Protein function prediction is crucial for understanding species evolution, including viral mutations. Gene ontology (GO) is a standardized representation framework for describing protein functions with annotated terms. Each ontology is a specific functional category containing multiple child ontologies, and the relationships between parent and child ontologies form a directed acyclic graph. GO divides protein functions into three main groups: cellular component ontology, molecular function ontology, and biological process ontology. Therefore, the GO annotation of proteins is a hierarchical multilabel classification problem. This hierarchical relationship introduces complexities such as the mixed ontology problem, leading to performance bottlenecks in existing computational methods due to label dependency and data sparsity. To overcome the bottlenecks caused by the mixed ontology problem, we propose ProFun-SOM, an innovative multilabel classifier that utilizes multiple sequence alignments (MSAs) to accurately annotate gene ontologies. ProFun-SOM enhances the initial MSAs through a reconstruction process and integrates them into a deep learning architecture. It then predicts annotations within the cellular component, molecular function, biological process, and mixed ontologies. Our evaluation results on three datasets (CAFA3, SwissProt, and NetGO2) demonstrate that ProFun-SOM surpasses state-of-the-art methods. This study confirmed that utilizing MSAs of proteins can effectively overcome the two main bottleneck issues, label dependency and data sparsity, thereby alleviating the root problem, mixed ontology. A freely accessible web server is available at http://bliulab.net/ProFun-SOM/.
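The hierarchical label structure described in this abstract can be illustrated with a small sketch. It is not the ProFun-SOM method; it only shows the common GO convention that annotating a protein with a term implies all of that term's ancestors in the directed acyclic graph. The toy ontology, its term names, and the helper functions `ancestors` and `propagate` are hypothetical and chosen for illustration.

```python
from typing import Dict, Set

# Hypothetical toy ontology: child term -> set of direct parent terms.
# Term names are illustrative placeholders, not real GO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "ion_binding": {"binding"},
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "metal_enzyme": {"ion_binding", "catalysis"},
}


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every term reachable from `term` by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(labels: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Close a label set upward so each term is accompanied by its ancestors."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


if __name__ == "__main__":
    print(sorted(propagate({"metal_enzyme"}, PARENTS)))
```

Because one leaf annotation expands into several correlated labels, the targets of a multilabel classifier over such a graph are not independent, which is the label dependency the abstract refers to.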
6
de Oliveira GB, Pedrini H, Dias Z. SUPERMAGO: Protein Function Prediction Based on Transformer Embeddings. Proteins 2025; 93:981-996. [PMID: 39711079 DOI: 10.1002/prot.26782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 11/28/2024] [Accepted: 12/09/2024] [Indexed: 12/24/2024]
Abstract
Recent technological advancements have enabled the experimental determination of amino acid sequences for numerous proteins. However, analyzing protein functions, which is essential for understanding their roles within cells, remains a challenging task due to the associated costs and time constraints. To address this challenge, various computational approaches have been proposed to aid in the categorization of protein functions, mainly utilizing amino acid sequences. In this study, we introduce SUPERMAGO, a method that leverages amino acid sequences to predict protein functions. Our approach employs Transformer architectures, pre-trained on protein data, to extract features from the sequences. We use multilayer perceptrons for classification and a stacking neural network to aggregate the predictions, which significantly enhances the performance of our method. We also present SUPERMAGO+, an ensemble of SUPERMAGO and DIAMOND, based on neural networks that assign different weights to each term, offering a novel weighting mechanism compared with existing methods in the literature. Additionally, we introduce SUPERMAGO+Web, a web server-compatible version of SUPERMAGO+ designed to operate with reduced computational resources. Both SUPERMAGO and SUPERMAGO+ consistently outperformed state-of-the-art approaches in our evaluations, establishing them as leading methods for this task when considering only amino acid sequence information.
Affiliation(s)
- Helio Pedrini
- Institute of Computing, University of Campinas, Campinas, Brazil
- Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
7
Zancolli G, Modica MV, Puillandre N, Kantor Y, Barua A, Campli G, Robinson-Rechavi M. Redistribution of Ancestral Functions Underlies the Evolution of Venom Production in Marine Predatory Snails. Mol Biol Evol 2025; 42:msaf095. [PMID: 40279537 PMCID: PMC12075767 DOI: 10.1093/molbev/msaf095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 03/21/2025] [Accepted: 04/17/2025] [Indexed: 04/27/2025] Open
Abstract
Venom-secreting glands are highly specialized organs evolved throughout the animal kingdom to synthetize and secrete toxins for predation and defense. Venom is extensively studied for its toxin components and application potential; yet, how animals become venomous remains poorly understood. Venom systems therefore offer a unique opportunity to understand the molecular mechanisms underlying functional innovation. Here, we conducted a multispecies multi-tissue comparative transcriptomics analysis of 12 marine predatory gastropod species, including species with venom glands and species with homologous non-venom-producing glands, to examine how specialized functions evolve through gene expression changes. We found that while the venom gland specialized for the mass production of toxins, its homologous glands retained the ancestral digestive functions. The functional divergence and specialization of the venom gland were achieved through a redistribution of its ancestral digestive functions to other organs, specifically the esophagus. This entailed concerted expression changes and accelerated transcriptome evolution across the entire digestive system. The increase in venom gland secretory capacity was achieved through the modulation of an ancient secretory machinery, particularly genes involved in endoplasmic reticulum stress and unfolded protein response. This study shifts the focus from the well-explored evolution of toxins to the lesser-known evolution of the organ and mechanisms responsible for venom production. As such, it contributes to elucidating the molecular mechanisms underlying organ evolution at a fine evolutionary scale, highlighting the specific events that lead to functional divergence.
Affiliation(s)
- Giulia Zancolli
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Evolutionary Bioinformatics, Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Maria Vittoria Modica
- Department of Biology and Evolution of Marine Organisms, Stazione Zoologica Anton Dohrn, 00198 Roma, Italy
- Nicolas Puillandre
- Institut Systématique Evolution Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, 75005 Paris, France
- Yuri Kantor
- Severtsov Institute of Ecology and Evolution, Russian Academy of Sciences, 119034 Moscow, Russian Federation
- Agneesh Barua
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Evolutionary Bioinformatics, Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Giulia Campli
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Evolutionary Bioinformatics, Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Evolutionary Bioinformatics, Swiss Institute of Bioinformatics, Lausanne, Switzerland
8
Yang H, He G, Zhang M, Fu H, He G, Wang C, Liu Y, Zhang S, Wang T, He YO, Cheng L. OntoTiger: a platform of ontology-based application tools for integrative biomedical exploration. Nucleic Acids Res 2025:gkaf337. [PMID: 40297993 DOI: 10.1093/nar/gkaf337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2025] [Revised: 03/27/2025] [Accepted: 04/16/2025] [Indexed: 04/30/2025] Open
Abstract
Biomedical ontologies, such as Gene Ontology (GO), Disease Ontology (DO), and the Human Phenotype Ontology (HPO), have been extensively applied to characterize molecular roles and their semantic relationships in biomedical research and clinical practice. Although numerous algorithms have been developed to quantify relationships between ontology terms or to explore molecular functions, the absence of a comprehensive tool to integrate these algorithms has limited effective ontology applications. To address this, we developed OntoTiger, a platform of Ontology-based application Tools for InteGrativE biomedical exploRation. OntoTiger combines >20 classic algorithms, supporting six prevalent molecular types as well as five widespread biomedical ontologies. The platform comprises four modules: (i) Annotation module, which qualifies the relationships between ontology terms and molecules; (ii) Similarity module, quantifying functional similarity between/across pairwise ontology terms or between molecules; (iii) Prediction module, characterizing the molecular roles from an ontological perspective; and (iv) Enrichment module, elucidating the potential biological significance of a particular list of molecules. OntoTiger provides a freely accessible, user-friendly web server dedicated to enabling one-stop ontology-based applications and is freely available at https://bio-computing.hrbmu.edu.cn/OntoTiger.
Affiliation(s)
- Haixiu Yang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Guoyou He
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Meiyi Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Hongyu Fu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Guanzhi He
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Chao Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Yangyang Liu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Sainan Zhang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- Tao Wang
- School of Computer Science, Northwestern Polytechnical University, 1 Dongxiang Rd, Xi'an 710072, China
- Yongqun Oliver He
- Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI 48109, United States
- Liang Cheng
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin 150081, Heilongjiang, China
- National Health Commission (NHC) Key Laboratory of Molecular Probes and Targeted Diagnosis and Therapy, Harbin Medical University, Harbin 150028, Heilongjiang, China
9
Zhang H, Sun Y, Wang Y, Luo X, Liu Y, Chen B, Jin X, Zhu D. GTPLM-GO: Enhancing Protein Function Prediction Through Dual-Branch Graph Transformer and Protein Language Model Fusing Sequence and Local-Global PPI Information. Int J Mol Sci 2025; 26:4088. [PMID: 40362328 PMCID: PMC12072039 DOI: 10.3390/ijms26094088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2025] [Revised: 04/21/2025] [Accepted: 04/23/2025] [Indexed: 05/15/2025] Open
Abstract
Currently, protein-protein interaction (PPI) networks have become an essential data source for protein function prediction. However, methods utilizing graph neural networks (GNNs) face significant challenges in modeling PPI networks. A primary issue is over-smoothing, which occurs when multiple GNN layers are stacked to capture global information. This architectural limitation inherently impairs the integration of local and global information within PPI networks, thereby limiting the accuracy of protein function prediction. To effectively utilize information within PPI networks, we propose GTPLM-GO, a protein function prediction method based on a dual-branch Graph Transformer and protein language model. The dual-branch Graph Transformer achieves the collaborative modeling of local and global information in PPI networks through two branches: a graph neural network and a linear attention-based Transformer encoder. GTPLM-GO integrates local-global PPI information with the functional semantic encoding constructed by the protein language model, overcoming the issue of inadequate information extraction in existing methods. Experimental results demonstrate that GTPLM-GO outperforms advanced network-based and sequence-based methods on PPI network datasets of varying scales.
Affiliation(s)
- Haotian Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Yundong Sun
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yansong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Xiaoling Luo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
- Bin Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
- Dongjie Zhu
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
10
Zhao K, Ji Z, Zhang L, Quan N, Li Y, Yu G, Bi X. HPOseq: a deep ensemble model for predicting the protein-phenotype relationships based on protein sequences. BMC Bioinformatics 2025; 26:110. [PMID: 40263997 PMCID: PMC12013097 DOI: 10.1186/s12859-025-06122-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Accepted: 03/27/2025] [Indexed: 04/24/2025] Open
Abstract
BACKGROUND: Understanding the relationships between proteins and specific disease phenotypes contributes to the early detection of diseases and advances the development of personalized medicine. The acquisition of a large amount of proteomics data has facilitated this process. To improve discovery efficiency and reduce the time and financial costs associated with biological experiments, various computational methods have yielded promising results. However, the lack of rich and reliable protein-related information still presents challenges in this process. RESULTS: In this paper, we propose an ensemble prediction model, named HPOseq, which predicts human protein-phenotype relationships based only on sequence information. HPOseq establishes two base models to achieve this objective. One directly extracts internal information from amino acid sequences as protein features to predict the associated phenotypes. The other builds a protein-protein network based on sequence similarity, extracting information between proteins for phenotype prediction. Ultimately, an ensemble module is employed to integrate the predictions from both base models, resulting in the final prediction. CONCLUSION: The results of 5-fold cross-validation reveal that HPOseq outperforms seven baseline methods for predicting protein-phenotype relationships. Moreover, we conduct case studies from the perspectives of phenotype annotation and protein analysis to verify the practical significance of HPOseq.
Affiliation(s)
- Kai Zhao
- School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Zhuocheng Ji
- School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Linlin Zhang
- School of Software, Xinjiang University, Urumqi, 830011, China
- Na Quan
- School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Yuheng Li
- School of Computer Science and Technology, Xinjiang University, Urumqi, 830011, China
- Guanglei Yu
- College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Xuehua Bi
- College of Medical Engineering and Technology, Xinjiang Medical University, Urumqi, 830011, China
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
11
Wang J, Chen J, Hu Y, Song C, Li X, Qian Y, Deng L. DeepMFFGO: A Protein Function Prediction Method for Large-Scale Multifeature Fusion. J Chem Inf Model 2025; 65:3841-3853. [PMID: 40116538 DOI: 10.1021/acs.jcim.5c00062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/23/2025]
Abstract
Protein functional studies are crucial in the fields of drug target discovery and drug design. However, the existing methods have significant bottlenecks in utilizing multisource data fusion and Gene Ontology (GO) hierarchy. To this end, this study innovatively proposes the DeepMFFGO model designed for protein function prediction under large-scale multifeature fusion. A fine-tuning strategy using intermediate-level feature selection is proposed to reduce redundancy in protein sequences and mitigate distortion of the top-level features. A hierarchical progressive fusion structure is designed to explore feature connections, optimize complementarity through dynamic weight allocation, and reduce redundant interference. On the CAFA3 data set, the Fmax values of the DeepMFFGO model on the MF, BP, and CC ontologies reach 0.702, 0.599, and 0.704, respectively, which are improved by 4.2%, 2.4%, and 0.07%, respectively, compared with state-of-the-art multisource methods.
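The Fmax values quoted in this abstract refer to a threshold-swept F-measure. The sketch below assumes the protein-centric definition commonly used in CAFA-style evaluations: at each threshold, precision is averaged over proteins with at least one prediction, recall is averaged over all proteins, and the best resulting F1 is reported. The function name `fmax`, its input layout, and the 100-step threshold grid are assumptions for illustration, not the evaluation code used by the authors.

```python
from typing import Dict, Set


def fmax(
    y_true: Dict[str, Set[str]],
    y_score: Dict[str, Dict[str, float]],
    steps: int = 100,
) -> float:
    """Protein-centric Fmax over thresholds 1/steps, 2/steps, ..., 1.

    y_true maps each protein to its non-empty set of true terms.
    y_score maps each protein to {term: predicted score in [0, 1]}.
    """
    best = 0.0
    for i in range(1, steps + 1):
        threshold = i / steps
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in y_true.items():
            scores = y_score.get(protein, {})
            predicted = {term for term, s in scores.items() if s >= threshold}
            hits = len(predicted & true_terms)
            # Precision only counts proteins with at least one prediction.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall counts every protein (assumes true_terms is non-empty).
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(y_true)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    truth = {"p1": {"a"}, "p2": {"b"}}
    scores = {"p1": {"a": 0.9, "b": 0.2}, "p2": {"b": 0.8}}
    print(fmax(truth, scores))
```

Because the maximum is taken over thresholds, Fmax rewards a good ranking of terms per protein rather than a single well-calibrated cutoff.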
Affiliation(s)
- Jingfu Wang
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Jiaying Chen
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Yue Hu
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Chaolin Song
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Xinhui Li
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Yurong Qian
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Lei Deng
- School of Software, Xinjiang University, Urumqi 830091, China
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
12
|
Gao Y, Zhang T, Wang Y, Lv H, Yan X, Fu L, Liu Y. Identification of hub genes associated with decreased fertility in male mice of advanced paternal age. Front Cell Dev Biol 2025; 13:1520387. [PMID: 40256767 PMCID: PMC12006134 DOI: 10.3389/fcell.2025.1520387] [Received: 10/31/2024] [Accepted: 03/25/2025] [Indexed: 04/22/2025]
Abstract
Introduction: Aging and delayed parenthood are major social concerns. Men older than 35 years, an age regarded as advanced paternal age, experience reduced sperm quality and fertility. Methods: In this study, 12-month-old mice served as a model for males of advanced paternal age. RNA sequencing (RNA-seq) of epididymides from 2- and 12-month-old mice was performed. Results: Spermatogonia and sperm counts were significantly lower in the 12-month-old mice. We identified 449 differentially expressed genes (DEGs) by RNA-seq. Altered pathways were identified by Gene Ontology and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses. Moreover, nine hub genes were identified from the DEGs, along with DEGs associated with mitochondria. Discussion: These results could enhance understanding of the molecular mechanisms underlying decreased male fertility in men of advanced paternal age and may aid in developing targeted treatment for male infertility related to aging.
Affiliation(s)
- Yang Gao
- Institute of Pediatric Research, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
- Department of Pediatrics, The First People’s Hospital of Lianyungang, Xuzhou Medical University Affiliated Hospital of Lianyungang (Lianyungang Clinical College of Nanjing Medical University), Lianyungang, China
| | - Ting Zhang
- Department of Urology, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
| | - Yan Wang
- Institute of Pediatric Research, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
| | - Haitao Lv
- Institute of Pediatric Research, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
- Department of Cardiology, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
| | - Xiangming Yan
- Department of Urology, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
| | - Longlong Fu
- Reproductive Health Research Centre/Human Sperm Bank,NHC Key Laboratory of Frontiers and Technologies in Reproductive Health, National Research Institute for Family Planning, Beijing, China
| | - Ying Liu
- Institute of Pediatric Research, Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
13
|
Kim HR, Ji H, Kim GB, Lee SY. Enzyme functional classification using artificial intelligence. Trends Biotechnol 2025:S0167-7799(25)00088-5. [PMID: 40155269 DOI: 10.1016/j.tibtech.2025.03.003] [Received: 02/02/2025] [Revised: 02/27/2025] [Accepted: 03/06/2025] [Indexed: 04/01/2025]
Abstract
Enzymes are essential for cellular metabolism, and elucidating their functions is critical for advancing biochemical research. However, experimental methods are often time consuming and resource intensive. To address this, significant efforts have been directed toward applying artificial intelligence (AI) to enzyme function prediction, enabling high-throughput and scalable approaches. In this review, we discuss advances in AI-driven enzyme functional annotation, transitioning from traditional machine learning (ML) methods to state-of-the-art deep learning approaches. We highlight how deep learning enables models to automatically extract features from raw data without manual intervention, leading to enhanced performance. Finally, we discuss the discovery of novel enzyme functions and generation of de novo enzymes through the integration of generative AIs and bio big data as future research directions.
Affiliation(s)
- Ha Rim Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Hongkeun Ji
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Gi Bae Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Sang Yup Lee
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Graduate School of Engineering Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Center for Synthetic Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea.
14
|
Mao Y, Xu W, Shun Y, Chai L, Xue L, Yang Y, Li M. A multimodal model for protein function prediction. Sci Rep 2025; 15:10465. [PMID: 40140535 PMCID: PMC11947276 DOI: 10.1038/s41598-025-94612-y] [Received: 01/01/2025] [Accepted: 03/14/2025] [Indexed: 03/28/2025]
Abstract
Protein function, which is determined by sequence, structure, and other characteristics, plays a crucial role in an organism's performance. Existing protein function prediction methods mainly rely on sequence data and often ignore structural properties that are crucial for accurate prediction. Protein structure provides richer spatial and functional insights, which can significantly improve prediction accuracy. In this work, we propose a multi-modal protein function prediction model (MMPFP) that integrates protein sequence and structure information through the use of GCN, CNN, and Transformer models. We validate the model using the PDBest dataset, demonstrating that MMPFP outperforms traditional single-modal models in the molecular function (MF), biological process (BP), and cellular component (CC) prediction tasks. Specifically, MMPFP achieved AUPR scores of 0.693, 0.355, and 0.478; [Formula: see text] scores of 0.752, 0.629, and 0.691; and [Formula: see text] scores of 0.336, 0.488, and 0.459, showing a 3-5% improvement over single-modal models. Additionally, ablation studies confirm the effectiveness of the Transformer module within the GCN branch, further validating MMPFP's superior performance over existing methods. This multi-modal approach offers a more accurate and comprehensive framework for protein function prediction, addressing key limitations of current models.
Affiliation(s)
- Yu Mao
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - WenHui Xu
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Yue Shun
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - LongXin Chai
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Lei Xue
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Yong Yang
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
| | - Mei Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
15
|
Song C, He S, Qian Y, Li X, Hu Y, Chen J, Wang J, Deng L. DeepMVD: A Novel Multiview Dynamic Feature Fusion Model for Accurate Protein Function Prediction. J Chem Inf Model 2025; 65:3077-3089. [PMID: 40053671 DOI: 10.1021/acs.jcim.4c02216] [Indexed: 03/09/2025]
Abstract
Proteins, as the fundamental macromolecules of life, play critical roles in various biological processes. Recent advancements in intelligent protein function prediction methods leverage sequences, structures, and biomedical literature data. Among them, function prediction from protein sequences remains an enduring and popular research direction. Existing studies have failed to effectively utilize the multilevel attribute features reflected in protein sequences. This limitation hinders the enrichment of protein descriptions needed for high-precision prediction of protein functions. To address this, we propose DeepMVD, a novel deep learning model that enhances prediction accuracy by dynamically fusing multiview features. DeepMVD employs specialized modules to extract unique features from each view and utilizes an adaptive fusion mechanism for optimal integration. Evaluation on the CAFA4 data set shows that DeepMVD significantly outperforms existing state-of-the-art models on the BP, MF, and CC ontologies, obtaining the highest Fmax on each (0.523, 0.712, and 0.740, respectively). Ablation studies confirm the model's robustness. Source code and data sets are available at http://swanhub.co/scl/DeepMVD.
Affiliation(s)
- Chaolin Song
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Shiwen He
- School of Software, Xinjiang University, Urumqi 830091, China
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Yurong Qian
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Xinhui Li
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Yue Hu
- School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
- Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
| | - Jiaying Chen
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Jingfu Wang
- School of Software, Xinjiang University, Urumqi 830091, China
- Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
| | - Lei Deng
- School of Software, Xinjiang University, Urumqi 830091, China
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
16
|
Khanduja A, Mohanty D. SProtFP: a machine learning-based method for functional classification of small ORFs in prokaryotes. NAR Genom Bioinform 2025; 7:lqae186. [PMID: 39781515 PMCID: PMC11704790 DOI: 10.1093/nargab/lqae186] [Received: 08/16/2024] [Revised: 11/07/2024] [Accepted: 12/17/2024] [Indexed: 01/12/2025]
Abstract
Small proteins (≤100 amino acids) play important roles across all life forms, ranging from unicellular bacteria to higher organisms. In this study, we have developed SProtFP, a machine learning-based method for functional annotation of prokaryotic small proteins into selected functional categories. SProtFP uses independent artificial neural networks (ANNs) trained using a combination of physicochemical descriptors for classifying small proteins into antitoxin type 2, bacteriocin, DNA-binding, metal-binding, ribosomal protein, RNA-binding, type 1 toxin and type 2 toxin proteins. We have also trained a model for identification of small open reading frame (smORF)-encoded antimicrobial peptides (AMPs). Comprehensive benchmarking of SProtFP revealed an average area under the receiver operating characteristic curve (ROC-AUC) of 0.92 during 10-fold cross-validation and ROC-AUCs of 0.94 and 0.93 on held-out balanced and imbalanced test sets, respectively. Utilizing our method to annotate bacterial isolates from the human gut microbiome, we could identify thousands of remote homologs of known small protein families and assign putative functions to uncharacterized proteins. This highlights the utility of SProtFP for large-scale functional annotation of microbiome datasets, especially in cases where sequence homology is low. SProtFP is freely available at http://www.nii.ac.in/sprotfp.html and can be combined with genome annotation tools such as ProsmORF-pred to uncover the functional repertoire of novel small proteins in bacteria.
Affiliation(s)
- Akshay Khanduja
- National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India
| | - Debasisa Mohanty
- National Institute of Immunology, Aruna Asaf Ali Marg, New Delhi 110067, India
17
|
Luo J, Luo Y. Learning maximally spanning representations improves protein function annotation. bioRxiv 2025:2025.02.13.638156. [PMID: 40027840 PMCID: PMC11870436 DOI: 10.1101/2025.02.13.638156] [Indexed: 03/05/2025]
Abstract
Automated protein function annotation is a fundamental problem in computational biology, crucial for understanding the functional roles of proteins in biological processes, with broad implications in medicine and biotechnology. A persistent challenge in this problem is the imbalanced, long-tail distribution of available function annotations: a small set of well-studied function classes account for most annotated proteins, while many other classes have few annotated proteins, often due to investigative bias, experimental limitations, or intrinsic biases in protein evolution. As a result, existing machine learning models for protein function prediction tend to only optimize the prediction accuracy for well-studied function classes overrepresented in the training data, leading to poor accuracy for understudied functions. In this work, we develop MSRep, a novel deep learning-based protein function annotation framework designed to address this imbalance issue and improve annotation accuracy. MSRep is inspired by an intriguing phenomenon, called neural collapse (NC), commonly observed in high-accuracy deep neural networks used for classification tasks, where hidden representations in the final layer collapse to class-specific mean embeddings, while maintaining maximal inter-class separation. Given that NC consistently emerges across diverse architectures and tasks for high-accuracy models, we hypothesize that inducing NC structure in models trained on imbalanced data can enhance both prediction accuracy and generalizability. To achieve this, MSRep refines a pre-trained protein language model to produce NC-like representations by optimizing an NC-inspired loss function, which ensures that minority functions are equally represented in the embedding space as majority functions, in contrast to conventional classification methods whose embedding spaces are dominated by overrepresented classes. 
In evaluations across four protein function annotation tasks on the prediction of Enzyme Commission numbers, Gene3D codes, Pfam families, and Gene Ontology terms, MSRep demonstrates superior predictive performance for both well- and underrepresented classes, outperforming several state-of-the-art annotation tools. We anticipate that MSRep will enhance the annotation of understudied functions and novel, uncharacterized proteins, advancing future protein function studies and accelerating the discovery of new functional proteins. The source code of MSRep is available at https://github.com/luo-group/MSRep.
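The neural collapse geometry mentioned in this abstract, class means that are maximally and equally separated, is usually described as a simplex equiangular tight frame (ETF). The sketch below only constructs that target geometry and checks its angles; it is not MSRep's loss function, which the abstract does not specify, and the helper names `simplex_etf` and `cosine` are hypothetical.

```python
import math


def simplex_etf(k):
    """Return k unit vectors in R^k forming a simplex ETF (requires k >= 2).

    Row i is sqrt(k / (k - 1)) * (e_i - (1/k) * ones), so every pair of
    distinct rows has cosine similarity -1 / (k - 1).
    """
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Four classes: every pair of class-mean directions sits at cosine -1/3.
means = simplex_etf(4)
print(round(cosine(means[0], means[1]), 4))
print(round(cosine(means[2], means[3]), 4))
```

Because every class gets the same angular margin regardless of how many training examples it has, a geometry like this is one way to picture the abstract's claim that minority functions are represented on an equal footing with majority ones.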
Affiliation(s)
- Jiaqi Luo
- School of Computational Science and Engineering, Georgia Institute of Technology
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology
18
|
Kennedy L, Sandhu JK, Harper ME, Cuperlovic-Culf M. A hybrid machine learning framework for functional annotation of mitochondrial glutathione transport and metabolism proteins in cancers. BMC Bioinformatics 2025; 26:48. [PMID: 39934670 PMCID: PMC11817629 DOI: 10.1186/s12859-025-06051-1] [Received: 03/18/2024] [Accepted: 01/15/2025] [Indexed: 02/13/2025]
Abstract
BACKGROUND: Alterations of metabolism, including changes in mitochondrial metabolism as well as glutathione (GSH) metabolism, are a well-appreciated hallmark of many cancers. Mitochondrial GSH (mGSH) transport is a poorly characterized aspect of GSH metabolism, which we investigate in the context of cancer. Existing functional annotation approaches that use machine learning (ML) or deep learning (DL) models based only on protein sequences are unable to annotate functions in biological contexts. RESULTS: We develop a flexible ML framework for functional annotation from diverse feature data. This hybrid ML framework leverages cancer cell line multi-omics data and other biological knowledge data as features to uncover potential genes involved in mGSH metabolism and membrane transport in cancers. This framework achieves strong performance across functional annotation tasks and several cell line and primary tumor cancer samples. For our application, classification models predict the known mGSH transporter SLC25A39, but not SLC25A40, as highly likely to be related to mGSH metabolism in cancers. SLC25A10, SLC25A50, and the orphan SLC25A24 and SLC25A43 are predicted to be associated with mGSH metabolism in multiple biological contexts, and structural analysis of these proteins reveals similarities in potential substrate binding regions to the binding residues of SLC25A39. CONCLUSION: These findings have implications for a better understanding of cancer cell metabolism and novel therapeutic targets with respect to GSH metabolism through potential novel functional annotations of genes. The hybrid ML framework proposed here can be applied to other biological function classifications or multi-omics datasets to generate hypotheses in various biological contexts. Code and a tutorial for generating models and predictions in this framework are available at: https://github.com/lkenn012/mGSH_cancerClassifiers.
Affiliation(s)
- Luke Kennedy
- Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada
- Ottawa Institute of Systems Biology, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada
| | - Jagdeep K Sandhu
- Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada
- Human Health Therapeutics Research Centre, National Research Council Canada, 1200 Montreal Road, Bldg M54, Ottawa, ON, K1A 0R6, Canada
| | - Mary-Ellen Harper
- Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada.
- Ottawa Institute of Systems Biology, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada.
| | - Miroslava Cuperlovic-Culf
- Department of Biochemistry, Microbiology and Immunology, Faculty of Medicine, University of Ottawa, 451 Smyth Road, Ottawa, ON, K1H 8M5, Canada.
- Digital Technologies Research Centre, National Research Council Canada, 1200 Montreal Road, Bldg M50, Ottawa, ON, K1A 0R6, Canada.
19
|
Wang Y, Sun Y, Lin B, Zhang H, Luo X, Liu Y, Jin X, Zhu D. SEGT-GO: a graph transformer method based on PPI serialization and explanatory artificial intelligence for protein function prediction. BMC Bioinformatics 2025; 26:46. [PMID: 39930351 PMCID: PMC11808960 DOI: 10.1186/s12859-025-06059-7] [Received: 11/05/2024] [Accepted: 01/20/2025] [Indexed: 02/14/2025]
Abstract
BACKGROUND: A massive number of protein sequences has been obtained, but their functions remain challenging to discern. In recent research on protein function prediction, Protein-Protein Interaction (PPI) networks have played a crucial role. Uncovering potential functional relationships between distant proteins within PPI networks is essential for improving the accuracy of protein function prediction. Most current studies attempt to capture these distant relationships by stacking graph network layers, but performance gains diminish as the number of layers increases. RESULTS: To further explore the potential functional relationships between multi-hop proteins in PPI networks, this paper proposes SEGT-GO, a Graph Transformer method based on PPI multi-hop neighborhood Serialization and Explainable artificial intelligence for large-scale multispecies protein function prediction. The multi-hop neighborhood serialization maps multi-hop information in the PPI network into serialized feature embeddings, enabling the Graph Transformer to learn deeper functional features within the PPI network. Based on game theory, the SHAP eXplainable Artificial Intelligence (XAI) framework optimizes model input and filters out feature noise, enhancing model performance. CONCLUSIONS: Compared to the advanced network method DeepGraphGO, SEGT-GO achieves more competitive results on standard large-scale datasets and superior results on small ones, validating its ability to extract functional information from deep proteins. Furthermore, SEGT-GO achieves superior results in cross-species learning and in predicting the functions of unseen proteins, further demonstrating the method's strong generalization.
Affiliation(s)
- Yansong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
| | - Yundong Sun
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
- Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Baohui Lin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China
| | - Haotian Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
| | - Xiaoling Luo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China.
| | - Dongjie Zhu
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China.
20
|
Prabakaran R, Bromberg Y. Functional profiling of the sequence stockpile: a protein pair-based assessment of in silico prediction tools. Bioinformatics 2025; 41:btaf035. [PMID: 39854283 PMCID: PMC11821270 DOI: 10.1093/bioinformatics/btaf035] [Received: 07/17/2024] [Revised: 11/04/2024] [Accepted: 01/22/2025] [Indexed: 01/26/2025]
Abstract
MOTIVATION: In silico functional annotation of proteins is crucial to narrowing the sequencing-accelerated gap in our understanding of protein activities. Numerous function annotation methods exist, and their ranks have been growing, particularly so with the recent deep learning-based developments. However, it is unclear if these tools are truly predictive. As we are not aware of any methods that can identify new terms in functional ontologies, we ask if they can, at least, identify molecular functions of proteins that are non-homologous to or far-removed from known protein families. RESULTS: Here, we explore the potential and limitations of the existing methods in predicting the molecular functions of thousands of such proteins. Lacking the "ground truth" functional annotations, we transformed the assessment of function prediction into evaluation of functional similarity of protein pairs that likely share function but are unlike any of the currently functionally annotated sequences. Notably, our approach transcends the limitations of functional annotation vocabularies, providing a means to assess different-ontology annotation methods. We find that most existing methods are limited to identifying functional similarity of homologous sequences and fail to predict the function of proteins lacking reference. Curiously, despite their seemingly unlimited by-homology scope, deep learning methods also have trouble capturing the functional signal encoded in protein sequence. We believe that our work will inspire the development of a new generation of methods that push boundaries and promote exploration and discovery in the molecular function domain. AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available at https://doi.org/10.6084/m9.figshare.c.6737127.v3. The code used to compute siblings is available openly at https://bitbucket.org/bromberglab/siblings-detector/.
Affiliation(s)
- R Prabakaran
- Department of Biology, Emory University, Atlanta, GA 30322, United States
- Department of Computer Science, Emory University, Atlanta, GA 30322, United States
| | - Yana Bromberg
- Department of Biology, Emory University, Atlanta, GA 30322, United States
- Department of Computer Science, Emory University, Atlanta, GA 30322, United States
21
|
Vural O, Jololian L. Machine learning approaches for predicting protein-ligand binding sites from sequence data. Front Bioinform 2025; 5:1520382. [PMID: 39963299 PMCID: PMC11830693 DOI: 10.3389/fbinf.2025.1520382] [Received: 10/31/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025]
Abstract
Proteins, composed of amino acids, are crucial for a wide range of biological functions. Proteins have various interaction sites, one of which is the protein-ligand binding site, essential for molecular interactions and biochemical reactions. These sites enable proteins to bind with other molecules, facilitating key biological functions. Accurate prediction of these binding sites is pivotal in computational drug discovery, helping to identify therapeutic targets and facilitate treatment development. Machine learning has made significant contributions to this field by improving the prediction of protein-ligand interactions. This paper reviews studies that use machine learning to predict protein-ligand binding sites from sequence data, focusing on recent advancements. The review examines various embedding methods and machine learning architectures, addressing current challenges and the ongoing debates in the field. Additionally, research gaps in the existing literature are highlighted, and potential future directions for advancing the field are discussed. This study provides a thorough overview of sequence-based approaches for predicting protein-ligand binding sites, offering insights into the current state of research and future possibilities.
Affiliation(s)
- Orhun Vural
- Department of Electrical and Computer Engineering, The University of Alabama at Birmingham, Birmingham, AL, United States
22
|
Li H, Chen Y, Xia Z, Zhuang D, Cong F, Lian YX. Metagenomic investigation of viruses in green sea turtles (Chelonia mydas). Front Microbiol 2025; 16:1492038. [PMID: 39911250 PMCID: PMC11794262 DOI: 10.3389/fmicb.2025.1492038] [Received: 09/06/2024] [Accepted: 01/07/2025] [Indexed: 02/07/2025]
Abstract
Green sea turtles are listed on the International Union for Conservation of Nature's Red List of Threatened Species. Thus, conservation efforts, including investigation of factors affecting the health of green sea turtles, are critical. Viral communities play vital roles in maintaining animal health. In the present study, shotgun metagenomics was used for the first time to survey viruses in the feces of green sea turtles. Most viral contigs were DNA viruses that mainly belonged to Caudoviricetes, followed by Crassvirales. Additionally, most of the viral contigs were not assigned to any known family or genus, implying a large knowledge gap in the taxonomy of green sea turtle gut viruses. Host prediction showed that most viruses were connected to two phyla: Bacteroidetes and Firmicutes. Furthermore, KEGG enrichment analysis showed that the viral genes were mainly involved in phage-associated and metabolic pathways. Phylogenetic tree reconstruction of Caudovirales terminase large-subunit (TerL) protein showed that most of the sequences were phylogenetically distant. This study expands our understanding of the viral diversity in green sea turtles. In particular, analysis of the virome RNA fraction is exceedingly important for investigating intestinal viromes; therefore, future studies could use metatranscriptomics to study RNA viruses.
Affiliation(s)
- Hongwei Li
- School of Life Science, Huizhou University, Huizhou, China
| | - Yuan Chen
- School of Life Science, Huizhou University, Huizhou, China
| | - Zhongrong Xia
- Guangdong Huidong Sea Turtle National Nature Reserve Bureau, Sea Turtle Bay, Huizhou, China
| | - Daohua Zhuang
- State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, School of Life Sciences, Yunnan University, Kunming, China
| | - Feng Cong
- Guangdong Laboratory Animal Monitoring Institute and Guangdong Provincial Key Laboratory of Laboratory Animals, Guangzhou, China
| | - Yue-Xiao Lian
- Guangdong Laboratory Animal Monitoring Institute and Guangdong Provincial Key Laboratory of Laboratory Animals, Guangzhou, China
|
23
|
Chen JY, Wang JF, Hu Y, Li XH, Qian YR, Song CL. Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Front Bioeng Biotechnol 2025; 13:1506508. [PMID: 39906415 PMCID: PMC11790633 DOI: 10.3389/fbioe.2025.1506508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2024] [Accepted: 01/02/2025] [Indexed: 02/06/2025] Open
Abstract
Protein function prediction is crucial in several key areas such as bioinformatics and drug design. With the rapid progress of deep learning technology, applying protein language models has become a research focus. These models exploit the growing volume of large-scale protein sequence data to mine its intrinsic semantic information, which can effectively improve the accuracy of protein function prediction. This review surveys the current status of applying the latest protein language models to protein function prediction and provides an exhaustive performance comparison with traditional prediction methods. An in-depth analysis of experimental results demonstrates the significant advantages of protein language models in enhancing the accuracy and depth of protein function prediction.
Affiliation(s)
- Jia-Ying Chen
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Jing-Fu Wang
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Yue Hu
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Xin-Hui Li
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Yu-Rong Qian
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- School of Computer Science and Technology, Xinjiang University, Urumqi, China
| | - Chao-Lin Song
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
|
24
|
Chou JC, Dassama LMK. Lipid Trafficking in Diverse Bacteria. Acc Chem Res 2025; 58:36-46. [PMID: 39680024 PMCID: PMC11713862 DOI: 10.1021/acs.accounts.4c00540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Revised: 11/27/2024] [Accepted: 12/02/2024] [Indexed: 12/17/2024]
Abstract
Lipids are essential for life and serve as cell envelope components, signaling molecules, and nutrients. For lipids to achieve their required functions, they need to be correctly localized. This requires the action of transporter proteins and an energy source. The current understanding of bacterial lipid transporters is limited to a few classes. Given the diversity of lipid species and the predicted existence of specific lipid transporters, many more transporters await discovery and characterization. These proteins could be prime targets for modulators that control bacterial cell proliferation and pathogenesis. One overarching goal of our research is to understand the molecular mechanisms of bacterial metabolite trafficking, including lipids, and to leverage that understanding to identify or engineer inhibitory ligands. In recent years, our work has revealed two novel lipid transport systems in bacteria: bacterial sterol transporters (Bst) A, B, and C in Methylococcus capsulatus and the TatT proteins in Enhygromyxa salina and Treponema pallidum. Both systems are composed of transporters bioinformatically identified as being involved in the transport of other metabolites, but substrates were never revealed. However, the genetic colocalization of the genes encoding BstABC with sterol biosynthetic enzymes in M. capsulatus suggested that they might recognize sterols as substrates. Also, homologues of TatTs are present in diverse bacteria but are overrepresented in bacteria deficient in de novo lipid synthesis or residing in nutrient-poor environments; we reasoned that these proteins might facilitate the transport of lipids. Our efforts to reveal the substrate scope of two TatT proteins revealed their engagement with long-chain fatty acids. 
Enabling the discovery of the BstABC system and the TatT proteins were bioinformatic analyses, quantitative measurements of protein-ligand equilibrium affinities, and high-resolution structural studies that provided remarkable insights into ligand binding cavities and the structural basis for ligand interaction. These approaches, in particular our bioinformatics and structural work, highlighted the diversity of protein sequence and structures amenable to lipid engagement. These observations allowed the hypothesis that lipid handling proteins, in general and especially so in the bacterial domain, can have diverse amino acid compositions and three-dimensional structures. As such, bioinformatics geared at identifying them in poorly characterized genomes is likely to miss many candidates that diverge from well-characterized family members. This realization spurred efforts to understand the unifying features in all of the lipid handling proteins we have characterized to date. To do this, we inspected the ligand binding sites of the proteins: they were remarkably hydrophobic and sometimes displayed a dichotomy of hydrophobic and hydrophilic amino acids, akin to the ligands that they accommodate in those cavities. Because of this, we reasoned that the physicochemical features of ligand binding cavities could be accurate predictors of a protein's propensity to bind lipids. This finding was leveraged to create structure-based lipid-interacting pocket predictor (SLiPP), a machine-learning algorithm capable of identifying ligand cavities with physico-chemical features consistent with those of known lipid binding sites. SLiPP is especially useful in poorly annotated genomes (such as with bacterial pathogens), where it could reveal candidate proteins to be targeted for the development of antimicrobials.
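The idea that the physicochemistry of a binding cavity can hint at lipid binding can be illustrated with a very small sketch. The code below scores a hypothetical string of pocket-lining residues by its mean Kyte-Doolittle hydropathy. This is not SLiPP's feature set or model, which the abstract does not describe; the function name, the choice of the Kyte-Doolittle scale, and the example residue strings are all assumptions made for illustration.

```python
# Illustrative only: NOT the SLiPP algorithm. Scores a pocket by mean hydropathy.

# Kyte-Doolittle hydropathy index for the 20 standard amino acids (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_pocket_hydropathy(pocket_residues: str) -> float:
    """Return the mean Kyte-Doolittle hydropathy of the residues lining a pocket.

    `pocket_residues` is a hypothetical input: one-letter codes of the residues
    that some upstream pocket-detection step assigned to a cavity.
    """
    if not pocket_residues:
        raise ValueError("pocket_residues must not be empty")
    try:
        scores = [KYTE_DOOLITTLE[aa] for aa in pocket_residues.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown residue code: {exc.args[0]}") from None
    return sum(scores) / len(scores)


# A made-up hydrophobic pocket scores higher than a made-up charged one.
hydrophobic_score = mean_pocket_hydropathy("ILVF")
charged_score = mean_pocket_hydropathy("DEKR")
```

A real predictor would combine many such descriptors (for example pocket volume, polarity, and shape) in a trained classifier; a single hydropathy average is only a stand-in for the kind of feature the abstract alludes to.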
Affiliation(s)
- Jonathan Chiu-Chun Chou
- Department of Chemistry and Sarafan ChEM-H Institute, Stanford University, Stanford, California 94305, United States
| | - Laura M. K. Dassama
- Department of Chemistry and Sarafan ChEM-H Institute, Stanford University, Stanford, California 94305, United States
- Department of Microbiology and Immunology, Stanford School of Medicine, Stanford, California 94305, United States
|
25
|
Wang W, Shuai Y, Zeng M, Fan W, Li M. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information. Nat Commun 2025; 16:70. [PMID: 39746897 PMCID: PMC11697396 DOI: 10.1038/s41467-024-54816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 11/21/2024] [Indexed: 01/04/2025] Open
Abstract
Computational methods for predicting protein function are of great significance in understanding biological mechanisms and treating complex diseases. However, existing computational approaches of protein function prediction lack interpretability, making it difficult to understand the relations between protein structures and functions. In this study, we propose a deep learning-based solution, named DPFunc, for accurate protein function prediction with domain-guided structure information. DPFunc can detect significant regions in protein structures and accurately predict corresponding functions under the guidance of domain information. It outperforms current state-of-the-art methods and achieves a significant improvement over existing structure-based methods. Detailed analyses demonstrate that the guidance of domain information contributes to DPFunc for protein function prediction, enabling our method to detect key residues or regions in protein structures, which are closely related to their functions. In summary, DPFunc serves as an effective tool for large-scale protein function prediction, which pushes the border of protein understanding in biological systems.
Affiliation(s)
- Wenkang Wang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Yunyan Shuai
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Wei Fan
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, OX39DU, UK
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
|
26
|
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2025; 25:e2300471. [PMID: 38996351 PMCID: PMC11735672 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods have been developed over the last two decades to gradually advance protein function prediction. In recent years in particular, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments in deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to help the machine learning, AI, and bioinformatics communities develop more cutting-edge methods to advance protein function prediction.
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Ahhyun Lee
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
|
27
|
Wang Z, Yuan H, Yan J, Liu J. Identification, characterization, and design of plant genome sequences using deep learning. Plant J 2025; 121:e17190. [PMID: 39666835 DOI: 10.1111/tpj.17190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2024] [Revised: 11/11/2024] [Accepted: 11/23/2024] [Indexed: 12/14/2024]
Abstract
Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction, genomic prediction, and large model applications based on deep learning is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.
Affiliation(s)
- Zhenye Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Hao Yuan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jianbing Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Jianxiao Liu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
|
28
|
Ma W, Bi X, Jiang H, Wei Z, Zhang S. Annotating protein functions via fusing multiple biological modalities. Commun Biol 2024; 7:1705. [PMID: 39730886 DOI: 10.1038/s42003-024-07411-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Accepted: 12/17/2024] [Indexed: 12/29/2024] Open
Abstract
Understanding the function of proteins is of great significance for revealing disease pathogenesis and discovering new targets. Benefiting from the explosive growth of the protein universe, deep learning has been applied to accelerate the protein annotation cycle using different biological modalities. However, most existing deep learning-based methods not only fail to fuse different biological modalities effectively, resulting in low-quality protein representations, but also converge to suboptimal solutions because of sparse label representations. To address these issues, we propose a multiprocedural approach for fusing heterogeneous biological modalities and annotating protein functions, MIF2GO (Multimodal Information Fusion to infer Gene Ontology terms), which sequentially fuses up to six biological modalities from different biological levels in three steps, leading to powerful protein representations. Evaluation results on seven benchmark datasets show that the proposed method not only considerably outperforms state-of-the-art methods, but also demonstrates great robustness and generalizability across species. We also present biological insights into the associations between those modalities and protein functions. This research provides a robust framework for integrating multimodal biological data, offering a scalable solution for protein function annotation and ultimately facilitating advances in precision medicine and the discovery of novel therapeutic strategies.
Affiliation(s)
- Wenjian Ma
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Xiangpeng Bi
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Huasen Jiang
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Zhiqiang Wei
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Shugang Zhang
- College of Computer Science and Technology, Ocean University of China, Qingdao, China.
|
29
|
Mo W, Vaiana CA, Myers CJ. The need for adaptability in detection, characterization, and attribution of biosecurity threats. Nat Commun 2024; 15:10699. [PMID: 39702312 PMCID: PMC11659417 DOI: 10.1038/s41467-024-55436-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Accepted: 12/12/2024] [Indexed: 12/21/2024] Open
Abstract
Modern biotechnology necessitates robust biosecurity protocols to address the risk of engineered biological threats. Current efforts focus on screening DNA and rejecting the synthesis of dangerous elements but face technical and logistical barriers. Screening should integrate into a broader strategy that addresses threats at multiple stages of development and deployment. The success of this approach hinges upon reliable detection, characterization, and attribution of engineered DNA. Recent advances notably aid the potential to both develop threats and analyze them. However, further work is needed to translate developments into biosecurity applications. This work reviews cutting-edge methods for DNA analysis and recommends avenues to improve biosecurity in an adaptable manner.
Affiliation(s)
- William Mo
- Draper Scholar, The Charles Stark Draper Laboratory, Inc., 555 Technology Square, Cambridge, MA, USA
- Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, 1111 Engineering Dr, Boulder, CO, USA
| | - Christopher A Vaiana
- The Charles Stark Draper Laboratory, Inc., 555 Technology Square, Cambridge, MA, USA
| | - Chris J Myers
- Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, 1111 Engineering Dr, Boulder, CO, USA.
|
30
|
Xiang W, Xiong Z, Chen H, Xiong J, Zhang W, Fu Z, Zheng M, Liu B, Shi Q. FAPM: functional annotation of proteins using multimodal models beyond structural modeling. Bioinformatics 2024; 40:btae680. [PMID: 39540736 PMCID: PMC11630832 DOI: 10.1093/bioinformatics/btae680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 10/12/2024] [Accepted: 11/12/2024] [Indexed: 11/16/2024] Open
Abstract
MOTIVATION Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and "tail labels" with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels. RESULTS We introduce functional annotation of proteins using multimodal models (FAPM), a contrastive multimodal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. AVAILABILITY AND IMPLEMENTATION The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo.
Affiliation(s)
- Wenkai Xiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Lingang Laboratory, Shanghai 200031, China
| | | | - Huan Chen
- BioBank, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an 710061, China
| | - Jiacheng Xiong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Lingang Laboratory, Shanghai 200031, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Bing Liu
- BioBank, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an 710061, China
| | - Qian Shi
- Lingang Laboratory, Shanghai 200031, China
|
31
|
Wang H, Ren Z, Sun J, Chen Y, Bo X, Xue J, Gao J, Ni M. DeepPFP: a multi-task-aware architecture for protein function prediction. Brief Bioinform 2024; 26:bbae579. [PMID: 39905954 PMCID: PMC11794456 DOI: 10.1093/bib/bbae579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 09/14/2024] [Accepted: 01/31/2025] [Indexed: 02/06/2025] Open
Abstract
Deriving protein function from protein sequences poses a significant challenge due to the intricate relationship between sequence and function. Deep learning has made remarkable strides in predicting sequence-function relationships. However, models tailored for specific tasks or protein types encounter difficulties when using transfer learning across domains. This is attributed to the fact that protein function relies heavily on structural characteristics rather than mere sequence information. Consequently, there is a pressing need for a model capable of capturing shared features among diverse sequence-function mapping tasks to address the generalization issue. In this study, we explore the potential of Model-Agnostic Meta-Learning combined with a protein language model called Evolutionary Scale Modeling to tackle this challenge. Our approach involves training the architecture on five out-domain deep mutational scanning (DMS) datasets and evaluating its performance across four key dimensions. Our findings demonstrate that the proposed architecture exhibits satisfactory performance in terms of generalization and employs an effective few-shot learning strategy. Specifically, compared to the best results, the Pearson's correlation coefficient (PCC) in the final stage increased by ~0.31%. Furthermore, we leverage the trained architecture to predict binding affinity scores of the DMS dataset of SARS-CoV-2 using transfer learning. Notably, training on a subset of the Ube4b dataset with 500 samples resulted in an improvement of 0.11 in the PCC. These results underscore the potential of our conceptual architecture as a promising methodology for multi-task protein function prediction.
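For readers unfamiliar with the metric reported above, Pearson's correlation coefficient between predicted and measured scores can be computed as in the following minimal sketch. The function name and the toy score lists are hypothetical; they are not taken from the DeepPFP code or data.

```python
import math
from typing import Sequence


def pearson_correlation(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Pearson's r: covariance of the two series divided by the product of their spreads."""
    n = len(predicted)
    if n != len(measured):
        raise ValueError("both series must have the same length")
    if n < 2:
        raise ValueError("at least two paired values are required")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Sums of centered products and centered squares (the 1/n factors cancel).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_p * var_m)


# Toy example with made-up scores: a perfectly linear relationship gives r close to 1.
r_toy = pearson_correlation([0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0, 4.0])
```

Because r is bounded between -1 and 1, an absolute gain such as the 0.11 quoted in the abstract is a substantial change, whereas a relative gain of about 0.31% is small.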
Affiliation(s)
- Han Wang
- College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
| | - Zilin Ren
- Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Jinghong Sun
- College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
| | - Yongbing Chen
- Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
- School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
| | - Xiaochen Bo
- Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
| | - JiGuo Xue
- Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
| | - Jingyang Gao
- College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
| | - Ming Ni
- Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
|
32
|
Luo X, Chi ASY, Lin AH, Ong TJ, Wong L, Rahman CR. Benchmarking recent computational tools for DNA-binding protein identification. Brief Bioinform 2024; 26:bbae634. [PMID: 39657630 PMCID: PMC11630855 DOI: 10.1093/bib/bbae634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Revised: 10/29/2024] [Accepted: 11/20/2024] [Indexed: 12/12/2024] Open
Abstract
Identification of DNA-binding proteins (DBPs) is a crucial task in genome annotation, as it aids in understanding gene regulation, DNA replication, transcriptional control, and various cellular processes. In this paper, we conduct an unbiased benchmarking of 11 state-of-the-art computational tools as well as traditional tools such as ScanProsite, BLAST, and HMMER for identifying DBPs. We highlight the data leakage issue in conventional datasets leading to inflated performance. We introduce new evaluation datasets to support further development. Through a comprehensive evaluation pipeline, we identify potential limitations in models, feature extraction techniques, and training methods, and recommend solutions regarding these issues. We show that combining the predictions of the two best computational tools with BLAST-based prediction significantly enhances DBP identification capability. We provide this consensus method as user-friendly software. The datasets and software are available at https://github.com/Rafeed-bot/DNA_BP_Benchmarking.
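The abstract does not say how the two tools' predictions are combined with the BLAST-based prediction. One simple way to build a consensus from three binary predictors is a majority vote, sketched below. The voting rule, the function name, and the example labels are assumptions for illustration and may differ from the authors' released software.

```python
from typing import List, Sequence


def majority_vote_consensus(
    tool_a: Sequence[int],
    tool_b: Sequence[int],
    blast_based: Sequence[int],
) -> List[int]:
    """Combine three binary DBP predictions (1 = DNA-binding, 0 = not) per protein.

    A protein is called DNA-binding when at least two of the three predictors agree.
    This rule is an assumed stand-in, not the published consensus method.
    """
    if not (len(tool_a) == len(tool_b) == len(blast_based)):
        raise ValueError("all prediction lists must have the same length")
    consensus = []
    for a, b, c in zip(tool_a, tool_b, blast_based):
        if any(label not in (0, 1) for label in (a, b, c)):
            raise ValueError("predictions must be 0 or 1")
        consensus.append(1 if a + b + c >= 2 else 0)
    return consensus


# Made-up predictions for four proteins.
example_consensus = majority_vote_consensus([1, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 0])
# example_consensus == [1, 1, 1, 0]
```

Alternatives such as weighted voting or requiring a BLAST hit before accepting a positive call trade sensitivity against precision differently, and which of these the benchmark uses would need to be checked against the linked repository.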
Affiliation(s)
- Xizi Luo
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Amadeus Song Yi Chi
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Andre Huikai Lin
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Tze Jet Ong
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore 119077, Singapore
|
33
|
Guan J, Ji Y, Peng C, Zou W, Tang X, Shang J, Sun Y. GOPhage: protein function annotation for bacteriophages by integrating the genomic context. Brief Bioinform 2024; 26:bbaf014. [PMID: 39838963 PMCID: PMC11751364 DOI: 10.1093/bib/bbaf014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 12/15/2024] [Accepted: 01/06/2025] [Indexed: 01/23/2025] Open
Abstract
Bacteriophages are viruses that target bacteria, playing a crucial role in microbial ecology. Phage proteins are important in understanding phage biology, such as virus infection, replication, and evolution. Although a large number of new phages have been identified via metagenomic sequencing, many of them have limited protein function annotation. Accurate function annotation of phage proteins presents several challenges, including their inherent diversity and the scarcity of annotated ones. Existing tools have yet to fully leverage the unique properties of phages in annotating protein functions. In this work, we propose a new protein function annotation tool for phages by leveraging the modular genomic structure of phage genomes. By employing embeddings from the latest protein foundation models and Transformer to capture contextual information between proteins in phage genomes, GOPhage surpasses state-of-the-art methods in annotating diverged proteins and proteins with uncommon functions by 6.78% and 13.05% improvement, respectively. GOPhage can annotate proteins lacking homology search results, which is critical for characterizing the rapidly accumulating phage genomes. We demonstrate the utility of GOPhage by identifying 688 potential holins in phages, which exhibit high structural conservation with known holins. The results show the potential of GOPhage to extend our understanding of newly discovered phages.
Affiliation(s)
- Jiaojiao Guan
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
| | - Yongxin Ji
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
| | - Cheng Peng
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
| | - Wei Zou
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
| | - Xubo Tang
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
| | - Jiayu Shang
- Department of Information Engineering, Chinese University of Hong Kong, Shatin, New Territories, Hong Kong (SAR), China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, 83 Tat Chee Ave, Kowloon Tong, Hong Kong (SAR), China
|
34
|
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] Open
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we introduce fundamental concepts of graph embedding algorithms and describe graph representation learning methods for protein function prediction across four principal data categories: PPI networks, protein structures, the Gene Ontology graph, and integrated graphs. The commonly used approaches for each category are summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions are discussed, and directions for future research within the protein research community are suggested.
Affiliation(s)
- Thi Thuy Duong Vu
- Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
| | - Jeongho Kim
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
| | - Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
|
35
|
Xia Z, Ma S, Li J, Guo Y, Jiang L, Tang J. RecGOBD: accurate recognition of gene ontology related brain development protein functions through multi-feature fusion and attention mechanisms. BIOINFORMATICS ADVANCES 2024; 4:vbae163. [PMID: 39678209 PMCID: PMC11639192 DOI: 10.1093/bioadv/vbae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2024] [Revised: 09/30/2024] [Accepted: 10/23/2024] [Indexed: 12/17/2024]
Abstract
Motivation: Protein function prediction is crucial in bioinformatics, driven by the growth of protein sequence data from high-throughput technologies. Traditional methods are costly and slow, underscoring the need for computational solutions. While deep learning offers powerful tools, many models lack optimization for brain development datasets, critical for neurodevelopmental disorder research. To address this, we developed RecGOBD (Recognition of Gene Ontology-related Brain Development protein function), a model tailored to predict protein functions essential to brain development. Results: RecGOBD targets 10 key gene ontology (GO) terms for brain development, embedding protein sequences associated with these terms. Leveraging advanced pre-trained models, it captures both sequence and structure data, aligning them with GO terms through attention mechanisms. The category attention layer enhances prediction accuracy. RecGOBD surpassed five benchmark models in AUROC, AUPR, and Fmax metrics and was further used to predict autism-related protein functions and assess mutation impacts on GO terms. These findings highlight RecGOBD's potential in advancing protein function prediction for neurodevelopmental disorders. Availability and implementation: All Python codes associated with this study are available at https://github.com/ZL-Xia/RECGOBD.git.
Affiliation(s)
- Zhiliang Xia
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Shiqiang Ma
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
| | - Jiawei Li
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
| | - Yan Guo
- Department of Public Health Sciences, University of Miami, Miami, FL 33136, United States
| | - Limin Jiang
- Department of Public Health Sciences, University of Miami, Miami, FL 33136, United States
| | - Jijun Tang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong 518055, China
|
36
|
Liu Q, Zhang C, Freddolino L. InterLabelGO+: unraveling label correlations in protein function prediction. Bioinformatics 2024; 40:btae655. [PMID: 39499152 PMCID: PMC11568131 DOI: 10.1093/bioinformatics/btae655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 10/07/2024] [Accepted: 11/01/2024] [Indexed: 11/07/2024] Open
Abstract
MOTIVATION Accurate protein function prediction is crucial for understanding biological processes and advancing biomedical research. However, the rapid growth of protein sequences far outpaces the experimental characterization of their functions, necessitating the development of automated computational methods. RESULTS We present InterLabelGO+, a hybrid approach that integrates a deep learning-based method with an alignment-based method for improved protein function prediction. InterLabelGO+ incorporates a novel loss function that addresses label dependency and imbalance and further enhances performance through dynamic weighting of the alignment-based component. A preliminary version of InterLabelGO+ achieved a strong performance in the CAFA5 challenge, ranking sixth out of 1625 participating teams. Comprehensive evaluations on large-scale protein function prediction tasks demonstrate InterLabelGO+'s ability to accurately predict Gene Ontology terms across various functional categories and evaluation metrics. AVAILABILITY AND IMPLEMENTATION The source code and datasets for InterLabelGO+ are freely available on GitHub at https://github.com/QuanEvans/InterLabelGO. A web-server is available at https://seq2fun.dcmb.med.umich.edu/InterLabelGO/. The software is implemented in Python and PyTorch, and is supported on Linux and macOS.
Affiliation(s)
- Quancheng Liu
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Lydia Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
|
37
|
Kumar V, Deepak A, Ranjan A, Prakash A. CrossPredGO: A Novel Light-Weight Cross-Modal Multi-Attention Framework for Protein Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1709-1720. [PMID: 38843056 DOI: 10.1109/tcbb.2024.3410696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/27/2024]
Abstract
Proteins are represented in various ways, each contributing differently to protein-related tasks. Here, information from each representation (protein sequence, 3D structure, and interaction data) is combined for an efficient protein function prediction task. Recently, uni-modal approaches have produced promising results with state-of-the-art attention mechanisms that learn the relative importance of features, whereas multi-modal approaches have produced promising results by simply concatenating features obtained computationally from different representations, which increases the overall number of trainable parameters. In this paper, we propose a novel, light-weight cross-modal multi-attention (CrMoMulAtt) mechanism that captures the relative contribution of each modality with fewer trainable parameters. The proposed mechanism shows a higher contribution from PPI and a lower contribution from structure data. The results obtained from the proposed CrossPredGO mechanism demonstrate an increment in the range of +(3.29 to 7.20)% with at most 31% fewer trainable parameters compared with DeepGO and MultiPredGO.
|
38
|
Kumar V, Deepak A, Ranjan A, Prakash A. Bi-SeqCNN: A Novel Light-Weight Bi-Directional CNN Architecture for Protein Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1922-1933. [PMID: 38990747 DOI: 10.1109/tcbb.2024.3426491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone for predicting protein function, with promising state-of-the-art (SOTA) results. RNNs, with their built-in ability to (i) focus on past information, (ii) collect both short- and long-range dependency information, and (iii) process sequences bi-directionally, offer a strong sequential processing mechanism. CNNs, however, are confined to focusing on short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and is used for developing a protein function prediction framework, Bi-SeqCNN. This is a sub-sequence-based framework. Further, Bi-SeqCNN is an ensemble approach, which improves the prediction results. To our knowledge, this is the first time bi-directional CNNs are employed for general temporal data analysis and not just for protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. Moreover, it is substantially lighter and attains these results with 0.50-0.70 times fewer parameters than the SOTA methods.
|
39
|
Taha K. Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1965-1986. [PMID: 39008392 DOI: 10.1109/tcbb.2024.3427381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
|
40
|
Gómez-Valadés A, Martínez-Tomás R, García-Herranz S, Bjørnerud A, Rincón M. Early detection of mild cognitive impairment through neuropsychological tests in population screenings: a decision support system integrating ontologies and machine learning. Front Neuroinform 2024; 18:1378281. [PMID: 39478874 PMCID: PMC11522961 DOI: 10.3389/fninf.2024.1378281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Accepted: 10/04/2024] [Indexed: 11/02/2024] Open
Abstract
Machine learning (ML) methodologies for detecting Mild Cognitive Impairment (MCI) are progressively gaining prevalence to manage the vast volume of processed information. Nevertheless, the black-box nature of ML algorithms and the heterogeneity within the data may result in varied interpretations across distinct studies. To avoid this, in this proposal, we present the design of a decision support system that integrates a machine learning model represented using the Semantic Web Rule Language (SWRL) in an ontology with specialized knowledge in neuropsychological tests, the NIO ontology. The system's ability to detect MCI subjects was evaluated on a database of 520 neuropsychological assessments conducted in Spanish and compared with other well-established ML methods. Using the F2 coefficient to minimize false negatives, results indicate that the system performs similarly to other well-established ML methods (F2TE2 = 0.830, only below bagging, F2BAG = 0.832) while exhibiting other significant attributes such as explanation capability and data standardization to a common framework thanks to the ontological part. On the other hand, the system's versatility and ease of use were demonstrated with three additional use cases: evaluation of new cases even if the acquisition stage is incomplete (the case records have missing values), incorporation of a new database into the integrated system, and use of the ontology capabilities to relate different domains. This makes it a useful tool to support physicians and neuropsychologists in population-based screenings for early detection of MCI.
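The F2 coefficient used in this abstract is the F-beta score with beta = 2, which weights recall more heavily than precision and therefore penalizes false negatives. A minimal sketch of that formula follows; the confusion counts are hypothetical values chosen for illustration and are not taken from the study.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from confusion counts; beta > 1 favors recall over precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical counts (assumption, for illustration only).
f2 = f_beta(tp=80, fp=30, fn=10)            # beta = 2: false negatives cost more
f1 = f_beta(tp=80, fp=30, fn=10, beta=1.0)  # beta = 1: the usual F1 score
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

With few false negatives relative to false positives, as in these counts, F2 comes out higher than F1, which is why it suits screening settings where missed cases are the main concern.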
Affiliation(s)
- Alba Gómez-Valadés
- Department of Artificial Intelligence, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Rafael Martínez-Tomás
- Department of Artificial Intelligence, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | | | - Atle Bjørnerud
- Computational Radiology and Artificial Intelligence Unit, Department of Physics and Computational Radiology, Clinic for Radiology and Nuclear Medicine, Oslo University Hospital, Oslo, Norway
- Department of Physics, University of Oslo, Oslo, Norway
| | - Mariano Rincón
- Department of Artificial Intelligence, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
|
41
|
Li L, Dannenfelser R, Cruz C, Yao V. A best-match approach for gene set analyses in embedding spaces. Genome Res 2024; 34:1421-1433. [PMID: 39231608 PMCID: PMC11529866 DOI: 10.1101/gr.279141.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 08/29/2024] [Indexed: 09/06/2024]
Abstract
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
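The idea of a best-match comparison between gene sets in an embedding space can be illustrated with a small sketch. This is not the ANDES implementation: it assumes cosine similarity, a plain symmetric average of best matches, and a toy two-dimensional embedding, and the function names are hypothetical. The abstract does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def best_match_similarity(set_a, set_b, emb):
    """Average, in both directions, each gene's best similarity to the other set."""
    a_to_b = [max(cosine(emb[g], emb[h]) for h in set_b) for g in set_a]
    b_to_a = [max(cosine(emb[h], emb[g]) for g in set_a) for h in set_b]
    return 0.5 * (sum(a_to_b) / len(a_to_b) + sum(b_to_a) / len(b_to_a))


# Toy embedding (assumption): g1/g3 point one way, g2/g4 another.
emb = {"g1": [1.0, 0.0], "g2": [0.0, 1.0], "g3": [1.0, 0.0], "g4": [0.0, 2.0]}
matched = best_match_similarity(["g1", "g2"], ["g3", "g4"], emb)
unmatched = best_match_similarity(["g1"], ["g2"], emb)
print(matched, unmatched)
```

Because each gene is scored only against its closest counterpart, two sets that each contain several distinct functional groups can still score highly, whereas comparing set centroids would blur those groups together.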
Affiliation(s)
- Lechuan Li
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Ruth Dannenfelser
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Charlie Cruz
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Vicky Yao
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
|
42
|
Meng L, Wang X. TAWFN: a deep learning framework for protein function prediction. Bioinformatics 2024; 40:btae571. [PMID: 39312678 PMCID: PMC11639667 DOI: 10.1093/bioinformatics/btae571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 08/27/2024] [Accepted: 09/19/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION Proteins play pivotal roles in biological systems, and precise prediction of their functions is indispensable for practical applications. Despite the surge in protein sequence data facilitated by high-throughput techniques, unraveling the exact functionalities of proteins still demands considerable time and resources. Currently, numerous methods rely on protein sequences for prediction, while methods targeting protein structures are scarce, often employing convolutional neural networks (CNN) or graph convolutional networks (GCNs) individually. RESULTS To address these challenges, our approach starts from protein structures and proposes a method that combines CNN and GCN into a unified framework called the two-model adaptive weight fusion network (TAWFN) for protein function prediction. First, amino acid contact maps and sequences are extracted from the protein structure. Then, the sequence is used to generate one-hot encoded features and deep semantic features. These features, along with the constructed graph, are fed into the adaptive graph convolutional networks (AGCN) module and the multi-layer convolutional neural network (MCNN) module as needed, resulting in preliminary classification outcomes. Finally, the preliminary classification results are inputted into the adaptive weight computation network, where adaptive weights are calculated to fuse the initial predictions from both networks, yielding the final prediction result. To evaluate the effectiveness of our method, experiments were conducted on the PDBset and AFset datasets. For molecular function, biological process, and cellular component tasks, TAWFN achieved area under the precision-recall curve (AUPR) values of 0.718, 0.385, and 0.488 respectively, with corresponding Fmax scores of 0.762, 0.628, and 0.693, and Smin scores of 0.326, 0.483, and 0.454. The experimental results demonstrate that TAWFN exhibits promising performance, outperforming existing methods. 
AVAILABILITY AND IMPLEMENTATION The TAWFN source code can be found at: https://github.com/ss0830/TAWFN.
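Fmax, reported here and in several neighboring abstracts, is commonly computed as the maximum protein-centric F-measure over decision thresholds. The sketch below follows that common CAFA-style definition, which is an assumption since the abstract does not spell out the evaluation code; the function name, GO term labels, and scores are all hypothetical.

```python
def fmax(y_true, y_score, thresholds=None):
    """Maximum protein-centric F-measure over thresholds.

    y_true:  list of sets of true terms, one set per protein.
    y_score: list of dicts mapping term -> predicted score, one per protein.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            pred = {term for term, s in scores.items() if s >= t}
            tp = len(pred & truth)
            if pred:  # precision averages only proteins with a prediction at t
                precisions.append(tp / len(pred))
            recalls.append(tp / len(truth) if truth else 0.0)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best


# Hypothetical labels and scores for two proteins (illustration only).
y_true = [{"GO:A", "GO:B"}, {"GO:C"}]
y_score = [
    {"GO:A": 0.9, "GO:B": 0.4, "GO:C": 0.2},
    {"GO:C": 0.8, "GO:A": 0.5},
]
result = fmax(y_true, y_score)
print(f"Fmax = {result:.3f}")
```

Precision is averaged only over proteins that receive at least one prediction at a given threshold, while recall is averaged over all proteins, so raising the threshold trades recall for precision and the maximum picks the best balance.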
Affiliation(s)
- Lu Meng
- College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
| | - Xiaoran Wang
- College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
|
43
|
Meher PK, Pradhan UK, Sethi PL, Naha S, Gupta A, Parsad R. PredPSP: a novel computational tool to discover pathway-specific photosynthetic proteins in plants. PLANT MOLECULAR BIOLOGY 2024; 114:106. [PMID: 39316155 DOI: 10.1007/s11103-024-01500-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 09/04/2024] [Indexed: 09/25/2024]
Abstract
Photosynthetic proteins play a crucial role in agricultural productivity by harnessing light energy for plant growth. Understanding these proteins, especially within C3 and C4 pathways, holds promise for improving crops in challenging environments. Despite existing models, a comprehensive computational framework specifically targeting plant photosynthetic proteins is lacking. The underutilization of plant datasets in computational algorithms accentuates the gap this study aims to fill by introducing a novel sequence-based computational method for identifying these proteins. The scope of this study encompassed diverse plant species, ensuring comprehensive representation across C3 and C4 pathways. Utilizing six deep learning models and seven shallow learning algorithms, paired with six sequence-derived feature sets followed by a feature selection strategy, this study developed a comprehensive model for the prediction of plant-specific photosynthetic proteins. Following 5-fold cross-validation analysis, LightGBM with 65 and 90 LGBM-VIM selected features, respectively, emerged as the best models for C3 (auROC: 91.78%, auPRC: 92.55%) and C4 (auROC: 99.05%, auPRC: 99.18%) plants. Validation using an independent dataset confirmed the robustness of the proposed model for both C3 (auROC: 87.23%, auPRC: 88.40%) and C4 (auROC: 92.83%, auPRC: 92.29%) categories. Comparison with existing methods demonstrated the superiority of the proposed model in predicting plant-specific photosynthetic proteins. This study further established a free online prediction server, PredPSP (https://iasri-sg.icar.gov.in/predpsp/), to facilitate ongoing efforts for identifying photosynthetic proteins in C3 and C4 plants. Being the first of its kind, this study offers valuable insights into predicting plant-specific photosynthetic proteins, which holds significant implications for plant biology.
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India.
| | - Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Padma Lochan Sethi
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, 751003, Odisha, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
|
44
|
Bai P, Li G, Luo J, Liang C. Deep learning model for protein multi-label subcellular localization and function prediction based on multi-task collaborative training. Brief Bioinform 2024; 25:bbae568. [PMID: 39489606 PMCID: PMC11531862 DOI: 10.1093/bib/bbae568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Revised: 09/24/2024] [Accepted: 10/22/2024] [Indexed: 11/05/2024] Open
Abstract
The functional study of proteins is a critical task in modern biology, playing a pivotal role in understanding the mechanisms of pathogenesis, developing new drugs, and discovering novel drug targets. However, existing computational models for subcellular localization face significant challenges, such as reliance on known Gene Ontology (GO) annotation databases or overlooking the relationship between GO annotations and subcellular localization. To address these issues, we propose DeepMTC, an end-to-end deep learning-based multi-task collaborative training model. DeepMTC integrates the interrelationship between subcellular localization and the functional annotation of proteins, leveraging multi-task collaborative training to eliminate dependence on known GO databases. This strategy gives DeepMTC a distinct advantage in predicting newly discovered proteins without prior functional annotations. First, DeepMTC leverages a highly accurate pre-trained language model to obtain the 3D structure and sequence features of proteins. Additionally, it employs a graph transformer module to encode protein sequence features, addressing the problem of long-range dependencies in graph neural networks. Finally, DeepMTC uses a functional cross-attention mechanism to efficiently combine upstream learned functional features to perform the subcellular localization task. The experimental results demonstrate that DeepMTC outperforms state-of-the-art models in both protein function prediction and subcellular localization. Moreover, interpretability experiments revealed that DeepMTC can accurately identify the key residues and functional domains of proteins, confirming its superior performance. The code and dataset of DeepMTC are freely available at https://github.com/ghli16/DeepMTC.
Affiliation(s)
- Peihao Bai
- School of Information and Software Engineering, East China Jiaotong University, No. 808 Shuanggang East Road, Nanchang 330013, China
| | - Guanghui Li
- School of Information and Software Engineering, East China Jiaotong University, No. 808 Shuanggang East Road, Nanchang 330013, China
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, No. 2 Lushan Road, Changsha 410082, China
| | - Cheng Liang
- School of Information Science and Engineering, Shandong Normal University, No. 1 University Road, Jinan 250358, China
- Shandong Key Laboratory of Biophysics, Dezhou University, No. 566 University Road, Dezhou 253023, China
|
45
|
Mi J, Wang H, Li J, Sun J, Li C, Wan J, Zeng Y, Gao J. GGN-GO: geometric graph networks for predicting protein function by multi-scale structure features. Brief Bioinform 2024; 25:bbae559. [PMID: 39487084 PMCID: PMC11530295 DOI: 10.1093/bib/bbae559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Revised: 10/03/2024] [Accepted: 10/17/2024] [Indexed: 11/04/2024] Open
Abstract
Recent advances in high-throughput sequencing have led to an explosion of genomic and transcriptomic data, offering a wealth of protein sequence information. However, the functions of most proteins remain unannotated. Traditional experimental methods for annotation of protein functions are costly and time-consuming. Current deep learning methods typically rely on Graph Convolutional Networks to propagate features between protein residues. However, these methods fail to capture fine atomic-level geometric structural features and cannot directly compute or propagate structural features (such as distances, directions, and angles) when transmitting features, often simplifying them to scalars. Additionally, difficulties in capturing long-range dependencies limit the model's ability to identify key nodes (residues). To address these challenges, we propose a geometric graph network (GGN-GO) for predicting protein function that enriches feature extraction by capturing multi-scale geometric structural features at the atomic and residue levels. We use a geometric vector perceptron to convert these features into vector representations and aggregate them with node features for better understanding and propagation in the network. Moreover, we introduce a graph attention pooling layer that captures key node information by adaptively aggregating local functional motifs, while contrastive learning enhances graph representation discriminability through random noise and different views. The experimental results show that GGN-GO outperforms six comparative methods in tasks with the most labels for both experimentally validated and predicted protein structures. Furthermore, GGN-GO identifies functional residues corresponding to those experimentally confirmed, showcasing its interpretability and the ability to pinpoint key protein regions. The code and data are available at: https://github.com/MiJia-ID/GGN-GO.
Affiliation(s)
- Jia Mi
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Han Wang
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Jing Li
- The College of Life Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Jinghong Sun
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Chang Li
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Jing Wan
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
| | - Yuan Zeng
- Microbial Resource and Big Data Center, Institute of Microbiology, Chinese Academy of Sciences
- Chinese National Microbiology Data Center (NMDC)
| | - Jingyang Gao
- The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
|
46
|
Barrios-Núñez I, Martínez-Redondo G, Medina-Burgos P, Cases I, Fernández R, Rojas A. Decoding functional proteome information in model organisms using protein language models. NAR Genom Bioinform 2024; 6:lqae078. [PMID: 38962255 PMCID: PMC11217674 DOI: 10.1093/nargab/lqae078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2024] [Revised: 05/31/2024] [Accepted: 06/26/2024] [Indexed: 07/05/2024] Open
Abstract
Protein language models have been tested and proved to be reliable when used on curated datasets but have not yet been applied to full proteomes. Accordingly, we tested how two different machine learning-based methods performed when decoding functional information from the proteomes of selected model organisms. We found that protein language models are more precise and informative than deep learning methods for all the species tested and across the three gene ontologies studied, and that they better recover functional information from transcriptomic experiments. The results obtained indicate that these language models are likely to be suitable for large-scale annotation and downstream analyses, and we recommend a guide for their use.
Affiliation(s)
- Israel Barrios-Núñez
- Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Patricia Medina-Burgos
- Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Ildefonso Cases
- Bioinformatics Unit, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Rosa Fernández
- Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain
- Ana M Rojas
- Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
|
47
|
Yan H, Wang S, Liu H, Mamitsuka H, Zhu S. GORetriever: reranking protein-description-based GO candidates by literature-driven deep information retrieval for protein function annotation. Bioinformatics 2024; 40:ii53-ii61. [PMID: 39230707 PMCID: PMC11520413 DOI: 10.1093/bioinformatics/btae401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
SUMMARY: The vast majority of proteins still lack experimentally validated functional annotations, which highlights the importance of developing high-performance automated protein function prediction/annotation (AFP) methods. While existing approaches focus on protein sequences, networks, and structural data, textual information related to proteins has been overlooked. However, roughly 82% of SwissProt proteins already possess literature information that experts have annotated. To efficiently and effectively use literature information, we present GORetriever, a two-stage deep information retrieval-based method for AFP. Given a target protein, in the first stage, candidate Gene Ontology (GO) terms are retrieved by using annotated proteins with similar descriptions. In the second stage, the GO terms are reranked based on semantic matching between the GO definitions and textual information (literature and protein description) of the target protein. Extensive experiments over benchmark datasets demonstrate the remarkable effectiveness of GORetriever in enhancing the AFP performance. Note that GORetriever is the key component of GOCurator, which achieved first place in the latest critical assessment of protein function annotation (CAFA5, in which over 1600 teams participated), held in 2023-2024. AVAILABILITY AND IMPLEMENTATION: GORetriever is publicly available at https://github.com/ZhuLab-Fudan/GORetriever.
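The two-stage retrieve-then-rerank idea described in this abstract can be illustrated with a minimal sketch. It is not GORetriever's implementation: the deep retrieval and semantic matching models are replaced here by plain Jaccard token overlap, and every protein description, GO label (`GO_KINASE` and so on), and definition below is hypothetical toy data.

```python
def tokens(text):
    """Lowercase whitespace tokenization (a stand-in for a learned text encoder)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve_candidates(query_desc, annotated, top_k=2):
    """Stage 1: pool GO terms from the top_k annotated proteins
    whose descriptions are most similar to the query description."""
    ranked = sorted(
        annotated,
        key=lambda p: jaccard(query_desc, p["description"]),
        reverse=True,
    )
    candidates = set()
    for protein in ranked[:top_k]:
        candidates.update(protein["go_terms"])
    return candidates


def rerank(candidates, target_text, go_definitions):
    """Stage 2: order candidates by similarity between each GO definition
    and the target protein's text (ties broken by label for determinism)."""
    return sorted(
        candidates,
        key=lambda go: (-jaccard(go_definitions[go], target_text), go),
    )


# Hypothetical toy data; labels and definitions are illustrative only.
annotated = [
    {"description": "serine threonine protein kinase",
     "go_terms": ["GO_KINASE", "GO_ATP_BINDING"]},
    {"description": "transcription factor DNA binding protein",
     "go_terms": ["GO_DNA_BINDING"]},
    {"description": "tyrosine protein kinase receptor",
     "go_terms": ["GO_KINASE"]},
]
go_definitions = {
    "GO_KINASE": "catalysis of phosphate transfer to a substrate by a kinase",
    "GO_ATP_BINDING": "binding to ATP",
    "GO_DNA_BINDING": "binding to DNA",
}
query_desc = "putative protein kinase"
target_text = "putative protein kinase with reported kinase activity toward a substrate"

candidates = retrieve_candidates(query_desc, annotated, top_k=2)
ranking = rerank(candidates, target_text, go_definitions)
print(ranking)
```

Separating the stages keeps the expensive matching step restricted to a small candidate pool; in this sketch both stages happen to share one similarity function, which a real system would not need to do.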
Affiliation(s)
- Huiying Yan
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Shaojun Wang
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Hancheng Liu
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto Prefecture 611-0011, Japan
- Department of Computer Science, Aalto University, Espoo 00076, Finland
- Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, 200433, China
- Shanghai Key Lab of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai, 200433, China
- Zhangjiang Fudan International Innovation Center, Shanghai, 200433, China
|
48
|
Boadu F, Cheng J. Improving protein function prediction by learning and integrating representations of protein sequences and function labels. Bioinformatics Advances 2024; 4:vbae120. [PMID: 39233898 PMCID: PMC11374024 DOI: 10.1093/bioadv/vbae120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2024] [Revised: 07/31/2024] [Accepted: 08/12/2024] [Indexed: 09/06/2024]
Abstract
Motivation: As fewer than 1% of proteins have experimentally determined function information, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in protein function annotation databases such as UniProt. Results: We introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBERT) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definitions and hierarchical relationships; the two representations are combined via cross-attention to predict protein function. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy, but also delivers robust performance in predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms. Availability and implementation: https://github.com/BioinfoMachineLearning/TransFew.
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
- Jianlin Cheng
- Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO 65211, United States
|
49
|
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024]
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present a method, PhiGnet, that utilizes statistics-informed graph networks to predict a protein's functions solely from its sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
- Department of Biochemistry, University of Oxford, Oxford, UK
- AmoAi Technologies, Oxford, UK
- Qi-Qi Qin
- AmoAi Technologies, Oxford, UK
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Si-Yu Huang
- AmoAi Technologies, Oxford, UK
- Oxford Martin School, University of Oxford, Oxford, UK
- School of Systems Science, Beijing Normal University, Beijing, China
- Xue-Ming Ding
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Benoît Kornmann
- Department of Biochemistry, University of Oxford, Oxford, UK
|
50
|
Dickson A, Mofrad MRK. Fine-tuning protein embeddings for functional similarity evaluation. Bioinformatics 2024; 40:btae445. [PMID: 38985218 PMCID: PMC11299545 DOI: 10.1093/bioinformatics/btae445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2024] [Revised: 06/25/2024] [Accepted: 07/09/2024] [Indexed: 07/11/2024]
Abstract
MOTIVATION: Proteins with unknown function are frequently compared to better characterized relatives, either using sequence similarity or, more recently, through similarity in a learned embedding space. Through comparison, protein sequence embeddings allow for interpretable and accurate annotation of proteins, as well as for downstream tasks such as clustering for unsupervised discovery of protein families. However, it is unclear whether embeddings can be deliberately designed to improve their use in these downstream tasks. RESULTS: We find that for functional annotation of proteins, as represented by Gene Ontology (GO) terms, direct fine-tuning of language models on a simple classification loss has an immediate positive impact on protein embedding quality. Fine-tuned embeddings perform better as representations for K-nearest neighbor classifiers, reaching stronger performance for GO annotation than even directly comparable fine-tuned classifiers, while maintaining interpretability through protein similarity comparisons. They also maintain their quality in related tasks, such as rediscovering protein families with clustering. AVAILABILITY AND IMPLEMENTATION: github.com/mofradlab/go_metric.
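As a rough illustration of how a K-nearest neighbor classifier can transfer GO annotations through an embedding space, the sketch below scores each GO term by the fraction of the k most similar reference proteins that carry it. It covers only the neighbor-voting step, not the fine-tuning the abstract describes; the 3-dimensional embeddings, protein IDs, and GO labels are hypothetical, and the function name `knn_go_scores` is not taken from the authors' code.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_go_scores(query, reference, k=2):
    """Score GO terms by the fraction of the k nearest reference
    proteins (by cosine similarity) annotated with each term."""
    neighbors = sorted(
        reference,
        key=lambda r: cosine(query, r["embedding"]),
        reverse=True,
    )[:k]
    scores = defaultdict(float)
    for r in neighbors:
        for go in r["go_terms"]:
            scores[go] += 1.0 / k
    return dict(scores), [r["id"] for r in neighbors]


# Hypothetical toy reference set; real embeddings would come from a
# protein language model and have hundreds or thousands of dimensions.
reference = [
    {"id": "P1", "embedding": [1.0, 0.0, 0.0], "go_terms": ["GO_A", "GO_B"]},
    {"id": "P2", "embedding": [0.9, 0.1, 0.0], "go_terms": ["GO_A"]},
    {"id": "P3", "embedding": [0.0, 1.0, 0.0], "go_terms": ["GO_C"]},
]
query = [1.0, 0.05, 0.0]

scores, neighbor_ids = knn_go_scores(query, reference, k=2)
print(neighbor_ids, scores)
```

Returning the neighbor IDs alongside the scores reflects the interpretability point made in the abstract: each prediction can be traced back to the specific annotated proteins that supported it.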
Affiliation(s)
- Andrew Dickson
- Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
- Mohammad R K Mofrad
- Departments of Bioengineering and Mechanical Engineering, Molecular Cell Biomechanics Laboratory, University of California, Berkeley, CA 94720, United States
|