1
Liu J, Li K, Tang X, Zhang Y, Guan X. Grain protein function prediction based on improved FCN and bidirectional LSTM. Food Chem 2025; 482:143955. [PMID: 40209386 DOI: 10.1016/j.foodchem.2025.143955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 03/10/2025] [Accepted: 03/17/2025] [Indexed: 04/12/2025]
Abstract
With the development of high-throughput sequencing technologies, predicting grain protein function from amino acid sequences with intelligent models has become a significant task in bioinformatics. Soybean, maize, indica, and japonica are selected from UniProtKB as the grain dataset. To address the neglect of amino acid sequence order and of long-term dependencies between amino acids, this paper proposes the PBiLSTM-FCN model for predicting grain protein function. The order of the amino acid sequence is handled by the Fully Convolutional Network (FCN), and long-term dependencies between amino acids are handled by the bidirectional Long Short-Term Memory network (BiLSTM). The experimental results show that the PBiLSTM-FCN model outperforms existing models and predicts more accurately because it captures both long-range dependencies and the order of amino acid sequences. Finally, interpretability analyses comparing actual and predicted protein functions demonstrate the effectiveness of the PBiLSTM-FCN model in predicting grain protein function.
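The point about sequence order can be illustrated with a small sketch. It is not the PBiLSTM-FCN model; it only contrasts an order-free feature (residue counts) with an order-preserving one (per-position one-hot encoding), assuming the 20 standard amino acids. The two example sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumed alphabet)

def composition(seq):
    """Order-free feature: how many times each residue occurs."""
    counts = Counter(seq)
    return [counts.get(aa, 0) for aa in AMINO_ACIDS]

def one_hot(seq):
    """Order-preserving feature: one 20-dimensional indicator vector per position."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        row[index[aa]] = 1
        rows.append(row)
    return rows

# Hypothetical toy sequences: same residues, reversed order.
seq_a = "MKTAY"
seq_b = "YATKM"

same_composition = composition(seq_a) == composition(seq_b)
same_one_hot = one_hot(seq_a) == one_hot(seq_b)
print(same_composition, same_one_hot)
```

A model fed only the composition vector cannot tell these two sequences apart, whereas convolutional or recurrent layers applied to the positional encoding can.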
Affiliation(s)
- Jing Liu
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Kun Li
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Xinghua Tang
- College of Information Engineering, Shanghai Maritime University, Shanghai 201306, China
- Yu Zhang
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; National Grain Industry (Urban Grain and Oil Security) Technology Innovation Center, Shanghai 200093, China
- Xiao Guan
- School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China; National Grain Industry (Urban Grain and Oil Security) Technology Innovation Center, Shanghai 200093, China.
2
Cross MCG, Aboulnaga E, TerAvest MA. A small number of point mutations confer formate tolerance in Shewanella oneidensis. Appl Environ Microbiol 2025; 91:e0196824. [PMID: 40207971 DOI: 10.1128/aem.01968-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Accepted: 01/18/2025] [Indexed: 04/11/2025] Open
Abstract
Microbial electrosynthesis (MES) is a sustainable approach to chemical production from CO2 and clean electricity. However, limitations in electron transfer efficiency and gaps in understanding of electron transfer pathways in MES systems prevent full realization of this technology. Shewanella oneidensis could serve as an MES biocatalyst because it has a well-studied, efficient transmembrane electron transfer pathway. A key first step in MES in this organism could be CO2 reduction to formate. However, we report that wild-type S. oneidensis does not tolerate high levels of formate. In this work, we created and characterized formate-tolerant strains of S. oneidensis for further engineering and future use in MES systems through adaptive laboratory evolution. Two different point mutations in a gene encoding a predicted sodium-dependent bicarbonate transporter and a DUF2721-containing protein separately confer formate tolerance to S. oneidensis. The mutations were further evaluated to understand their role in improving formate tolerance. We also show that the wild-type and mutant versions of the putative sodium-dependent bicarbonate transporter improve formate tolerance of Zymomonas mobilis, indicating the potential of transferring this formate tolerance phenotype to other organisms.
IMPORTANCE Shewanella oneidensis is a bacterium with a well-studied, efficient extracellular electron transfer pathway. This capability could make this organism a suitable host for microbial electrosynthesis using CO2 or formate as feedstocks. However, we report here that formate is toxic to S. oneidensis, limiting the potential for its use in these systems. In this work, we evolve several strains of S. oneidensis that have improved formate tolerance, and we investigate some mutations that confer this phenotype. The phenotype is confirmed to be attributed to several single point mutations by transferring the wild-type and mutant versions of each gene to the wild-type strain. Finally, the formate tolerance mechanism of one variant is studied using structural modeling and expression in another host. This study, therefore, presents a simple method for conferring formate tolerance to bacterial hosts.
Affiliation(s)
- Megan C Gruenberg Cross
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
- Elhussiny Aboulnaga
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
- Faculty of Agriculture, Mansoura University, Mansoura, Egypt
- Michaela A TerAvest
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, USA
3
Kong D, Qian J, Gao C, Wang Y, Shi T, Ye C. Machine Learning Empowering Microbial Cell Factory: A Comprehensive Review. Appl Biochem Biotechnol 2025:10.1007/s12010-025-05260-x. [PMID: 40397295 DOI: 10.1007/s12010-025-05260-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 05/02/2025] [Indexed: 05/22/2025]
Abstract
The wide application of machine learning has opened new possibilities for biological manufacturing, and its combination with synthetic biology has created considerable, hard-to-predict value for upgrading microbial cell factories. This review examines the synergies between machine learning and synthetic biology that merit further investigation in biotechnology. We explore relevant databases, toolboxes, and machine learning-derived models. Furthermore, we examine specific applications of this combined approach in chemical production, human health, and environmental remediation. By elucidating these successful integrations, this review aims to provide valuable guidance for future research at the intersection of biomanufacturing and artificial intelligence.
Affiliation(s)
- Dechun Kong
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Jinyi Qian
- Ministry of Education Key Laboratory of NSLSCS, Nanjing Normal University, Nanjing, 210023, People's Republic of China
- Cong Gao
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, People's Republic of China
- Yuetong Wang
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China.
- Tianqiong Shi
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China.
- State Key Laboratory of Microbial Technology, Nanjing Normal University, Nanjing, 210023, People's Republic of China.
- Chao Ye
- School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, Nanjing, 210023, People's Republic of China.
- Ministry of Education Key Laboratory of NSLSCS, Nanjing Normal University, Nanjing, 210023, People's Republic of China.
4
Percudani R, De Rito C. Predicting Protein Function in the AI and Big Data Era. Biochemistry 2025. [PMID: 40380914 DOI: 10.1021/acs.biochem.5c00186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/19/2025]
Abstract
It is an exciting time for researchers working to link proteins to their functions. Most techniques for extracting functional information from genomic sequences were developed several years ago, with major progress driven by the availability of big data. Now, groundbreaking advances in deep-learning and AI-based methods have enriched protein databases with three-dimensional information and offer the potential to predict biochemical properties and biomolecular interactions, providing key functional insights. This progress is expected to increase the proportion of functionally bright proteins in databases and deepen our understanding of life at the molecular level.
Affiliation(s)
- Riccardo Percudani
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, 43124 Parma, Italy
- Carlo De Rito
- Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, 43124 Parma, Italy
5
Shao J, Chen J, Liu B. ProFun-SOM: Protein Function Prediction for Specific Ontology Based on Multiple Sequence Alignment Reconstruction. IEEE Trans Neural Netw Learn Syst 2025; 36:8060-8071. [PMID: 38980781 DOI: 10.1109/tnnls.2024.3419250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2024]
Abstract
Protein function prediction is crucial for understanding species evolution, including viral mutations. Gene ontology (GO) is a standardized representation framework for describing protein functions with annotated terms. Each ontology is a specific functional category containing multiple child ontologies, and the relationships of parent and child ontologies create a directed acyclic graph. Protein functions are categorized using GO, which divides them into three main groups: cellular component ontology, molecular function ontology, and biological process ontology. Therefore, the GO annotation of proteins is a hierarchical multilabel classification problem. This hierarchical relationship introduces complexities such as the mixed ontology problem, leading to performance bottlenecks in existing computational methods due to label dependency and data sparsity. To overcome the bottlenecks brought by the mixed ontology problem, we propose ProFun-SOM, an innovative multilabel classifier that utilizes multiple sequence alignments (MSAs) to accurately annotate gene ontologies. ProFun-SOM enhances the initial MSAs through a reconstruction process and integrates them into a deep learning architecture. It then predicts annotations within the cellular component, molecular function, biological process, and mixed ontologies. Our evaluation results on three datasets (CAFA3, SwissProt, and NetGO2) demonstrate that ProFun-SOM surpasses state-of-the-art methods. This study confirmed that utilizing MSAs of proteins can effectively overcome the two main bottleneck issues, label dependency and data sparsity, thereby alleviating the root problem, mixed ontology. A freely accessible web server is available at http://bliulab.net/ProFun-SOM/.
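The hierarchical multilabel structure described above can be sketched in a few lines. This is not ProFun-SOM; it only shows how annotating a term in a directed acyclic graph implies all of its ancestors, which is what creates the label dependency. GO:0008150 is the biological process root; the `TOY:` terms and the helper names are hypothetical.

```python
# Toy parent map: child term -> set of parent terms.
# In a DAG a term may have more than one parent.
PARENTS = {
    "GO:0008150": set(),                       # biological_process root
    "TOY:metabolic": {"GO:0008150"},           # hypothetical term
    "TOY:response": {"GO:0008150"},            # hypothetical term
    "TOY:stress_metabolism": {"TOY:metabolic", "TOY:response"},  # hypothetical term
}

def with_ancestors(terms, parents):
    """Return the given terms plus every ancestor reachable through the parent map."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term in closed:
            continue
        closed.add(term)
        stack.extend(parents.get(term, ()))
    return closed

def label_vector(terms, parents, term_order):
    """Multilabel target: 1 for each annotated term and each of its ancestors."""
    closed = with_ancestors(terms, parents)
    return [1 if term in closed else 0 for term in term_order]

TERM_ORDER = sorted(PARENTS)
leaf_labels = label_vector({"TOY:stress_metabolism"}, PARENTS, TERM_ORDER)
mid_labels = label_vector({"TOY:metabolic"}, PARENTS, TERM_ORDER)
print(TERM_ORDER)
print(leaf_labels, mid_labels)
```

Because a single deep annotation switches on several labels at once, the labels are not independent, and specific terms near the leaves have far fewer positive examples than general terms near the root.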
6
de Oliveira GB, Pedrini H, Dias Z. SUPERMAGO: Protein Function Prediction Based on Transformer Embeddings. Proteins 2025; 93:981-996. [PMID: 39711079 DOI: 10.1002/prot.26782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 11/28/2024] [Accepted: 12/09/2024] [Indexed: 12/24/2024]
Abstract
Recent technological advancements have enabled the experimental determination of amino acid sequences for numerous proteins. However, analyzing protein functions, which is essential for understanding their roles within cells, remains a challenging task due to the associated costs and time constraints. To address this challenge, various computational approaches have been proposed to aid in the categorization of protein functions, mainly utilizing amino acid sequences. In this study, we introduce SUPERMAGO, a method that leverages amino acid sequences to predict protein functions. Our approach employs Transformer architectures, pre-trained on protein data, to extract features from the sequences. We use multilayer perceptrons for classification and a stacking neural network to aggregate the predictions, which significantly enhances the performance of our method. We also present SUPERMAGO+, an ensemble of SUPERMAGO and DIAMOND, based on neural networks that assign different weights to each term, offering a novel weighting mechanism compared with existing methods in the literature. Additionally, we introduce SUPERMAGO+Web, a web server-compatible version of SUPERMAGO+ designed to operate with reduced computational resources. Both SUPERMAGO and SUPERMAGO+ consistently outperformed state-of-the-art approaches in our evaluations, establishing them as leading methods for this task when considering only amino acid sequence information.
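The idea of weighting each term differently when combining two predictors can be sketched as follows. This is only an illustration under stated assumptions: in the abstract the per-term weights come from neural networks, whereas here they are fixed, hand-set numbers, and the function name, term names, and scores are all hypothetical.

```python
def combine_per_term(model_scores, homology_scores, weights, default_weight=0.5):
    """Blend two score dictionaries term by term.

    weights[term] is the share given to the model score for that term;
    the remainder goes to the homology-based score. Missing scores count as 0.
    """
    combined = {}
    for term in set(model_scores) | set(homology_scores):
        w = weights.get(term, default_weight)
        m = model_scores.get(term, 0.0)
        h = homology_scores.get(term, 0.0)
        combined[term] = w * m + (1.0 - w) * h
    return combined

# Hypothetical scores for one protein.
model_scores = {"T1": 0.8, "T2": 0.2}       # e.g. from a sequence-embedding classifier
homology_scores = {"T1": 0.4, "T3": 0.6}    # e.g. derived from alignment hits
weights = {"T1": 0.75, "T2": 1.0}           # hand-set here; learned in the real method

combined = combine_per_term(model_scores, homology_scores, weights)
print(sorted(combined.items()))
```

A single global weight would force the same trade-off on every term; per-term weights let the ensemble lean on homology where alignments are informative and on the learned model elsewhere.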
Affiliation(s)
- Helio Pedrini
- Institute of Computing, University of Campinas, Campinas, Brazil
- Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
7
Kong Y, Chen H, Huang X, Chang L, Yang B, Chen W. Precise metabolic modeling in post-omics era: accomplishments and perspectives. Crit Rev Biotechnol 2025; 45:683-701. [PMID: 39198033 DOI: 10.1080/07388551.2024.2390089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 07/18/2024] [Accepted: 07/23/2024] [Indexed: 09/01/2024]
Abstract
Microbes have been extensively utilized for their sustainable and scalable properties in synthesizing desired bio-products. However, insufficient knowledge about intracellular metabolism has impeded further microbial applications. Genome-scale metabolic models (GEMs) play a pivotal role in facilitating a global understanding of cellular metabolic mechanisms. These models enable rational modification by exploring metabolic pathways and predicting potential targets in microorganisms, allowing precise cell regulation without experimental costs. Nonetheless, a simplified GEM considers only genome information and network stoichiometry while neglecting other important bio-information, such as enzyme functions, thermodynamic properties, and kinetic parameters. Consequently, uncertainties persist, particularly when predicting microbial behaviors in complex and fluctuating systems. The advent of the omics era, with its massive quantification of genes, proteins, and metabolites under various conditions, has led to the flourishing of multi-constrained models and updated algorithms with improved predictive power and broadened scope. Meanwhile, machine learning (ML) has demonstrated exceptional analytical and predictive capacities when applied to training sets of biological big data. Combining the discriminative strength of ML with GEMs improves mechanistic modeling efficiency and predictive accuracy. This paper provides an overview of research innovations in GEMs, including multi-constrained modeling, analytical approaches, and the latest applications of ML, which may contribute comprehensive knowledge toward genetic refinement, strain development, and yield enhancement for a broad range of biomolecules.
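What "network stoichiometry" contributes to a GEM can be shown with a minimal sketch: a stoichiometric matrix S and a flux vector v are at steady state when S·v = 0 for every internal metabolite. The three-reaction network, the flux values, and the function names below are hypothetical and far smaller than any real model.

```python
# Rows: metabolites A, B. Columns: uptake (-> A), conversion (A -> B), export (B ->).
S = [
    [1, -1, 0],   # A is produced by uptake and consumed by conversion
    [0, 1, -1],   # B is produced by conversion and consumed by export
]

def net_production(S, v):
    """Net production rate of each metabolite, i.e. the product S·v."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True when no metabolite accumulates or is depleted."""
    return all(abs(rate) <= tol for rate in net_production(S, v))

balanced = [2.0, 2.0, 2.0]     # every flux equal: nothing accumulates
unbalanced = [2.0, 1.0, 1.0]   # uptake exceeds conversion: A accumulates

print(net_production(S, balanced), is_steady_state(S, balanced))
print(net_production(S, unbalanced), is_steady_state(S, unbalanced))
```

Stoichiometry alone only restricts which flux vectors are feasible. The enzyme, thermodynamic, and kinetic constraints mentioned in the abstract narrow that feasible set further.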
Affiliation(s)
- Yawen Kong
- State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
- School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Haiqin Chen
- State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
- School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Xinlei Huang
- The Key Laboratory of Industrial Biotechnology, School of Biotechnology, Jiangnan University, Wuxi, P. R. China
- Lulu Chang
- State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
- School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Bo Yang
- State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
- School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- Wei Chen
- State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, P. R. China
- School of Food Science and Technology, Jiangnan University, Wuxi, P. R. China
- National Engineering Research Center for Functional Food, Jiangnan University, Wuxi, P. R. China
8
Kelly T, Xia S, Lu J, Zhang Y. Unified Deep Learning of Molecular and Protein Language Representations with T5ProtChem. J Chem Inf Model 2025; 65:3990-3998. [PMID: 40197028 PMCID: PMC12042257 DOI: 10.1021/acs.jcim.5c00051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2025] [Revised: 03/19/2025] [Accepted: 03/31/2025] [Indexed: 04/09/2025]
Abstract
Deep learning has revolutionized difficult tasks in chemistry and biology, yet existing language models often treat these domains separately, relying on concatenated architectures and independently pretrained weights. These approaches fail to fully exploit the shared atomic foundations of molecular and protein sequences. Here, we introduce T5ProtChem, a unified model based on the T5 architecture, designed to simultaneously process molecular and protein sequences. Using a new pretraining objective, ProtiSMILES, T5ProtChem bridges the molecular and protein domains, enabling efficient, generalizable protein-chemical modeling. The model achieves a state-of-the-art performance in tasks such as binding affinity prediction and reaction prediction, while having a strong performance in protein function prediction. Additionally, it supports novel applications, including covalent binder classification and sequence-level adduct prediction. These results demonstrate the versatility of unified language models for drug discovery, protein engineering, and other interdisciplinary efforts in computational biology and chemistry.
Affiliation(s)
- Thomas Kelly
- Department of Chemistry, New York University, New York, New York 10003, United States
- Song Xia
- Department of Chemistry, New York University, New York, New York 10003, United States
- Jieyu Lu
- Department of Chemistry, New York University, New York, New York 10003, United States
- Yingkai Zhang
- Department of Chemistry, New York University, New York, New York 10003, United States
- Simons Center for Computational Physical Chemistry at New York University, New York, New York 10003, United States
- NYU-ECNU Center for Computational Chemistry at NYU Shanghai, Shanghai 200062, China
9
Zhang H, Sun Y, Wang Y, Luo X, Liu Y, Chen B, Jin X, Zhu D. GTPLM-GO: Enhancing Protein Function Prediction Through Dual-Branch Graph Transformer and Protein Language Model Fusing Sequence and Local-Global PPI Information. Int J Mol Sci 2025; 26:4088. [PMID: 40362328 PMCID: PMC12072039 DOI: 10.3390/ijms26094088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2025] [Revised: 04/21/2025] [Accepted: 04/23/2025] [Indexed: 05/15/2025] Open
Abstract
Currently, protein-protein interaction (PPI) networks have become an essential data source for protein function prediction. However, methods utilizing graph neural networks (GNNs) face significant challenges in modeling PPI networks. A primary issue is over-smoothing, which occurs when multiple GNN layers are stacked to capture global information. This architectural limitation inherently impairs the integration of local and global information within PPI networks, thereby limiting the accuracy of protein function prediction. To effectively utilize information within PPI networks, we propose GTPLM-GO, a protein function prediction method based on a dual-branch Graph Transformer and protein language model. The dual-branch Graph Transformer achieves the collaborative modeling of local and global information in PPI networks through two branches: a graph neural network and a linear attention-based Transformer encoder. GTPLM-GO integrates local-global PPI information with the functional semantic encoding constructed by the protein language model, overcoming the issue of inadequate information extraction in existing methods. Experimental results demonstrate that GTPLM-GO outperforms advanced network-based and sequence-based methods on PPI network datasets of varying scales.
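The over-smoothing effect mentioned above can be reproduced with a toy sketch. It is not GTPLM-GO and not a full GNN layer: it applies only the parameter-free neighbor-averaging step (with a self-loop) repeatedly, on a hypothetical four-node path graph, and tracks how far apart the node features remain.

```python
def mean_aggregate(features, neighbors):
    """One propagation step: each node averages its own value with its neighbors' values."""
    updated = []
    for node, nbrs in enumerate(neighbors):
        group = [node] + nbrs
        updated.append(sum(features[j] for j in group) / len(group))
    return updated

def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)

# Hypothetical path graph 0-1-2-3 with one scalar feature per node.
NEIGHBORS = [[1], [0, 2], [1, 3], [2]]
x = [0.0, 0.0, 1.0, 1.0]

spreads = []
for _ in range(30):
    x = mean_aggregate(x, NEIGHBORS)
    spreads.append(spread(x))

print(spreads[0], spreads[4], spreads[-1])
```

After many stacked averaging steps the node features become nearly identical, so distant information is reached only at the cost of losing what distinguishes one node from another. This is the motivation the abstract gives for adding a separate global branch.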
Affiliation(s)
- Haotian Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Yundong Sun
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yansong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Xiaoling Luo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
- Bin Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
- Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China
- Dongjie Zhu
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China
10
Chen S, Zheng P, Zheng L, Yao Q, Meng Z, Lin L, Chen X, Liu R. BERT-DomainAFP: Antifreeze protein recognition and classification model based on BERT and structural domain annotation. iScience 2025; 28:112077. [PMID: 40241758 PMCID: PMC12002629 DOI: 10.1016/j.isci.2025.112077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 01/03/2025] [Accepted: 02/17/2025] [Indexed: 04/18/2025] Open
Abstract
Antifreeze proteins (AFPs) are crucial for organisms to adapt to low temperatures, with applications in medicine, food storage, aquaculture, and agriculture. Accurate AFP identification is challenging due to structural and sequence diversity. To improve prediction and classification, we propose BERT-DomainAFP, a deep learning model trained on the AntiFreezeDomains dataset created with a novel annotation strategy. The model uses pre-trained ProteinBERT and incorporates oversampling and undersampling techniques to handle unbalanced data, ensuring high predictive ability. BERT-DomainAFP achieves 98.48% accuracy, the highest among existing models, and can classify different AFP types based on structural domain features. This model outperforms current tools, offering a promising solution for AFP recognition and classification in research and applications.
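The oversampling and undersampling mentioned above can be sketched generically. The abstract does not say which resampling scheme was used, so this assumes plain random resampling to a fixed per-class size; the function name, sample names, and target size are hypothetical.

```python
import random

def rebalance(samples, labels, target_per_class, seed=0):
    """Randomly resample every class to exactly target_per_class items."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    out_samples, out_labels = [], []
    for label, items in by_class.items():
        if len(items) >= target_per_class:
            # Undersample: draw without replacement from the larger class.
            chosen = rng.sample(items, target_per_class)
        else:
            # Oversample: keep every item and add random duplicates.
            extra = [rng.choice(items) for _ in range(target_per_class - len(items))]
            chosen = items + extra
        out_samples.extend(chosen)
        out_labels.extend([label] * target_per_class)
    return out_samples, out_labels

# Hypothetical imbalanced set: 2 positives (label 1) and 10 negatives (label 0).
samples = ["afp1", "afp2"] + ["non%d" % i for i in range(10)]
labels = [1, 1] + [0] * 10

new_samples, new_labels = rebalance(samples, labels, target_per_class=4)
print(new_labels.count(1), new_labels.count(0))
```

Resampling of this kind is normally applied to the training split only, so that duplicated minority examples do not leak into the evaluation data.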
Affiliation(s)
- Shengzhen Chen
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ping Zheng
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Lele Zheng
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Qinglong Yao
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ziyu Meng
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Longshan Lin
- Laboratory of Marine Biodiversity Research, Third Institute of Oceanography, Ministry of Natural Resources, Xiamen 361005, China
- Xinhua Chen
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
- Ruoyu Liu
- State Key Laboratory of Mariculture Breeding, Key Laboratory of Marine Biotechnology of Fujian Province, Institute of Oceanology, College of Marine Sciences, Haixia Institute of Science and Technology, Fujian Agriculture and Forestry University, Fuzhou 350002, China
11
Xiong Y, Yuan S, Xiong Y, Li L, Peng J, Zhang J, Fan X, Jiang C, Sha LN, Wang Z, Peng X, Zhang Z, Yu Q, Lei X, Dong Z, Liu Y, Zhao J, Li G, Yang Z, Jia S, Li D, Sun M, Bai S, Liu J, Yang Y, Ma X. Analysis of allohexaploid wheatgrass genome reveals its Y haplome origin in Triticeae and high-altitude adaptation. Nat Commun 2025; 16:3104. [PMID: 40164609 PMCID: PMC11958778 DOI: 10.1038/s41467-025-58341-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2024] [Accepted: 03/19/2025] [Indexed: 04/02/2025] Open
Abstract
The phylogenetic origin of the Y haplome present in allopolyploid Triticeae species remains unknown. Here, we report the 10.47 Gb chromosome-scale genome of allohexaploid Elymus nutans (StStYYHH). Phylogenomic analyses reveal that the Y haplome is sister to the clade comprising the V and Jv haplomes from Dasypyrum and Thinopyrum. In addition, the H haplome from the Hordeum-like ancestor, the St haplome from the Pseudoroegneria-like ancestor, and the Y haplome are placed in successively diverged clades. Resequencing data reveal the allopolyploid origins with St, Y, and H haplome combinations in Elymus. Population genomic analyses indicate that E. nutans has expanded from medium to high/low-altitude regions. Phenotype/environmental association analyses identify MAPKKK18 promoter mutations reducing its expression, aiding UV-B adaptation in high-altitude populations. These findings enhance understanding of allopolyploid evolution and aid in breeding forage and cereal crops through intergeneric hybridization within Triticeae.
Affiliation(s)
- Yi Xiong
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Shuai Yuan
- State Key Laboratory of Herbage Improvement and Grassland Agro-Ecosystem, College of Ecology, Lanzhou University, Lanzhou, 730000, China
- Yanli Xiong
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Sichuan Academy of Grassland Sciences, Chengdu, Sichuan, 611700, China
- Lizuiyue Li
- National Plateau Wetlands Research Center, Southwest Forestry University, Kunming, 650224, China
- Yunnan Key Laboratory of Plateau Wetland Conservation Restoration and Ecological Services, Southwest Forestry University, Kunming, 650224, China
- Jinghan Peng
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Jin Zhang
- State Key Laboratory of Herbage Improvement and Grassland Agro-Ecosystem, College of Ecology, Lanzhou University, Lanzhou, 730000, China
- Xing Fan
- Triticeae Research Institute, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Chengzhi Jiang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Li-Na Sha
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Zhaoting Wang
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Xue Peng
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Zecheng Zhang
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Qingqing Yu
- Sichuan Academy of Grassland Sciences, Chengdu, Sichuan, 611700, China
- Xiong Lei
- Sichuan Academy of Grassland Sciences, Chengdu, Sichuan, 611700, China
- Zhixiao Dong
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Yingjie Liu
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Junming Zhao
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China
- Guangrong Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Zujun Yang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, 610054, China
- Shangang Jia
- College of Grassland Science and Technology, China Agricultural University, Beijing, 100193, China
- Daxu Li
- Sichuan Academy of Grassland Sciences, Chengdu, Sichuan, 611700, China
- Ming Sun
- School of Life Science and Engineering, Southwest University of Science and Technology, Mianyang, Sichuan, 621010, China
- Shiqie Bai
- School of Life Science and Engineering, Southwest University of Science and Technology, Mianyang, Sichuan, 621010, China.
- Jianquan Liu
- State Key Laboratory of Herbage Improvement and Grassland Agro-Ecosystem, College of Ecology, Lanzhou University, Lanzhou, 730000, China.
- Yongzhi Yang
- State Key Laboratory of Herbage Improvement and Grassland Agro-Ecosystem, College of Ecology, Lanzhou University, Lanzhou, 730000, China.
- Xiao Ma
- College of Grassland Science and Technology, Sichuan Agricultural University, Chengdu, Sichuan, 611130, China.
12
Kim HR, Ji H, Kim GB, Lee SY. Enzyme functional classification using artificial intelligence. Trends Biotechnol 2025:S0167-7799(25)00088-5. [PMID: 40155269 DOI: 10.1016/j.tibtech.2025.03.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2025] [Revised: 02/27/2025] [Accepted: 03/06/2025] [Indexed: 04/01/2025]
Abstract
Enzymes are essential for cellular metabolism, and elucidating their functions is critical for advancing biochemical research. However, experimental methods are often time consuming and resource intensive. To address this, significant efforts have been directed toward applying artificial intelligence (AI) to enzyme function prediction, enabling high-throughput and scalable approaches. In this review, we discuss advances in AI-driven enzyme functional annotation, transitioning from traditional machine learning (ML) methods to state-of-the-art deep learning approaches. We highlight how deep learning enables models to automatically extract features from raw data without manual intervention, leading to enhanced performance. Finally, we discuss the discovery of novel enzyme functions and generation of de novo enzymes through the integration of generative AIs and bio big data as future research directions.
Affiliation(s)
- Ha Rim Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Hongkeun Ji
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Gi Bae Kim
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea
| | - Sang Yup Lee
- Metabolic and Biomolecular Engineering National Research Laboratory, Department of Chemical and Biomolecular Engineering (BK21 four), KAIST Institute for BioCentury, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Systems Metabolic Engineering and Systems Healthcare Cross-Generation Collaborative Laboratory, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Graduate School of Engineering Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; BioProcess Engineering Research Center, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea; Center for Synthetic Biology, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 34141, Republic of Korea.
|
13
|
Mao Y, Xu W, Shun Y, Chai L, Xue L, Yang Y, Li M. A multimodal model for protein function prediction. Sci Rep 2025; 15:10465. [PMID: 40140535 PMCID: PMC11947276 DOI: 10.1038/s41598-025-94612-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2025] [Accepted: 03/14/2025] [Indexed: 03/28/2025] Open
Abstract
Protein function, which is determined by sequence, structure, and other characteristics, plays a crucial role in an organism's performance. Existing protein function prediction methods mainly rely on sequence data and often ignore structural properties that are crucial for accurate prediction. Protein structure provides richer spatial and functional insights, which can significantly improve prediction accuracy. In this work, we propose a multi-modal protein function prediction model (MMPFP) that integrates protein sequence and structure information through the use of GCN, CNN, and Transformer models. We validate the model using the PDBest dataset, demonstrating that MMPFP outperforms traditional single-modal models in the molecular function (MF), biological process (BP), and cellular component (CC) prediction tasks. Specifically, MMPFP achieved AUPR scores of 0.693, 0.355, and 0.478; [Formula: see text] scores of 0.752, 0.629, and 0.691; and [Formula: see text] scores of 0.336, 0.488, and 0.459, showing a 3-5% improvement over single-modal models. Additionally, ablation studies confirm the effectiveness of the Transformer module within the GCN branch, further validating MMPFP's superior performance over existing methods. This multi-modal approach offers a more accurate and comprehensive framework for protein function prediction, addressing key limitations of current models.
Affiliation(s)
- Yu Mao
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - WenHui Xu
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Yue Shun
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - LongXin Chai
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Lei Xue
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China
| | - Yong Yang
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
| | - Mei Li
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, 430062, Hubei, China.
|
14
|
Li J, Chen X, Huang H, Zeng M, Yu J, Gong X, Ye Q. $\mathcal{S}$able: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform 2025; 26:bbaf120. [PMID: 40163822 PMCID: PMC11957296 DOI: 10.1093/bib/bbaf120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2024] [Revised: 01/23/2025] [Accepted: 02/23/2025] [Indexed: 04/02/2025] Open
Abstract
Protein pre-training has emerged as a transformative approach for solving diverse biological tasks. While many contemporary methods focus on sequence-based language models, recent findings highlight that protein sequences alone are insufficient to capture the extensive information inherent in protein structures. Recognizing the crucial role of protein structure in defining function and interactions, we introduce $\mathcal{S}$able, a versatile pre-training model designed to comprehensively understand protein structures. $\mathcal{S}$able incorporates a novel structural encoding mechanism that enhances inter-atomic information exchange and spatial awareness, combined with robust pre-training strategies and lightweight decoders optimized for specific downstream tasks. This approach enables $\mathcal{S}$able to consistently outperform existing methods in tasks such as generation, classification, and regression, demonstrating its superior capability in protein structure representation. The code and models can be accessed via GitHub repository at https://github.com/baaihealth/Sable.
Affiliation(s)
- Jiashan Li
- Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
| | - Xi Chen
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - He Huang
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Mingliang Zeng
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Jingcheng Yu
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Xinqi Gong
- Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
| | - Qiwei Ye
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
|
15
|
Wijaya AJ, Anžel A, Richard H, Hattab G. Current state and future prospects of Horizontal Gene Transfer detection. NAR Genom Bioinform 2025; 7:lqaf005. [PMID: 39935761 PMCID: PMC11811736 DOI: 10.1093/nargab/lqaf005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2024] [Revised: 12/26/2024] [Accepted: 02/04/2025] [Indexed: 02/13/2025] Open
Abstract
Artificial intelligence (AI) has been shown to be beneficial in a wide range of bioinformatics applications. Horizontal Gene Transfer (HGT) is a driving force of evolutionary changes in prokaryotes. It is widely recognized that it contributes to the emergence of antimicrobial resistance (AMR), which poses a particularly serious threat to public health. Many computational approaches have been developed to study and detect HGT. However, the application of AI in this field has not been investigated. In this work, we conducted a review to provide information on the current trend of existing computational approaches for detecting HGT and to decipher the use of AI in this field. Here, we show a growing interest in HGT detection, characterized by a surge in the number of computational approaches, including AI-based approaches, in recent years. We organize existing computational approaches into a hierarchical structure of computational groups based on their computational methods and show how each computational group evolved. We make recommendations and discuss the challenges of HGT detection in general and the adoption of AI in particular. Moreover, we provide future directions for the field of HGT detection.
Affiliation(s)
- Andre Jatmiko Wijaya
- Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany
- Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Aleksandar Anžel
- Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Hugues Richard
- Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
| | - Georges Hattab
- Center for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany
- Department of Mathematics and Computer Science, Freie Universität, Arnimallee 14, 14195 Berlin, Germany
|
16
|
Rosati R, Romeo L, Vargas VM, Gutierrez PA, Frontoni E, Hervas-Martinez C. Learning Ordinal-Hierarchical Constraints for Deep Learning Classifiers. IEEE Trans Neural Netw Learn Syst 2025; 36:4765-4778. [PMID: 38347692 DOI: 10.1109/tnnls.2024.3360641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2024]
Abstract
Real-world classification problems may disclose different hierarchical levels where the categories are displayed in an ordinal structure. However, no specific deep learning (DL) models simultaneously learn hierarchical and ordinal constraints while improving generalization performance. To fill this gap, we propose the introduction of two novel ordinal-hierarchical DL methodologies, namely, the hierarchical cumulative link model (HCLM) and hierarchical-ordinal binary decomposition (HOBD), which are able to model the ordinal structure within different hierarchical levels of the labels. In particular, we decompose the hierarchical-ordinal problem into local and global graph paths that may encode an ordinal constraint for each hierarchical level. Thus, we frame this problem as simultaneously minimizing global and local losses. Furthermore, the ordinal constraints are set by two approaches [ordinal binary decomposition (OBD) and cumulative link model (CLM)] within each global and local function. The effectiveness of the proposed approach is measured on four real-use case datasets concerning industrial, biomedical, computer vision, and financial domains. The extracted results demonstrate a statistically significant improvement to state-of-the-art nominal, ordinal, and hierarchical approaches.
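As a rough illustration of the cumulative link idea named in this abstract, the sketch below turns one scalar score into probabilities over ordered classes using increasing thresholds. The logistic link, the function names, and the threshold values are assumptions made for illustration; the abstract does not specify the authors' HCLM formulation, and the hierarchical part is not modeled here.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, used here as the (assumed) link function."""
    return 1.0 / (1.0 + math.exp(-z))


def clm_probabilities(score: float, thresholds: list[float]) -> list[float]:
    """Class probabilities for len(thresholds) + 1 ordered classes.

    Cumulative probabilities are P(y <= k) = sigmoid(thresholds[k] - score);
    each class probability is the difference of consecutive cumulative values.
    """
    if any(a >= b for a, b in zip(thresholds, thresholds[1:])):
        raise ValueError("thresholds must be strictly increasing")
    cumulative = [sigmoid(t - score) for t in thresholds] + [1.0]
    probs = [cumulative[0]]
    probs += [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
    return probs


# Hypothetical example: three ordered classes, two thresholds.
print(clm_probabilities(0.0, [-1.0, 1.0]))
```

Because the thresholds are ordered, raising the score can only move probability mass toward higher classes, which is what encodes the ordinal constraint.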
|
17
|
Luo J, Luo Y. Learning maximally spanning representations improves protein function annotation. bioRxiv 2025:2025.02.13.638156. [PMID: 40027840 PMCID: PMC11870436 DOI: 10.1101/2025.02.13.638156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Automated protein function annotation is a fundamental problem in computational biology, crucial for understanding the functional roles of proteins in biological processes, with broad implications in medicine and biotechnology. A persistent challenge in this problem is the imbalanced, long-tail distribution of available function annotations: a small set of well-studied function classes account for most annotated proteins, while many other classes have few annotated proteins, often due to investigative bias, experimental limitations, or intrinsic biases in protein evolution. As a result, existing machine learning models for protein function prediction tend to only optimize the prediction accuracy for well-studied function classes overrepresented in the training data, leading to poor accuracy for understudied functions. In this work, we develop MSRep, a novel deep learning-based protein function annotation framework designed to address this imbalance issue and improve annotation accuracy. MSRep is inspired by an intriguing phenomenon, called neural collapse (NC), commonly observed in high-accuracy deep neural networks used for classification tasks, where hidden representations in the final layer collapse to class-specific mean embeddings, while maintaining maximal inter-class separation. Given that NC consistently emerges across diverse architectures and tasks for high-accuracy models, we hypothesize that inducing NC structure in models trained on imbalanced data can enhance both prediction accuracy and generalizability. To achieve this, MSRep refines a pre-trained protein language model to produce NC-like representations by optimizing an NC-inspired loss function, which ensures that minority functions are equally represented in the embedding space as majority functions, in contrast to conventional classification methods whose embedding spaces are dominated by overrepresented classes. 
In evaluations across four protein function annotation tasks on the prediction of Enzyme Commission numbers, Gene3D codes, Pfam families, and Gene Ontology terms, MSRep demonstrates superior predictive performance for both well- and underrepresented classes, outperforming several state-of-the-art annotation tools. We anticipate that MSRep will enhance the annotation of understudied functions and novel, uncharacterized proteins, advancing future protein function studies and accelerating the discovery of new functional proteins. The source code of MSRep is available at https://github.com/luo-group/MSRep.
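The "maximal inter-class separation" that characterizes neural collapse is usually described geometrically as a simplex equiangular tight frame: unit-norm class means whose pairwise cosine similarity is -1/(K-1) for K classes. The sketch below only constructs and checks that target geometry; it is not MSRep's loss function, which the abstract does not spell out, and the function names are hypothetical.

```python
import math


def simplex_etf(num_classes: int) -> list[list[float]]:
    """Return K unit vectors in R^K forming a simplex equiangular tight frame.

    Vector k is sqrt(K / (K - 1)) * (e_k - (1/K) * ones), so every pair of
    distinct vectors has inner product -1 / (K - 1).
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    scale = math.sqrt(num_classes / (num_classes - 1))
    return [
        [scale * ((1.0 if i == k else 0.0) - 1.0 / num_classes)
         for i in range(num_classes)]
        for k in range(num_classes)
    ]


def dot(u: list[float], v: list[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


means = simplex_etf(4)
print(dot(means[0], means[0]))  # squared norm of one class mean
print(dot(means[0], means[1]))  # inner product of two distinct class means
```

In this geometry every class, however rare, occupies a direction equally separated from all others, which matches the abstract's point that minority functions are represented on equal footing with majority ones.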
Affiliation(s)
- Jiaqi Luo
- School of Computational Science and Engineering, Georgia Institute of Technology
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology
|
18
|
Lee Y, Gao P, Xu Y, Wang Z, Li S, Chen J. MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network. Bioinformatics 2025; 41:btaf032. [PMID: 39847542 PMCID: PMC11810639 DOI: 10.1093/bioinformatics/btaf032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Revised: 01/13/2025] [Accepted: 01/21/2025] [Indexed: 01/25/2025] Open
Abstract
MOTIVATION: The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data. RESULTS: We introduce Multi-scalE Graph Adaptive neural network (MEGA-GO), highlighting the capability of capturing diverse protein sequence length features from multiple scales. The unique graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding 33.4%, 68.9%, and 44.6% of area under the precision-recall curve on biological process, molecular function, and cellular component domains, respectively. The rest of the experimental results reveal that our model consistently surpasses the state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION: The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.
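For readers unfamiliar with the area under the precision-recall curve reported above, the sketch below computes average precision, one common estimator of that area, for a single label from binary ground truth and predicted scores. The function name and toy data are hypothetical, tied scores are not handled specially, and the exact evaluation protocol used for MEGA-GO (for example, how Gene Ontology terms are averaged) is not stated in the abstract.

```python
def average_precision(labels: list[int], scores: list[float]) -> float:
    """Mean of the precision values measured at the rank of each positive.

    `labels` holds 0/1 ground truth and `scores` the predicted confidences.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("at least one positive label is required")
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_positive


# Toy example: positives ranked 1st and 3rd among four predictions.
print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```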
Affiliation(s)
- Yujian Lee
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Peng Gao
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Yongqi Xu
- Department of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510520, China
| | - Ziyang Wang
- Department of Science of Chinese Materia Medica, Guangdong Medical University, Dongguan 524023, China
| | - Shuaicheng Li
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Jiaxing Chen
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
|
19
|
Chen JY, Wang JF, Hu Y, Li XH, Qian YR, Song CL. Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Front Bioeng Biotechnol 2025; 13:1506508. [PMID: 39906415 PMCID: PMC11790633 DOI: 10.3389/fbioe.2025.1506508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2024] [Accepted: 01/02/2025] [Indexed: 02/06/2025] Open
Abstract
Protein function prediction is crucial in several key areas such as bioinformatics and drug design. With the rapid progress of deep learning technology, applying protein language models has become a research focus. These models utilize the increasing amount of large-scale protein sequence data to deeply mine its intrinsic semantic information, which can effectively improve the accuracy of protein function prediction. This review comprehensively surveys the current status of applying the latest protein language models in protein function prediction. It provides an exhaustive performance comparison with traditional prediction methods. Through the in-depth analysis of experimental results, the significant advantages of protein language models in enhancing the accuracy and depth of protein function prediction tasks are fully demonstrated.
Affiliation(s)
- Jia-Ying Chen
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Jing-Fu Wang
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Yue Hu
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Xin-Hui Li
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
| | - Yu-Rong Qian
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- School of Computer Science and Technology, Xinjiang University, Urumqi, China
| | - Chao-Lin Song
- School of Software, Xinjiang University, Urumqi, China
- Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
- Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
|
20
|
Chou JC, Dassama LMK. Lipid Trafficking in Diverse Bacteria. Acc Chem Res 2025; 58:36-46. [PMID: 39680024 PMCID: PMC11713862 DOI: 10.1021/acs.accounts.4c00540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2024] [Revised: 11/27/2024] [Accepted: 12/02/2024] [Indexed: 12/17/2024]
Abstract
Lipids are essential for life and serve as cell envelope components, signaling molecules, and nutrients. For lipids to achieve their required functions, they need to be correctly localized. This requires the action of transporter proteins and an energy source. The current understanding of bacterial lipid transporters is limited to a few classes. Given the diversity of lipid species and the predicted existence of specific lipid transporters, many more transporters await discovery and characterization. These proteins could be prime targets for modulators that control bacterial cell proliferation and pathogenesis. One overarching goal of our research is to understand the molecular mechanisms of bacterial metabolite trafficking, including lipids, and to leverage that understanding to identify or engineer inhibitory ligands. In recent years, our work has revealed two novel lipid transport systems in bacteria: bacterial sterol transporters (Bst) A, B, and C in Methylococcus capsulatus and the TatT proteins in Enhygromyxa salina and Treponema pallidum. Both systems are composed of transporters bioinformatically identified as being involved in the transport of other metabolites, but substrates were never revealed. However, the genetic colocalization of the genes encoding BstABC with sterol biosynthetic enzymes in M. capsulatus suggested that they might recognize sterols as substrates. Also, homologues of TatTs are present in diverse bacteria but are overrepresented in bacteria deficient in de novo lipid synthesis or residing in nutrient-poor environments; we reasoned that these proteins might facilitate the transport of lipids. Our efforts to reveal the substrate scope of two TatT proteins revealed their engagement with long-chain fatty acids. 
Enabling the discovery of the BstABC system and the TatT proteins were bioinformatic analyses, quantitative measurements of protein-ligand equilibrium affinities, and high-resolution structural studies that provided remarkable insights into ligand binding cavities and the structural basis for ligand interaction. These approaches, in particular our bioinformatics and structural work, highlighted the diversity of protein sequence and structures amenable to lipid engagement. These observations allowed the hypothesis that lipid handling proteins, in general and especially so in the bacterial domain, can have diverse amino acid compositions and three-dimensional structures. As such, bioinformatics geared at identifying them in poorly characterized genomes is likely to miss many candidates that diverge from well-characterized family members. This realization spurred efforts to understand the unifying features in all of the lipid handling proteins we have characterized to date. To do this, we inspected the ligand binding sites of the proteins: they were remarkably hydrophobic and sometimes displayed a dichotomy of hydrophobic and hydrophilic amino acids, akin to the ligands that they accommodate in those cavities. Because of this, we reasoned that the physicochemical features of ligand binding cavities could be accurate predictors of a protein's propensity to bind lipids. This finding was leveraged to create structure-based lipid-interacting pocket predictor (SLiPP), a machine-learning algorithm capable of identifying ligand cavities with physico-chemical features consistent with those of known lipid binding sites. SLiPP is especially useful in poorly annotated genomes (such as with bacterial pathogens), where it could reveal candidate proteins to be targeted for the development of antimicrobials.
Affiliation(s)
- Jonathan Chiu-Chun Chou
- Department of Chemistry and Sarafan ChEM-H Institute, Stanford University, Stanford, California 94305, United States
| | - Laura M. K. Dassama
- Department of Chemistry and Sarafan ChEM-H Institute, Stanford University, Stanford, California 94305, United States
- Department of Microbiology and Immunology, Stanford School of Medicine, Stanford, California 94305, United States
|
21
|
Wang W, Shuai Y, Zeng M, Fan W, Li M. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information. Nat Commun 2025; 16:70. [PMID: 39746897 PMCID: PMC11697396 DOI: 10.1038/s41467-024-54816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 11/21/2024] [Indexed: 01/04/2025] Open
Abstract
Computational methods for predicting protein function are of great significance in understanding biological mechanisms and treating complex diseases. However, existing computational approaches of protein function prediction lack interpretability, making it difficult to understand the relations between protein structures and functions. In this study, we propose a deep learning-based solution, named DPFunc, for accurate protein function prediction with domain-guided structure information. DPFunc can detect significant regions in protein structures and accurately predict corresponding functions under the guidance of domain information. It outperforms current state-of-the-art methods and achieves a significant improvement over existing structure-based methods. Detailed analyses demonstrate that the guidance of domain information contributes to DPFunc for protein function prediction, enabling our method to detect key residues or regions in protein structures, which are closely related to their functions. In summary, DPFunc serves as an effective tool for large-scale protein function prediction, which pushes the border of protein understanding in biological systems.
Affiliation(s)
- Wenkang Wang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Yunyan Shuai
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
| | - Wei Fan
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, OX39DU, UK
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
|
22
|
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2025; 25:e2300471. [PMID: 38996351 PMCID: PMC11735672 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods have been developed to advance protein function prediction gradually in the last two decades. Particularly, in recent years, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments of deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to assist the machine learning, AI, and bioinformatics communities to develop more cutting-edge methods to advance protein function prediction.
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Ahhyun Lee
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
|
23
|
Yan R, Islam MT, Xing L. Deep representation learning of protein-protein interaction networks for enhanced pattern discovery. Sci Adv 2024; 10:eadq4324. [PMID: 39693438 DOI: 10.1126/sciadv.adq4324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 11/14/2024] [Indexed: 12/20/2024]
Abstract
Protein-protein interaction (PPI) networks, where nodes represent proteins and edges depict myriad interactions among them, are fundamental to understanding the dynamics within biological systems. Despite their pivotal role in modern biology, reliably discerning patterns from these intertwined networks remains a substantial challenge. The essence of the challenge lies in holistically characterizing the relationships of each node with others in the network and effectively using this information for accurate pattern discovery. In this work, we introduce a self-supervised network embedding framework termed discriminative network embedding (DNE). Unlike conventional methods that primarily focus on direct or limited-order node proximity, DNE characterizes a node both locally and globally by harnessing the contrast between representations from neighboring and distant nodes. Our experimental results demonstrate DNE's superior performance over existing techniques across various critical network analyses, including PPI inference and the identification of protein functional modules. DNE emerges as a robust strategy for node representation in PPI networks, offering promising avenues for diverse biomedical applications.
Affiliation(s)
- Rui Yan
- Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA
| | - Md Tauhidul Islam
- Department of Radiation Oncology, Stanford University, Stanford, CA 94305, USA
| | - Lei Xing
- Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA 94305, USA
- Department of Radiation Oncology, Stanford University, Stanford, CA 94305, USA
- Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
|
24
|
Crawford J, Chikina M, Greene CS. Best holdout assessment is sufficient for cancer transcriptomic model selection. PATTERNS (NEW YORK, N.Y.) 2024; 5:101115. [PMID: 39776849 PMCID: PMC11701843 DOI: 10.1016/j.patter.2024.101115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 08/01/2024] [Accepted: 11/13/2024] [Indexed: 01/11/2025]
Abstract
Guidelines in statistical modeling for genomics hold that simpler models have advantages over more complex ones. Potential advantages include cost, interpretability, and improved generalization across datasets or biological contexts. We directly tested the assumption that small gene signatures generalize better by examining the generalization of mutation status prediction models across datasets (from cell lines to human tumors and vice versa) and biological contexts (holding out entire cancer types from pan-cancer data). We compared model selection between solely cross-validation performance and combining cross-validation performance with regularization strength. We did not observe that more regularized signatures generalized better. This result held across both generalization problems and for both linear models (LASSO logistic regression) and non-linear ones (neural networks). When the goal of an analysis is to produce generalizable predictive models, we recommend choosing the ones that perform best on held-out data or in cross-validation instead of those that are smaller or more regularized.
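The two model-selection strategies contrasted in this abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the candidate scores are made-up numbers, the function names are hypothetical, and the tolerance rule is only one possible way of combining cross-validation performance with regularization strength, not necessarily the rule used in the article.

```python
# Hypothetical candidates: each pairs a regularization strength with a
# mean cross-validation score. The numbers are invented for illustration.
candidates = [
    {"reg_strength": 0.001, "cv_score": 0.80},
    {"reg_strength": 0.01, "cv_score": 0.84},
    {"reg_strength": 0.1, "cv_score": 0.83},
    {"reg_strength": 1.0, "cv_score": 0.78},
]


def select_best_cv(models):
    """Strategy 1: pick the model with the highest cross-validation score."""
    return max(models, key=lambda m: m["cv_score"])


def select_most_regularized_within(models, tolerance):
    """Strategy 2 (assumed rule): among models whose score is within
    `tolerance` of the best score, pick the most regularized one."""
    best_score = select_best_cv(models)["cv_score"]
    eligible = [m for m in models if m["cv_score"] >= best_score - tolerance]
    return max(eligible, key=lambda m: m["reg_strength"])


chosen_by_cv = select_best_cv(candidates)
chosen_by_parsimony = select_most_regularized_within(candidates, tolerance=0.02)
```

The abstract's recommendation corresponds to the first function: when generalization is the goal, choose by held-out or cross-validation performance rather than preferring the smaller or more regularized model.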
Affiliation(s)
- Jake Crawford
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Maria Chikina
- Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
| | - Casey S. Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, USA
- Center for Health AI, University of Colorado School of Medicine, Aurora, CO, USA
|
25
|
Soleymani F, Paquet E, Viktor HL, Michalowski W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2779-2797. [PMID: 39050782 PMCID: PMC11268121 DOI: 10.1016/j.csbj.2024.06.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/13/2024] [Accepted: 06/18/2024] [Indexed: 07/27/2024] Open
Abstract
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acquisition. In the field of de novo protein design, the goal is to create entirely novel proteins with predetermined structures. Given the arbitrary positions of proteins in 3-D space, graph representations and their properties are widely used in protein generation studies. A critical requirement in protein modelling is maintaining spatial relationships under transformations (rotations, translations, and reflections). This property, known as equivariance, ensures that predicted protein characteristics adapt seamlessly to changes in orientation or position. Equivariant graph neural networks offer a solution to this challenge. By incorporating equivariant graph neural networks to learn the score of the probability density function in diffusion models, one can generate proteins with robust 3-D structural representations. This review examines the latest deep learning advancements, specifically focusing on frameworks that combine diffusion models with equivariant graph neural networks for protein generation.
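The equivariance property described in this abstract can be checked numerically with a minimal sketch. The update rule below follows the general shape of an equivariant coordinate update (each point moves along displacement vectors weighted by a function of distance only), but the fixed weight function, the step size, and all names are assumptions chosen for illustration, and 2-D points are used instead of 3-D for brevity.

```python
import math


def egnn_style_update(points, step=0.1):
    """Shift each point along its displacements to the other points.

    The weight depends only on squared distance, which is invariant to
    rotations, translations, and reflections, so the update is equivariant.
    """
    updated = []
    for i, xi in enumerate(points):
        shift = [0.0] * len(xi)
        for j, xj in enumerate(points):
            if i == j:
                continue
            diff = [a - b for a, b in zip(xi, xj)]
            sq_dist = sum(d * d for d in diff)
            weight = 1.0 / (1.0 + sq_dist)  # illustrative stand-in for a learned function
            shift = [s + weight * d for s, d in zip(shift, diff)]
        updated.append([a + step * s for a, s in zip(xi, shift)])
    return updated


def transform(points, angle, translation, reflect=False):
    """Apply an optional reflection, then a rotation, then a translation."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y in points:
        if reflect:
            x = -x
        out.append([c * x - s * y + translation[0], s * x + c * y + translation[1]])
    return out


def max_abs_diff(a, b):
    """Largest coordinate-wise difference between two point lists."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))


points = [[0.0, 0.0], [1.0, 0.5], [-0.3, 2.0], [2.0, -1.0]]
angle, offset = 0.7, (3.0, -2.0)

update_then_transform = transform(egnn_style_update(points), angle, offset, reflect=True)
transform_then_update = egnn_style_update(transform(points, angle, offset, reflect=True))
equivariance_error = max_abs_diff(update_then_transform, transform_then_update)
```

If the update is equivariant, transforming the input and then updating gives the same coordinates as updating and then transforming, so `equivariance_error` should differ from zero only by floating-point rounding.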
Affiliation(s)
- Farzan Soleymani
- Telfer School of Management, University of Ottawa, ON, K1N 6N5, Canada
| | - Eric Paquet
- National Research Council, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
- School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
| | - Herna Lydia Viktor
- School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
|
26
|
Shi W, Zhang Y, Sun Y, Lin Z. Function-Genes and Disease-Genes Prediction Based on Network Embedding and One-Class Classification. Interdiscip Sci 2024; 16:781-801. [PMID: 39230798 DOI: 10.1007/s12539-024-00638-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 05/14/2024] [Accepted: 05/21/2024] [Indexed: 09/05/2024]
Abstract
Genes that have been experimentally validated for diseases (or functions) can be used to train machine learning methods that predict new disease/function-genes. However, the prediction of both function-genes and disease-genes faces the same problem: there are only certain positive examples, but no negative examples. To solve this problem, we propose VGAEMCD, a function/disease-gene prediction algorithm based on network embedding (Variational Graph Auto-Encoders, VGAE) and one-class classification (Fast Minimum Covariance Determinant, Fast-MCD). First, we constructed a protein-protein interaction (PPI) network centered on experimentally validated genes; then VGAE was used to obtain the embeddings of nodes (genes) in the network; finally, the embeddings were input into the improved deep learning one-class classifier based on Fast-MCD to predict function/disease-genes. VGAEMCD can predict function-genes and disease-genes in a unified way, and only the experimentally verified genes need to be provided (no expression profile is required). VGAEMCD outperforms classical one-class classification algorithms in Recall, Precision, F-measure, Specificity, and Accuracy. Further experiments show that seven metrics of VGAEMCD are higher than those of state-of-the-art function/disease-gene prediction algorithms. These results indicate that VGAEMCD learns the distribution characteristics of positive examples well and accurately identifies function/disease-genes.
Affiliation(s)
- Weiyu Shi
- College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
| | - Yan Zhang
- Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
| | - Yeqing Sun
- Institute of Environmental Systems Biology, College of Environmental Science and Engineering, Dalian Maritime University, Dalian, 116026, China
| | - Zhengkui Lin
- College of Maritime Economics and Management, Dalian Maritime University, Dalian, 116026, China
|
27
|
Xiang W, Xiong Z, Chen H, Xiong J, Zhang W, Fu Z, Zheng M, Liu B, Shi Q. FAPM: functional annotation of proteins using multimodal models beyond structural modeling. Bioinformatics 2024; 40:btae680. [PMID: 39540736 PMCID: PMC11630832 DOI: 10.1093/bioinformatics/btae680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 10/12/2024] [Accepted: 11/12/2024] [Indexed: 11/16/2024] Open
Abstract
MOTIVATION Assigning accurate property labels to proteins, like functional terms and catalytic activity, is challenging, especially for proteins without homologs and "tail labels" with few known examples. Previous methods mainly focused on protein sequence features, overlooking the semantic meaning of protein labels. RESULTS We introduce functional annotation of proteins using multimodal models (FAPM), a contrastive multimodal model that links natural language with protein sequence language. This model combines a pretrained protein sequence model with a pretrained large language model to generate labels, such as Gene Ontology (GO) functional terms and catalytic activity predictions, in natural language. Our results show that FAPM excels in understanding protein properties, outperforming models based solely on protein sequences or structures. It achieves state-of-the-art performance on public benchmarks and in-house experimentally annotated phage proteins, which often have few known homologs. Additionally, FAPM's flexibility allows it to incorporate extra text prompts, like taxonomy information, enhancing both its predictive performance and explainability. This novel approach offers a promising alternative to current methods that rely on multiple sequence alignment for protein annotation. AVAILABILITY AND IMPLEMENTATION The online demo is at: https://huggingface.co/spaces/wenkai/FAPM_demo.
Affiliation(s)
- Wenkai Xiang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Lingang Laboratory, Shanghai 200031, China
| | | | - Huan Chen
- BioBank, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an 710061, China
| | - Jiacheng Xiong
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Wei Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zunyun Fu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
| | - Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai 201203, China
- Lingang Laboratory, Shanghai 200031, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Bing Liu
- BioBank, The First Affiliated Hospital of Xi’an Jiaotong University, Xi’an 710061, China
| | - Qian Shi
- Lingang Laboratory, Shanghai 200031, China
|
28
|
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] Open
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we introduce the fundamental concepts of graph embedding algorithms and describe graph representation learning methods for protein function prediction based on four principal data categories, namely PPI network, protein structure, Gene Ontology graph, and integrated graph. The commonly used approaches for each category are summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions are discussed, and directions for future research within the protein research community are suggested.
Affiliation(s)
- Thi Thuy Duong Vu
- Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
| | - Jeongho Kim
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
| | - Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
|
29
|
Liu Q, Zhang C, Freddolino L. InterLabelGO+: unraveling label correlations in protein function prediction. Bioinformatics 2024; 40:btae655. [PMID: 39499152 PMCID: PMC11568131 DOI: 10.1093/bioinformatics/btae655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/07/2024] [Revised: 10/07/2024] [Accepted: 11/01/2024] [Indexed: 11/07/2024] Open
Abstract
MOTIVATION Accurate protein function prediction is crucial for understanding biological processes and advancing biomedical research. However, the rapid growth of protein sequences far outpaces the experimental characterization of their functions, necessitating the development of automated computational methods. RESULTS We present InterLabelGO+, a hybrid approach that integrates a deep learning-based method with an alignment-based method for improved protein function prediction. InterLabelGO+ incorporates a novel loss function that addresses label dependency and imbalance and further enhances performance through dynamic weighting of the alignment-based component. A preliminary version of InterLabelGO+ achieved a strong performance in the CAFA5 challenge, ranking sixth out of 1625 participating teams. Comprehensive evaluations on large-scale protein function prediction tasks demonstrate InterLabelGO+'s ability to accurately predict Gene Ontology terms across various functional categories and evaluation metrics. AVAILABILITY AND IMPLEMENTATION The source code and datasets for InterLabelGO+ are freely available on GitHub at https://github.com/QuanEvans/InterLabelGO. A web-server is available at https://seq2fun.dcmb.med.umich.edu/InterLabelGO/. The software is implemented in Python and PyTorch, and is supported on Linux and macOS.
Affiliation(s)
- Quancheng Liu
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
| | - Lydia Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA
|
30
|
Kumar V, Deepak A, Ranjan A, Prakash A. CrossPredGO: A Novel Light-Weight Cross-Modal Multi-Attention Framework for Protein Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1709-1720. [PMID: 38843056 DOI: 10.1109/tcbb.2024.3410696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/27/2024]
Abstract
Proteins are represented in various ways, each contributing differently to protein-related tasks. Here, information from each representation (protein sequence, 3D structure, and interaction data) is combined for an efficient protein function prediction task. Recently, uni-modal approaches have produced promising results with state-of-the-art attention mechanisms that learn the relative importance of features, whereas multi-modal approaches have produced promising results by simply concatenating the features obtained computationally from different representations, which increases the overall number of trainable parameters. In this paper, we propose a novel, light-weight cross-modal multi-attention (CrMoMulAtt) mechanism that captures the relative contribution of each modality with a lower number of trainable parameters. The proposed mechanism shows a higher contribution from PPI and a lower contribution from structure data. The results obtained from the proposed CrossPredGO mechanism demonstrate an increment in the range of +(3.29 to 7.20)% with up to 31% fewer trainable parameters compared with DeepGO and MultiPredGO.
|
31
|
Kumar V, Deepak A, Ranjan A, Prakash A. Bi-SeqCNN: A Novel Light-Weight Bi-Directional CNN Architecture for Protein Function Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1922-1933. [PMID: 38990747 DOI: 10.1109/tcbb.2024.3426491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
Deep learning approaches, such as convolutional neural networks (CNNs) and deep recurrent neural networks (RNNs), have been the backbone for predicting protein function, with promising state-of-the-art (SOTA) results. RNNs offer a strong sequential processing mechanism through their in-built ability to (i) focus on past information, (ii) collect both short- and long-range dependency information, and (iii) process sequences bi-directionally. CNNs, however, are confined to focusing on short-term information from both the past and the future, although they offer parallelism. Therefore, a novel bi-directional CNN that strictly complies with the sequential processing mechanism of RNNs is introduced and is used for developing a protein function prediction framework, Bi-SeqCNN. This is a sub-sequence-based framework. Further, Bi-SeqCNN is an ensemble approach to improve the prediction results. To our knowledge, this is the first time bi-directional CNNs are employed for general temporal data analysis and not just for protein sequences. The proposed architecture produces improvements of up to +5.5% over contemporary SOTA methods on three benchmark protein sequence datasets. Moreover, it is substantially lighter and attains these results with (0.50-0.70 times) fewer parameters than the SOTA methods.
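One way to read "a bi-directional CNN that complies with the sequential processing mechanism of RNNs" is as a pair of causal convolutions, one run left-to-right and one run right-to-left. The sketch below illustrates that reading only; it is an assumption and not the Bi-SeqCNN architecture itself, and the function names and toy kernels are hypothetical.

```python
def causal_conv1d(sequence, kernel):
    """1-D convolution whose output at position t uses only inputs up to t.

    The sequence is zero-padded on the left, and the last kernel weight
    multiplies the current position.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(sequence)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(sequence))
    ]


def bidirectional_causal_conv1d(sequence, forward_kernel, backward_kernel):
    """Pair a left-to-right causal pass with a right-to-left causal pass."""
    forward = causal_conv1d(sequence, forward_kernel)
    # Reverse, convolve causally, and reverse back, so that position t
    # sees only inputs at positions t and later.
    backward = causal_conv1d(sequence[::-1], backward_kernel)[::-1]
    return list(zip(forward, backward))


features = bidirectional_causal_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0], [1.0, 1.0])
```

With this construction, the forward feature at a position never depends on later residues and the backward feature never depends on earlier ones, which mirrors how a bidirectional RNN separates its two directions while keeping the parallelism of convolution.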
|
32
|
Taha K. Employing Machine Learning Techniques to Detect Protein Function: A Survey, Experimental, and Empirical Evaluations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:1965-1986. [PMID: 39008392 DOI: 10.1109/tcbb.2024.3427381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
This review article delves deeply into the various machine learning (ML) methods and algorithms employed in discerning protein functions. Each method discussed is assessed for its efficacy, limitations, potential improvements, and future prospects. We present an innovative hierarchical classification system that arranges algorithms into intricate categories and unique techniques. This taxonomy is based on a tri-level hierarchy, starting with the methodology category and narrowing down to specific techniques. Such a framework allows for a structured and comprehensive classification of algorithms, assisting researchers in understanding the interrelationships among diverse algorithms and techniques. The study incorporates both empirical and experimental evaluations to differentiate between the techniques. The empirical evaluation ranks the techniques based on four criteria. The experimental assessments rank: (1) individual techniques under the same methodology sub-category, (2) different sub-categories within the same category, and (3) the broad categories themselves. Integrating the innovative methodological classification, empirical findings, and experimental assessments, the article offers a well-rounded understanding of ML strategies in protein function identification. The paper also explores techniques for multi-task and multi-label detection of protein functions, in addition to focusing on single-task methods. Moreover, the paper sheds light on the future avenues of ML in protein function determination.
|
33
|
Li L, Dannenfelser R, Cruz C, Yao V. A best-match approach for gene set analyses in embedding spaces. Genome Res 2024; 34:1421-1433. [PMID: 39231608 PMCID: PMC11529866 DOI: 10.1101/gr.279141.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 08/29/2024] [Indexed: 09/06/2024]
Abstract
Embedding methods have emerged as a valuable class of approaches for distilling essential information from complex high-dimensional data into more accessible lower-dimensional spaces. Applications of embedding methods to biological data have demonstrated that gene embeddings can effectively capture physical, structural, and functional relationships between genes. However, this utility has been primarily realized by using gene embeddings for downstream machine-learning tasks. Much less has been done to examine the embeddings directly, especially analyses of gene sets in embedding spaces. Here, we propose an Algorithm for Network Data Embedding and Similarity (ANDES), a novel best-match approach that can be used with existing gene embeddings to compare gene sets while reconciling gene set diversity. This intuitive method has important downstream implications for improving the utility of embedding spaces for various tasks. Specifically, we show how ANDES, when applied to different gene embeddings encoding protein-protein interactions, can be used as a novel overrepresentation- and rank-based gene set enrichment analysis method that achieves state-of-the-art performance. Additionally, ANDES can use multiorganism joint gene embeddings to facilitate functional knowledge transfer across organisms, allowing for phenotype mapping across model systems. Our flexible, straightforward best-match methodology can be extended to other embedding spaces with diverse community structures between set elements.
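The best-match idea described in this abstract can be sketched as follows: for each gene in one set, take its highest similarity to any gene in the other set, average those best matches, and do the same in the reverse direction. This is a simplified illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the toy embeddings and function names are hypothetical, and any normalization or significance estimation that the published ANDES method may perform is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def best_match_similarity(set_a, set_b, embedding):
    """Symmetric best-match similarity between two gene sets."""

    def directed(source, target):
        # For every gene in `source`, keep only its best match in `target`.
        best = [
            max(cosine(embedding[g], embedding[h]) for h in target)
            for g in source
        ]
        return sum(best) / len(best)

    return 0.5 * (directed(set_a, set_b) + directed(set_b, set_a))


# Toy 2-D gene embeddings, invented for illustration.
embedding = {
    "g1": [1.0, 0.0],
    "g2": [0.0, 1.0],
    "g3": [1.0, 0.1],
    "g4": [0.1, 1.0],
    "g5": [-1.0, 0.0],
}

similar_sets = best_match_similarity(["g1", "g2"], ["g3", "g4"], embedding)
dissimilar_sets = best_match_similarity(["g1", "g2"], ["g5"], embedding)
```

Because each gene is scored only against its best counterpart, two sets that each contain functionally diverse genes can still score highly as long as every gene finds a close match, which is the diversity-reconciling behavior the abstract describes.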
Affiliation(s)
- Lechuan Li
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Ruth Dannenfelser
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Charlie Cruz
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
| | - Vicky Yao
- Department of Computer Science, Rice University, Houston, Texas 77005, USA
|
34
|
Langschied F, Bordin N, Cosentino S, Fuentes-Palacios D, Glover N, Hiller M, Hu Y, Huerta-Cepas J, Coelho LP, Iwasaki W, Majidian S, Manzano-Morales S, Persson E, Richards TA, Gabaldón T, Sonnhammer E, Thomas PD, Dessimoz C, Ebersberger I. Quest for Orthologs in the Era of Biodiversity Genomics. Genome Biol Evol 2024; 16:evae224. [PMID: 39404012 PMCID: PMC11523110 DOI: 10.1093/gbe/evae224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/11/2024] [Indexed: 11/01/2024] Open
Abstract
The era of biodiversity genomics is characterized by large-scale genome sequencing efforts that aim to represent each living taxon with an assembled genome. Generating knowledge from this wealth of data has not kept up with this pace. We here discuss major challenges to integrating these novel genomes into a comprehensive functional and evolutionary network spanning the tree of life. In summary, the expanding datasets create a need for scalable gene annotation methods. To trace gene function across species, new methods must seek to increase the resolution of ortholog analyses, e.g. by extending analyses to the protein domain level and by accounting for alternative splicing. Additionally, the scope of orthology prediction should be pushed beyond well-investigated proteomes. This demands the development of specialized methods for the identification of orthologs to short proteins and noncoding RNAs and for the functional characterization of novel gene families. Furthermore, protein structures predicted by machine learning are now readily available, but this new information is yet to be integrated with orthology-based analyses. Finally, an increasing focus should be placed on making orthology assignments adhere to the findable, accessible, interoperable, and reusable (FAIR) principles. This fosters green bioinformatics by avoiding redundant computations and helps integrate diverse scientific communities that share the need for comparative genetics and genomics information. It should also help with communicating orthology-related concepts in a format that is accessible to the public, to counteract existing misinformation about evolution.
Affiliation(s)
- Felix Langschied
- Department for Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, London, UK
| | - Salvatore Cosentino
- Department of Integrated Biosciences, The University of Tokyo, 277-0882 Tokyo, Japan
| | - Diego Fuentes-Palacios
- Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
| | - Natasha Glover
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Michael Hiller
- Department of Comparative Genomics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany
| | - Yanhui Hu
- Department of Genetics, Harvard Medical School, Boston, MA 02115, USA
- Drosophila RNAi Screening Center, Harvard Medical School, Boston, MA 02115, USA
| | - Jaime Huerta-Cepas
- Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC), Campus de Montegancedo-UPM, Madrid, Spain
| | - Luis Pedro Coelho
- Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology, Translational Research Institute, Woolloongabba, Queensland, Australia
| | - Wataru Iwasaki
- Department of Integrated Biosciences, University of Tokyo, 277-0882 Tokyo, Japan
| | - Sina Majidian
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Saioa Manzano-Morales
- Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
| | - Emma Persson
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Solna, Sweden
| | | | - Toni Gabaldón
- Barcelona Supercomputing Center (BSC-CNS), 08034 Barcelona, Spain
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, 08028 Barcelona, Spain
- Catalan Institution for Research and Advanced Studies (ICREA), Barcelona, Spain
- CIBER de Enfermedades Infecciosas, Instituto de Salud Carlos III, Madrid, Spain
| | - Erik Sonnhammer
- Department of Biochemistry and Biophysics, Stockholm University, Science for Life Laboratory, Solna, Sweden
| | - Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA, USA
| | - Christophe Dessimoz
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland
| | - Ingo Ebersberger
- Department for Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University, Frankfurt, Germany
- LOEWE Centre for Translational Biodiversity Genomics, 60325 Frankfurt, Germany
- Senckenberg Biodiversity and Climate Research Centre (S-BIK-F), Frankfurt am Main, Germany
|
35
|
Meng L, Wang X. TAWFN: a deep learning framework for protein function prediction. Bioinformatics 2024; 40:btae571. [PMID: 39312678 PMCID: PMC11639667 DOI: 10.1093/bioinformatics/btae571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 08/27/2024] [Accepted: 09/19/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION Proteins play pivotal roles in biological systems, and precise prediction of their functions is indispensable for practical applications. Despite the surge in protein sequence data facilitated by high-throughput techniques, unraveling the exact functionalities of proteins still demands considerable time and resources. Currently, numerous methods rely on protein sequences for prediction, while methods targeting protein structures are scarce, often employing convolutional neural networks (CNNs) or graph convolutional networks (GCNs) individually. RESULTS To address these challenges, our approach starts from protein structures and proposes a method that combines CNN and GCN into a unified framework called the two-model adaptive weight fusion network (TAWFN) for protein function prediction. First, amino acid contact maps and sequences are extracted from the protein structure. Then, the sequence is used to generate one-hot encoded features and deep semantic features. These features, along with the constructed graph, are fed into the adaptive graph convolutional networks (AGCN) module and the multi-layer convolutional neural network (MCNN) module as needed, resulting in preliminary classification outcomes. Finally, the preliminary classification results are fed into the adaptive weight computation network, where adaptive weights are calculated to fuse the initial predictions from both networks, yielding the final prediction result. To evaluate the effectiveness of our method, experiments were conducted on the PDBset and AFset datasets. For molecular function, biological process, and cellular component tasks, TAWFN achieved area under the precision-recall curve (AUPR) values of 0.718, 0.385, and 0.488 respectively, with corresponding Fmax scores of 0.762, 0.628, and 0.693, and Smin scores of 0.326, 0.483, and 0.454. The experimental results demonstrate that TAWFN exhibits promising performance, outperforming existing methods.
AVAILABILITY AND IMPLEMENTATION The TAWFN source code can be found at: https://github.com/ss0830/TAWFN.
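The fusion step described in the TAWFN abstract (two preliminary predictions combined through adaptive weights) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the `weight_logits` input are hypothetical stand-ins for the output of the adaptive weight computation network, and the softmax normalisation of the weights is an assumption made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_predictions(p_gcn, p_cnn, weight_logits):
    """Fuse two per-label probability vectors with normalised weights.

    weight_logits is a hypothetical stand-in for what a learned weight
    network would emit: one (gcn_logit, cnn_logit) pair per label.
    """
    fused = []
    for pg, pc, logits in zip(p_gcn, p_cnn, weight_logits):
        w_gcn, w_cnn = softmax(list(logits))
        # Convex combination, so the result stays between the two inputs.
        fused.append(w_gcn * pg + w_cnn * pc)
    return fused

# Toy values for three labels (illustrative only, not from the paper).
p_gcn = [0.9, 0.2, 0.6]
p_cnn = [0.7, 0.4, 0.1]
weight_logits = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
fused = fuse_predictions(p_gcn, p_cnn, weight_logits)
```

Because the weights for each label sum to one, the fused score always lies between the two preliminary scores; equal logits reduce to a plain average.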
Affiliation(s)
- Lu Meng
  - College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
- Xiaoran Wang
  - College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
|
36
|
Qiao B, Wang S, Hou M, Chen H, Zhou Z, Xie X, Pang S, Yang C, Yang F, Zou Q, Sun S. Identifying nucleotide-binding leucine-rich repeat receptor and pathogen effector pairing using transfer-learning and bilinear attention network. Bioinformatics 2024; 40:btae581. [PMID: 39331576 PMCID: PMC11969219 DOI: 10.1093/bioinformatics/btae581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 08/24/2024] [Accepted: 09/25/2024] [Indexed: 09/29/2024] Open
Abstract
MOTIVATION Nucleotide-binding leucine-rich repeat (NLR) family is a class of immune receptors capable of detecting and defending against pathogen invasion. They have been widely used in crop breeding. Notably, the correspondence between NLRs and effectors (CNE) determines the applicability and effectiveness of NLRs. Unfortunately, CNE data is very scarce. In fact, we've found a substantial 91 291 NLRs confirmed via wet experiments and bioinformatics methods but only 387 CNEs are recognized, which greatly restricts the potential application of NLRs. RESULTS We propose a deep learning algorithm called ProNEP to identify NLR-effector pairs in a high-throughput manner. Specifically, we conceptualized the CNE prediction task as a protein-protein interaction (PPI) prediction task. Then, ProNEP predicts the interaction between NLRs and effectors by combining the transfer learning with a bilinear attention network. ProNEP achieves superior performance against state-of-the-art models designed for PPI predictions. Based on ProNEP, we conduct extensive identification of potential CNEs for 91 291 NLRs. With the rapid accumulation of genomic data, we expect that this tool will be widely used to predict CNEs in new species, advancing biology, immunology, and breeding. AVAILABILITY AND IMPLEMENTATION The ProNEP is available at http://nerrd.cn/#/prediction. The project code is available at https://github.com/QiaoYJYJ/ProNEP.
Affiliation(s)
- Baixue Qiao
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
- Shuda Wang
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
- Mingjun Hou
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Haodi Chen
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Zhengwenyang Zhou
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Xueying Xie
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Shaozi Pang
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
- Chunxue Yang
  - College of Landscape Architecture, Northeast Forestry University, Harbin 150001, China
- Fenglong Yang
  - Department of Bioinformatics, Fujian Key Laboratory of Medical Bioinformatics, School of Medical Technology and Engineering, Fujian Medical University, Fuzhou 350122, China
  - Key Laboratory of Ministry of Education for Gastrointestinal Cancer, School of Basic Medical Sciences, Fujian Medical University, Fuzhou 350122, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
- Shanwen Sun
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), Harbin 150001, China
  - State Key Laboratory of Tree Genetics and Breeding, Northeast Forestry University, Harbin 150001, China
|
37
|
Meher PK, Pradhan UK, Sethi PL, Naha S, Gupta A, Parsad R. PredPSP: a novel computational tool to discover pathway-specific photosynthetic proteins in plants. Plant Mol Biol 2024; 114:106. [PMID: 39316155 DOI: 10.1007/s11103-024-01500-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/16/2024] [Accepted: 09/04/2024] [Indexed: 09/25/2024]
Abstract
Photosynthetic proteins play a crucial role in agricultural productivity by harnessing light energy for plant growth. Understanding these proteins, especially within C3 and C4 pathways, holds promise for improving crops in challenging environments. Despite existing models, a comprehensive computational framework specifically targeting plant photosynthetic proteins is lacking. The underutilization of plant datasets in computational algorithms accentuates the gap this study aims to fill by introducing a novel sequence-based computational method for identifying these proteins. The scope of this study encompassed diverse plant species, ensuring comprehensive representation across C3 and C4 pathways. Utilizing six deep learning models and seven shallow learning algorithms, paired with six sequence-derived feature sets followed by a feature selection strategy, this study developed a comprehensive model for prediction of plant-specific photosynthetic proteins. Following 5-fold cross-validation analysis, LightGBM with 65 and 90 LGBM-VIM selected features respectively emerged as the best models for C3 (auROC: 91.78%, auPRC: 92.55%) and C4 (auROC: 99.05%, auPRC: 99.18%) plants. Validation using an independent dataset confirmed the robustness of the proposed model for both C3 (auROC: 87.23%, auPRC: 88.40%) and C4 (auROC: 92.83%, auPRC: 92.29%) categories. Comparison with existing methods demonstrated the superiority of the proposed model in predicting plant-specific photosynthetic proteins. This study further established a free online prediction server, PredPSP (https://iasri-sg.icar.gov.in/predpsp/), to facilitate ongoing efforts for identifying photosynthetic proteins in C3 and C4 plants. Being the first of its kind, this study offers valuable insights into predicting plant-specific photosynthetic proteins, which holds significant implications for plant biology.
Affiliation(s)
- Prabina Kumar Meher
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India.
- Upendra Kumar Pradhan
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Padma Lochan Sethi
  - Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, 751003, Odisha, India
- Sanchita Naha
  - Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Ajit Gupta
  - Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
- Rajender Parsad
  - ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
|
38
|
Bai P, Li G, Luo J, Liang C. Deep learning model for protein multi-label subcellular localization and function prediction based on multi-task collaborative training. Brief Bioinform 2024; 25:bbae568. [PMID: 39489606 PMCID: PMC11531862 DOI: 10.1093/bib/bbae568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2024] [Revised: 09/24/2024] [Accepted: 10/22/2024] [Indexed: 11/05/2024] Open
Abstract
The functional study of proteins is a critical task in modern biology, playing a pivotal role in understanding the mechanisms of pathogenesis, developing new drugs, and discovering novel drug targets. However, existing computational models for subcellular localization face significant challenges, such as reliance on known Gene Ontology (GO) annotation databases or overlooking the relationship between GO annotations and subcellular localization. To address these issues, we propose DeepMTC, an end-to-end deep learning-based multi-task collaborative training model. DeepMTC integrates the interrelationship between subcellular localization and the functional annotation of proteins, leveraging multi-task collaborative training to eliminate dependence on known GO databases. This strategy gives DeepMTC a distinct advantage in predicting newly discovered proteins without prior functional annotations. First, DeepMTC leverages a pre-trained language model with high accuracy to obtain the 3D structure and sequence features of proteins. Additionally, it employs a graph transformer module to encode protein sequence features, addressing the problem of long-range dependencies in graph neural networks. Finally, DeepMTC uses a functional cross-attention mechanism to efficiently combine upstream learned functional features to perform the subcellular localization task. The experimental results demonstrate that DeepMTC outperforms state-of-the-art models in both protein function prediction and subcellular localization. Moreover, interpretability experiments revealed that DeepMTC can accurately identify the key residues and functional domains of proteins, confirming its superior performance. The code and dataset of DeepMTC are freely available at https://github.com/ghli16/DeepMTC.
Affiliation(s)
- Peihao Bai
  - School of Information and Software Engineering, East China Jiaotong University, No. 808 Shuanggang East Road, Nanchang 330013, China
- Guanghui Li
  - School of Information and Software Engineering, East China Jiaotong University, No. 808 Shuanggang East Road, Nanchang 330013, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, No. 2 Lushan Road, Changsha 410082, China
- Cheng Liang
  - School of Information Science and Engineering, Shandong Normal University, No. 1 University Road, Jinan 250358, China
  - Shandong Key Laboratory of Biophysics, Dezhou University, No. 566 University Road, Dezhou 253023, China
|
39
|
Mi J, Wang H, Li J, Sun J, Li C, Wan J, Zeng Y, Gao J. GGN-GO: geometric graph networks for predicting protein function by multi-scale structure features. Brief Bioinform 2024; 25:bbae559. [PMID: 39487084 PMCID: PMC11530295 DOI: 10.1093/bib/bbae559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Revised: 10/03/2024] [Accepted: 10/17/2024] [Indexed: 11/04/2024] Open
Abstract
Recent advances in high-throughput sequencing have led to an explosion of genomic and transcriptomic data, offering a wealth of protein sequence information. However, the functions of most proteins remain unannotated. Traditional experimental methods for annotation of protein functions are costly and time-consuming. Current deep learning methods typically rely on Graph Convolutional Networks to propagate features between protein residues. However, these methods fail to capture fine atomic-level geometric structural features and cannot directly compute or propagate structural features (such as distances, directions, and angles) when transmitting features, often simplifying them to scalars. Additionally, difficulties in capturing long-range dependencies limit the model's ability to identify key nodes (residues). To address these challenges, we propose a geometric graph network (GGN-GO) for predicting protein function that enriches feature extraction by capturing multi-scale geometric structural features at the atomic and residue levels. We use a geometric vector perceptron to convert these features into vector representations and aggregate them with node features for better understanding and propagation in the network. Moreover, we introduce a graph attention pooling layer that captures key node information by adaptively aggregating local functional motifs, while contrastive learning enhances graph representation discriminability through random noise and different views. The experimental results show that GGN-GO outperforms six comparative methods in tasks with the most labels for both experimentally validated and predicted protein structures. Furthermore, GGN-GO identifies functional residues corresponding to those experimentally confirmed, showcasing its interpretability and the ability to pinpoint key protein regions. The code and data are available at: https://github.com/MiJia-ID/GGN-GO.
Affiliation(s)
- Jia Mi
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
- Han Wang
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
- Jing Li
  - The College of Life Science and Technology, Beijing University of Chemical Technology, Beijing
- Jinghong Sun
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
- Chang Li
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
- Jing Wan
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
- Yuan Zeng
  - Microbial Resource and Big Data Center, Institute of Microbiology, Chinese Academy of Sciences
  - Chinese National Microbiology Data Center (NMDC)
- Jingyang Gao
  - The College of Information Science and Technology, Beijing University of Chemical Technology, Beijing
|
40
|
Barrios-Núñez I, Martínez-Redondo G, Medina-Burgos P, Cases I, Fernández R, Rojas A. Decoding functional proteome information in model organisms using protein language models. NAR Genom Bioinform 2024; 6:lqae078. [PMID: 38962255 PMCID: PMC11217674 DOI: 10.1093/nargab/lqae078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/18/2024] [Revised: 05/31/2024] [Accepted: 06/26/2024] [Indexed: 07/05/2024] Open
Abstract
Protein language models have been tested and proved to be reliable when used on curated datasets but have not yet been applied to full proteomes. Accordingly, we tested how two different machine learning-based methods performed when decoding functional information from the proteomes of selected model organisms. We found that protein language models are more precise and informative than deep learning methods for all the species tested and across the three gene ontologies studied, and that they better recover functional information from transcriptomic experiments. The results obtained indicate that these language models are likely to be suitable for large-scale annotation and downstream analyses, and we recommend a guide for their use.
Affiliation(s)
- Israel Barrios-Núñez
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Patricia Medina-Burgos
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Ildefonso Cases
  - Bioinformatics Unit, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
- Rosa Fernández
  - Metazoa Phylogenomics Lab, Institute of Evolutionary Biology (CSIC-UPF), 08003 Barcelona, Spain
- Ana M Rojas
  - Computational Biology and Bioinformatics Group, Andalusian Center for Developmental Biology (CABD-CSIC), 41013 Sevilla, Spain
|
41
|
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024] Open
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present PhiGnet, a method that utilizes statistics-informed graph networks to predict a protein's functions solely from its sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
  - Department of Biochemistry, University of Oxford, Oxford, UK.
  - AmoAi Technologies, Oxford, UK.
- Qi-Qi Qin
  - AmoAi Technologies, Oxford, UK
  - School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Si-Yu Huang
  - AmoAi Technologies, Oxford, UK
  - Oxford Martin School, University of Oxford, Oxford, UK
  - School of Systems Science, Beijing Normal University, Beijing, China
- Xue-Ming Ding
  - School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Benoît Kornmann
  - Department of Biochemistry, University of Oxford, Oxford, UK.
|
42
|
Khandelwal M, Kumar Rout R. DeepPRMS: advanced deep learning model to predict protein arginine methylation sites. Brief Funct Genomics 2024; 23:452-463. [PMID: 38267081 DOI: 10.1093/bfgp/elae001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 11/17/2023] [Accepted: 01/03/2024] [Indexed: 01/26/2024] Open
Abstract
Protein methylation is a form of post-translational modification of proteins, which is crucial for various cellular processes, including transcription activity and DNA repair. Correctly predicting protein methylation sites is fundamental for research and drug discovery. Some experimental techniques, such as methyl-specific antibodies, chromatin immunoprecipitation and mass spectrometry, exist for identifying protein methylation sites, but these techniques are time-consuming and costly. The ability to predict methylation sites using in silico techniques may help researchers identify potential candidate sites for future examination and make it easier to carry out site-specific investigations and downstream characterizations. In this research, we proposed a novel deep learning-based predictor, named DeepPRMS, to identify protein methylation sites in primary sequences. The DeepPRMS utilizes the gated recurrent unit (GRU) and convolutional neural network (CNN) algorithms to extract the sequential and spatial information from the primary sequences. GRU is used to extract sequential information, while CNN is used for spatial information. We combined the latent representation of GRU and CNN models to have a better interaction among them. Based on the independent test data set, DeepPRMS obtained an accuracy of 85.32%, a specificity of 84.94%, a Matthews correlation coefficient of 0.71 and a sensitivity of 85.80%. The results indicate that DeepPRMS can predict protein methylation sites with high accuracy and outperform the state-of-the-art models. The DeepPRMS is expected to effectively guide future research experiments for identifying potential methylated protein sites. The web server is available at http://deepprms.nitsri.ac.in/.
Affiliation(s)
- Monika Khandelwal
  - Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, Srinagar 190006, Jammu and Kashmir, India
- Ranjeet Kumar Rout
  - Computer Science & Engineering, National Institute of Technology Srinagar, Hazratbal, Srinagar 190006, Jammu and Kashmir, India
|
43
|
Truong-Quoc C, Lee JY, Kim KS, Kim DN. Prediction of DNA origami shape using graph neural network. Nat Mater 2024; 23:984-992. [PMID: 38486095 DOI: 10.1038/s41563-024-01846-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Accepted: 02/22/2024] [Indexed: 07/10/2024]
Abstract
Unlike proteins, which have a wealth of validated structural data, experimentally or computationally validated DNA origami datasets are limited. Here we present a graph neural network that can predict the three-dimensional conformation of DNA origami assemblies both rapidly and accurately. We develop a hybrid data-driven and physics-informed approach for model training, designed to minimize not only the data-driven loss but also the physics-informed loss. By employing an ensemble strategy, the model can successfully infer the shape of monomeric DNA origami structures almost in real time. Further refinement of the model in an unsupervised manner enables the analysis of supramolecular assemblies consisting of tens to hundreds of DNA blocks. The proposed model enables an automated inverse design of DNA origami structures for given target shapes. Our approach facilitates the real-time virtual prototyping of DNA origami, broadening its design space.
Affiliation(s)
- Chien Truong-Quoc
  - Department of Mechanical Engineering, Seoul National University, Seoul, Korea
- Jae Young Lee
  - Institute of Advanced Machines and Design, Seoul National University, Seoul, Korea
- Kyung Soo Kim
  - Department of Mechanical Engineering, Seoul National University, Seoul, Korea
- Do-Nyun Kim
  - Department of Mechanical Engineering, Seoul National University, Seoul, Korea.
  - Institute of Advanced Machines and Design, Seoul National University, Seoul, Korea.
  - Institute of Engineering Research, Seoul National University, Seoul, Korea.
|
44
|
Chen Z, Luo Q. DualNetGO: a dual network model for protein function prediction via effective feature selection. Bioinformatics 2024; 40:btae437. [PMID: 38963311 PMCID: PMC11538015 DOI: 10.1093/bioinformatics/btae437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 06/05/2024] [Accepted: 07/03/2024] [Indexed: 07/05/2024] Open
Abstract
MOTIVATION Protein-protein interaction (PPI) networks are crucial for automatically annotating protein functions. As multiple PPI networks exist for the same set of proteins that capture properties from different aspects, it is a challenging task to effectively utilize these heterogeneous networks. Recently, several deep learning models have combined PPI networks from all evidence, or concatenated all graph embeddings for protein function prediction. However, the lack of a judicious selection procedure prevents the effective harnessing of information from different PPI networks, as these networks vary in densities, structures, and noise levels. Consequently, combining protein features indiscriminately could increase the noise level, leading to decreased model performance. RESULTS We develop DualNetGO, a dual-network model comprised of a Classifier and a Selector, to predict protein functions by effectively selecting features from different sources including graph embeddings of PPI networks, protein domain, and subcellular location information. Evaluation of DualNetGO on human and mouse datasets in comparison with other network-based models shows at least 4.5%, 6.2%, and 14.2% improvement on Fmax in BP, MF, and CC gene ontology categories, respectively, for human, and 3.3%, 10.6%, and 7.7% improvement on Fmax for mouse. We demonstrate the generalization capability of our model by training and testing on the CAFA3 data, and show its versatility by incorporating Esm2 embeddings. We further show that our model is insensitive to the choice of graph embedding method and is time- and memory-saving. These results demonstrate that combining a subset of features including PPI networks and protein attributes selected by our model is more effective in utilizing PPI network information than only using one kind of or concatenating graph embeddings from all kinds of PPI networks.
AVAILABILITY AND IMPLEMENTATION The source code of DualNetGO and some of the experiment data are available at: https://github.com/georgedashen/DualNetGO.
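Several abstracts in this listing, including the DualNetGO entry, report Fmax. As a rough illustration, the sketch below computes the protein-centric Fmax commonly used in CAFA-style evaluations: the maximum F1 over decision thresholds, with precision averaged only over proteins that have at least one prediction at that threshold. The exact evaluation protocol of any individual paper may differ, and the toy labels and scores are hypothetical.

```python
def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true:  list of sets of true terms, one set per protein (non-empty).
    y_score: list of dicts mapping term -> predicted score in [0, 1].
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            predicted = {term for term, s in scores.items() if s >= t}
            tp = len(predicted & truth)
            # Precision counts only proteins with at least one prediction.
            if predicted:
                precisions.append(tp / len(predicted))
            # Recall is averaged over every protein.
            recalls.append(tp / len(truth))
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best

# Toy example with hypothetical term labels "A", "B", "C".
y_true = [{"A", "B"}, {"C"}]
y_score = [{"A": 0.9, "B": 0.4, "C": 0.2}, {"C": 0.8, "A": 0.85}]
result = fmax(y_true, y_score)
```

In this toy case the best trade-off occurs at thresholds between 0.2 and 0.4, where the first protein is predicted perfectly and the second carries one false positive.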
Affiliation(s)
- Zhuoyang Chen
  - Data Science and Analytics Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 511400, China
- Qiong Luo
  - Data Science and Analytics Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, Guangdong, 511400, China
  - HKUST, Hong Kong SAR, China
|
45
|
de Oliveira GB, Pedrini H, Dias Z. Integrating Transformers and AutoML for Protein Function Prediction. Annu Int Conf IEEE Eng Med Biol Soc 2024; 2024:1-5. [PMID: 40039729 DOI: 10.1109/embc53108.2024.10782139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2025]
Abstract
The next-generation sequencing technology and the decreasing cost of experimental verification of proteins made the accumulation of sequenced proteins in recent years possible. However, determining protein function is still difficult due to the cost and time required for this analysis. For that reason, computational methods have been developed to automatically assign annotations to proteins. In this work, we present MAGO, an approach based on Transformers and AutoML, and MAGO+, an ensemble of MAGO with BLASTp, to deal with this task. MAGO and MAGO+ surpassed state-of-the-art methods based on machine learning and ensemble methods combining local alignment tools and machine learning algorithms, improving the results based on Fmax and presenting statistically significant differences with the compared approaches.
|
46
|
Zhapa-Camacho F, Tang Z, Kulmanov M, Hoehndorf R. Predicting protein functions using positive-unlabeled ranking with ontology-based priors. Bioinformatics 2024; 40:i401-i409. [PMID: 38940168 PMCID: PMC11211813 DOI: 10.1093/bioinformatics/btae237] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled set of protein function annotations are negatives, inducing the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e. we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. AVAILABILITY AND IMPLEMENTATION Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
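The positive-unlabeled framing in the PU-GO abstract can be made concrete with a generic example. The sketch below implements the non-negative PU risk estimator from the general PU-learning literature with a sigmoid surrogate loss and a single scalar class prior. It is not PU-GO's ranking loss: the abstract says PU-GO derives its priors from the Gene Ontology hierarchy, whereas `prior` here is simply a hypothetical constant.

```python
import math

def sigmoid_loss(score, label):
    """Sigmoid surrogate loss; label is +1 or -1."""
    return 1.0 / (1.0 + math.exp(label * score))

def _mean(values):
    values = list(values)
    return sum(values) / len(values)

def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk for classifier scores.

    pos_scores: scores of labeled positive samples.
    unl_scores: scores of unlabeled samples (a mix of both classes).
    prior:      assumed fraction of positives among the unlabeled data.
    """
    r_p_pos = _mean(sigmoid_loss(s, +1) for s in pos_scores)
    r_p_neg = _mean(sigmoid_loss(s, -1) for s in pos_scores)
    r_u_neg = _mean(sigmoid_loss(s, -1) for s in unl_scores)
    # The negative-class risk is estimated from the unlabeled data minus
    # the positives hidden in it; clipping at zero keeps the estimate
    # from going negative.
    return prior * r_p_pos + max(0.0, r_u_neg - prior * r_p_neg)

# Illustrative scores only.
risk = nn_pu_risk(pos_scores=[2.0, 1.5], unl_scores=[-1.0, 0.5, -2.0], prior=0.3)
```

Treating every unlabeled annotation as a negative corresponds to dropping the correction term, which is the false-negative issue the abstract describes.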
Affiliation(s)
- Fernando Zhapa-Camacho
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Zhenwei Tang
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 1A1, Canada
- Maxat Kulmanov
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
- Robert Hoehndorf
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
  - SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Saudi Arabia
|
47
|
Curion F, Theis FJ. Machine learning integrative approaches to advance computational immunology. Genome Med 2024; 16:80. [PMID: 38862979 PMCID: PMC11165829 DOI: 10.1186/s13073-024-01350-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Accepted: 05/23/2024] [Indexed: 06/13/2024] Open
Abstract
The study of immunology, traditionally reliant on proteomics to evaluate individual immune cells, has been revolutionized by single-cell RNA sequencing. Computational immunologists play a crucial role in analysing these datasets, moving beyond traditional protein marker identification to encompass a more detailed view of cellular phenotypes and their functional roles. Recent technological advancements allow the simultaneous measurements of multiple cellular components (transcriptome, proteome, chromatin, epigenetic modifications and metabolites) within single cells, including in spatial contexts within tissues. This has led to the generation of complex multiscale datasets that can include multimodal measurements from the same cells or a mix of paired and unpaired modalities. Modern machine learning (ML) techniques allow for the integration of multiple "omics" data without the need for extensive independent modelling of each modality. This review focuses on recent advancements in ML integrative approaches applied to immunological studies. We highlight the importance of these methods in creating a unified representation of multiscale data collections, particularly for single-cell and spatial profiling technologies. Finally, we discuss the challenges of these holistic approaches and how they will be instrumental in the development of a common coordinate framework for multiscale studies, thereby accelerating research and enabling discoveries in the computational immunology field.
Affiliation(s)
- Fabiola Curion
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany.
- Department of Mathematics, School of Computation, Information and Technology, Technical University of Munich, Munich, Germany.
- School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
| |
|
48
|
Liu Y, Zhang Y, Chen Z, Peng J. POLAT: Protein function prediction based on soft mask graph network and residue-Label ATtention. Comput Biol Chem 2024; 110:108064. [PMID: 38677014 DOI: 10.1016/j.compbiolchem.2024.108064] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Revised: 01/19/2024] [Accepted: 03/26/2024] [Indexed: 04/29/2024]
Abstract
MOTIVATION: Elucidating protein function is a central problem in biochemistry, genetics, and molecular biology. Developing computational methods for protein function prediction is critical due to the significant gap between sequence and functional data. Recent advances in protein structure prediction, which strongly correlates with function, make it feasible to use structure to predict function. However, current structure-based methods overlook the fact that individual residues may contribute differently to the protein's function and do not take into account the correlation between protein residues and their functions. The challenge of effectively utilizing the relationship between protein residues and function-level information to predict protein function remains unsolved. RESULT: We propose a protein function prediction method based on Soft Mask Graph Networks and Residue-Label Attention (POLAT), which combines sequence features, predicted structure features, and function-level information to obtain an accurate prediction. We use soft mask graph networks to adaptively extract the residues relevant to functions. A residue-label attention mechanism is adopted to obtain the protein-level encoded features of a protein, which are then concatenated with a protein-level embedding and fed into a dense classifier to determine the probability of each function. POLAT achieves 0.670, 0.515, 0.578 Fmax and 0.677, 0.409, 0.507 AUPR on the PDB cdhit test set for the MFO, BPO, and CCO domains, respectively, outperforming the existing structure-based SOTA method GAT-GO (Fmax 0.633, 0.492, 0.547; AUPR 0.660, 0.381, 0.479). POLAT is also competitive in extensive experiments among sequence-based and multimodal methods and achieves SOTA performance in three out of six metrics.
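The abstract above reports Fmax without defining it. The following is a minimal sketch, assuming the protein-centric definition commonly used in CAFA-style evaluation: at each score threshold, precision is averaged over proteins with at least one predicted term, recall over proteins with at least one true term, and Fmax is the largest resulting F-score. The function name `fmax`, the threshold grid, and the input layout (one row per protein, one column per function term) are illustrative assumptions, not taken from the paper.

```python
def fmax(y_true, y_score, n_thresholds=100):
    """Protein-centric Fmax (assumed CAFA-style definition).

    y_true  : list of rows, one per protein, with 0/1 labels per term
    y_score : list of rows of the same shape, with scores in [0, 1]
    """
    best = 0.0
    for k in range(1, n_thresholds + 1):
        t = k / n_thresholds  # thresholds 0.01, 0.02, ..., 1.00
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            predicted = {j for j, s in enumerate(scores) if s >= t}
            actual = {j for j, v in enumerate(truth) if v}
            tp = len(predicted & actual)
            # Precision counts only proteins with at least one prediction.
            if predicted:
                precisions.append(tp / len(predicted))
            # Recall counts only proteins with at least one true term.
            if actual:
                recalls.append(tp / len(actual))
        if not precisions or not recalls:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            best = max(best, 2 * p * r / (p + r))
    return best
```

Called as `fmax(labels, scores)`, the function returns a single value in [0, 1]. Published benchmarks may differ in details such as ontology propagation of labels or the threshold grid, so values from this sketch are not directly comparable to the reported numbers.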
Affiliation(s)
- Yang Liu
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
| | - Yi Zhang
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
| | - ZiHao Chen
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
| | - Jing Peng
- Intelligent Bioinformatics Laboratory, School of Computer and Artificial Intelligence, Wuhan University of Technology, Wuhan, 430070, China.
| |
|
49
|
Lin B, Luo X, Liu Y, Jin X. A comprehensive review and comparison of existing computational methods for protein function prediction. Brief Bioinform 2024; 25:bbae289. [PMID: 39003530 PMCID: PMC11246557 DOI: 10.1093/bib/bbae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 05/18/2024] [Indexed: 07/15/2024] Open
Abstract
Protein function prediction is critical for understanding cellular physiological and biochemical processes, and it opens up new possibilities for advancements in fields such as disease research and drug discovery. During the past decades, with the exponential growth of protein sequence data, many computational methods for predicting protein function have been proposed. Therefore, a systematic review and comparison of these methods is necessary. In this study, we divide these methods into four categories: sequence-based methods, 3D structure-based methods, PPI network-based methods and hybrid information-based methods. Furthermore, their advantages and disadvantages are discussed, and their performance is comprehensively evaluated and compared. Finally, we discuss the challenges and opportunities present in this field.
Affiliation(s)
- Baohui Lin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| | - Xiaoling Luo
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, Guangdong, China
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518061, China
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| |
|
50
|
Ansari M, White AD. Learning peptide properties with positive examples only. DIGITAL DISCOVERY 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in classical supervised learning frameworks. Notably, most peptide databases come with missing information and a low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled (PU) learning. In particular, we use the two learning strategies of adapting the base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that, by using only the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
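The "reliable negative identification" strategy named in the abstract can be illustrated with a generic two-step PU scheme. The sketch below is an assumption-laden illustration, not the authors' method: it swaps their deep learning models for a small logistic regression written in plain Python, and the names `two_step_pu`, `neg_fraction`, and `make_cloud` are hypothetical. Step 1 trains a classifier with all unlabeled examples treated as negatives; step 2 keeps the lowest-scoring unlabeled examples as reliable negatives and retrains on positives against them.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, lr=0.1, epochs=300):
    # Plain batch gradient descent on the logistic loss.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def score(x, model):
    # Predicted probability that x is positive.
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def two_step_pu(X_pos, X_unl, neg_fraction=0.3):
    # Step 1: treat every unlabeled example as a negative.
    first = fit_logreg(X_pos + X_unl, [1.0] * len(X_pos) + [0.0] * len(X_unl))
    # Rank unlabeled examples; the lowest scores become reliable negatives.
    ranked = sorted(range(len(X_unl)), key=lambda i: score(X_unl[i], first))
    n_neg = max(1, int(neg_fraction * len(X_unl)))
    reliable = ranked[:n_neg]
    # Step 2: retrain on positives versus reliable negatives only.
    X2 = X_pos + [X_unl[i] for i in reliable]
    y2 = [1.0] * len(X_pos) + [0.0] * n_neg
    return fit_logreg(X2, y2), reliable


def make_cloud(rng, center, n):
    # Hypothetical toy data: 2-D Gaussian cloud around (center, center).
    return [[rng.gauss(center, 1.0), rng.gauss(center, 1.0)] for _ in range(n)]
```

For a toy run, `two_step_pu(make_cloud(rng, 3.0, 40), make_cloud(rng, 3.0, 40) + make_cloud(rng, -3.0, 40))` with `rng = random.Random(0)` returns the retrained model and the indices of the unlabeled examples chosen as reliable negatives. Real peptide data would need a sequence encoding step, which is outside this sketch.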
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
| | - Andrew D White
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
| |
|