1
|
Wang B, Cui B, Chen S, Wang X, Wang Y, Li J. MSNGO: multi-species protein function annotation based on 3D protein structure and network propagation. Bioinformatics 2025; 41:btaf285. [PMID: 40327458 PMCID: PMC12122197 DOI: 10.1093/bioinformatics/btaf285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Revised: 04/09/2025] [Accepted: 05/06/2025] [Indexed: 05/08/2025] Open
Abstract
MOTIVATION In recent years, protein function prediction has broken through the bottleneck of sequence features, significantly improving prediction accuracy using high-precision protein structures predicted by AlphaFold2. While single-species protein function prediction methods have achieved remarkable success, multi-species approaches still face challenges such as difficulties in multi-source data integration and insufficient knowledge transfer between distantly-related species. How to integrate large-scale data and provide effective cross-species label propagation for species with sparse protein annotations remains a critical and unresolved challenge. To address this problem, we propose the MSNGO (Multi-species protein Structures and Network to predict GO terms) model, which integrates structural features and network propagation methods. Our validation shows that using structural features can significantly improve the accuracy of multi-species protein function prediction. RESULTS We employ graph representation learning techniques to extract amino acid representations from protein structure contact maps and train a structural model using a graph convolution pooling module to derive protein-level structural features. After incorporating the sequence features from ESM-2, we apply a network propagation algorithm to aggregate information and update node representations within a heterogeneous network. The results demonstrate that MSNGO outperforms previous multi-species protein function prediction methods that rely on sequence features and protein-protein networks. AVAILABILITY AND IMPLEMENTATION https://github.com/blingbell/MSNGO.
Collapse
Affiliation(s)
- Beibei Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Boyue Cui
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Shiqu Chen
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Xuan Wang
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
| | - Yadong Wang
- Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| | - Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
| |
Collapse
|
2
|
Zhang H, Sun Y, Wang Y, Luo X, Liu Y, Chen B, Jin X, Zhu D. GTPLM-GO: Enhancing Protein Function Prediction Through Dual-Branch Graph Transformer and Protein Language Model Fusing Sequence and Local-Global PPI Information. Int J Mol Sci 2025; 26:4088. [PMID: 40362328 PMCID: PMC12072039 DOI: 10.3390/ijms26094088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2025] [Revised: 04/21/2025] [Accepted: 04/23/2025] [Indexed: 05/15/2025] Open
Abstract
Currently, protein-protein interaction (PPI) networks have become an essential data source for protein function prediction. However, methods utilizing graph neural networks (GNNs) face significant challenges in modeling PPI networks. A primary issue is over-smoothing, which occurs when multiple GNN layers are stacked to capture global information. This architectural limitation inherently impairs the integration of local and global information within PPI networks, thereby limiting the accuracy of protein function prediction. To effectively utilize information within PPI networks, we propose GTPLM-GO, a protein function prediction method based on a dual-branch Graph Transformer and protein language model. The dual-branch Graph Transformer achieves the collaborative modeling of local and global information in PPI networks through two branches: a graph neural network and a linear attention-based Transformer encoder. GTPLM-GO integrates local-global PPI information with the functional semantic encoding constructed by the protein language model, overcoming the issue of inadequate information extraction in existing methods. Experimental results demonstrate that GTPLM-GO outperforms advanced network-based and sequence-based methods on PPI network datasets of varying scales.
Collapse
Affiliation(s)
- Haotian Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; (H.Z.); (Y.S.); (Y.W.); (B.C.)
| | - Yundong Sun
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; (H.Z.); (Y.S.); (Y.W.); (B.C.)
- Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin 150001, China
| | - Yansong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; (H.Z.); (Y.S.); (Y.W.); (B.C.)
| | - Xiaoling Luo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China;
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China;
| | - Bin Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; (H.Z.); (Y.S.); (Y.W.); (B.C.)
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen 518118, China;
| | - Dongjie Zhu
- School of Computer Science and Technology, Harbin Institute of Technology, Weihai 264209, China; (H.Z.); (Y.S.); (Y.W.); (B.C.)
| |
Collapse
|
3
|
Wang Y, Sun Y, Lin B, Zhang H, Luo X, Liu Y, Jin X, Zhu D. SEGT-GO: a graph transformer method based on PPI serialization and explanatory artificial intelligence for protein function prediction. BMC Bioinformatics 2025; 26:46. [PMID: 39930351 PMCID: PMC11808960 DOI: 10.1186/s12859-025-06059-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2024] [Accepted: 01/20/2025] [Indexed: 02/14/2025] Open
Abstract
BACKGROUND A massive amount of protein sequences have been obtained, but their functions remain challenging to discern. In recent research on protein function prediction, Protein-Protein Interaction (PPI) Networks have played a crucial role. Uncovering potential function relationships between distant proteins within PPI networks is essential for improving the accuracy of protein function prediction. Most current studies attempt to capture these distant relationships by stacking graph network layers, but performance gains diminish as the number of layers increases. RESULTS To further explore the potential functional relationships between multi-hop proteins in PPI networks, this paper proposes SEGT-GO, a Graph Transformer method based on PPI multi-hop neighborhood Serialization and Explainable artificial intelligence for large-scale multispecies protein function prediction. The multi-hop neighborhood serialization maps multi-hop information in the PPI Network into serialized feature embeddings, enabling the Graph Transformer to learn deeper functional features within the PPI Network. Based on game theory, the SHAP eXplainable Artificial Intelligence (XAI) framework optimizes model input and filters out feature noise, enhancing model performance. CONCLUSIONS Compared to the advanced network method DeepGraphGO, SEGT-GO achieves more competitive results in standard large-scale datasets and superior results on small ones, validating its ability to extract functional information from deep proteins. Furthermore, SEGT-GO achieves superior results in cross-species learning and prediction of the functions of unseen proteins, further proving the method's strong generalization.
Collapse
Affiliation(s)
- Yansong Wang
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
| | - Yundong Sun
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
- Department of Electronic Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
| | - Baohui Lin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China
| | - Haotian Zhang
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China
| | - Xiaoling Luo
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, 518060, China
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, 518118, China.
| | - Dongjie Zhu
- School of Computer Science and Technology, Harbin Institute of Technology Weihai Campus, Weihai, 264209, China.
| |
Collapse
|
4
|
Lee Y, Gao P, Xu Y, Wang Z, Li S, Chen J. MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network. Bioinformatics 2025; 41:btaf032. [PMID: 39847542 PMCID: PMC11810639 DOI: 10.1093/bioinformatics/btaf032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2024] [Revised: 01/13/2025] [Accepted: 01/21/2025] [Indexed: 01/25/2025] Open
Abstract
MOTIVATION The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data. RESULTS We introduce Multi-scalE Graph Adaptive neural network (MEGA-GO), highlighting the capability of capturing diverse protein sequence length features from multiple scales. The unique graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding 33.4%, 68.9%, and 44.6% of area under the precision-recall curve on biological process, molecular function, and cellular component domains, respectively. The rest of the experimental results reveal that our model consistently surpasses the state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.
Collapse
Affiliation(s)
- Yujian Lee
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Peng Gao
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Yongqi Xu
- Department of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510520, China
| | - Ziyang Wang
- Department of Science of Chinese Materia Medica, Guangdong Medical University, Dongguan 524023, China
| | - Shuaicheng Li
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Jiaxing Chen
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| |
Collapse
|
5
|
Boadu F, Lee A, Cheng J. Deep learning methods for protein function prediction. Proteomics 2025; 25:e2300471. [PMID: 38996351 PMCID: PMC11735672 DOI: 10.1002/pmic.202300471] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/14/2024]
Abstract
Predicting protein function from protein sequence, structure, interaction, and other relevant information is important for generating hypotheses for biological experiments and studying biological systems, and therefore has been a major challenge in protein bioinformatics. Numerous computational methods had been developed to advance protein function prediction gradually in the last two decades. Particularly, in the recent years, leveraging the revolutionary advances in artificial intelligence (AI), more and more deep learning methods have been developed to improve protein function prediction at a faster pace. Here, we provide an in-depth review of the recent developments of deep learning methods for protein function prediction. We summarize the significant advances in the field, identify several remaining major challenges to be tackled, and suggest some potential directions to explore. The data sources and evaluation metrics widely used in protein function prediction are also discussed to assist the machine learning, AI, and bioinformatics communities to develop more cutting-edge methods to advance protein function prediction.
Collapse
Affiliation(s)
- Frimpong Boadu
- Department of Electrical Engineering and Computer ScienceUniversity of MissouriColumbiaMissouriUSA
| | - Ahhyun Lee
- Department of Electrical Engineering and Computer ScienceUniversity of MissouriColumbiaMissouriUSA
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer ScienceUniversity of MissouriColumbiaMissouriUSA
| |
Collapse
|
6
|
Vu TTD, Kim J, Jung J. An experimental analysis of graph representation learning for Gene Ontology based protein function prediction. PeerJ 2024; 12:e18509. [PMID: 39553733 PMCID: PMC11569786 DOI: 10.7717/peerj.18509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2024] [Accepted: 10/21/2024] [Indexed: 11/19/2024] Open
Abstract
Understanding protein function is crucial for deciphering biological systems and facilitating various biomedical applications. Computational methods for predicting Gene Ontology functions of proteins emerged in the 2000s to bridge the gap between the number of annotated proteins and the rapidly growing number of newly discovered amino acid sequences. Recently, there has been a surge in studies applying graph representation learning techniques to biological networks to enhance protein function prediction tools. In this review, we provide fundamental concepts in graph embedding algorithms. This study described graph representation learning methods for protein function prediction based on four principal data categories, namely PPI network, protein structure, Gene Ontology graph, and integrated graph. The commonly used approaches for each category were summarized and diagrammed, with the specific results of each method explained in detail. Finally, existing limitations and potential solutions were discussed, and directions for future research within the protein research community were suggested.
Collapse
Affiliation(s)
- Thi Thuy Duong Vu
- Faculty of Fundamental Sciences, University of Medicine and Pharmacy at Ho Chi Minh City, Ho Chi Minh City, Vietnam
| | - Jeongho Kim
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
| | - Jaehee Jung
- Department of Information and Communication Engineering, Myongji University, Yongin, Republic of South Korea
| |
Collapse
|
7
|
Lin B, Luo X, Liu Y, Jin X. A comprehensive review and comparison of existing computational methods for protein function prediction. Brief Bioinform 2024; 25:bbae289. [PMID: 39003530 PMCID: PMC11246557 DOI: 10.1093/bib/bbae289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 05/18/2024] [Indexed: 07/15/2024] Open
Abstract
Protein function prediction is critical for understanding the cellular physiological and biochemical processes, and it opens up new possibilities for advancements in fields such as disease research and drug discovery. During the past decades, with the exponential growth of protein sequence data, many computational methods for predicting protein function have been proposed. Therefore, a systematic review and comparison of these methods are necessary. In this study, we divide these methods into four different categories, including sequence-based methods, 3D structure-based methods, PPI network-based methods and hybrid information-based methods. Furthermore, their advantages and disadvantages are discussed, and then their performance is comprehensively evaluated and compared. Finally, we discuss the challenges and opportunities present in this field.
Collapse
Affiliation(s)
- Baohui Lin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| | - Xiaoling Luo
- Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Shenzhen, Guangdong, China
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, Guangdong 518061, China
| | - Yumeng Liu
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| | - Xiaopeng Jin
- College of Big Data and Internet, Shenzhen Technology University, Shenzhen, Guangdong 518118, China
| |
Collapse
|
8
|
Ye Z, Mao D, Wang Y, Deng H, Liu X, Zhang T, Han Z, Zhang X. Comparative Genome-Wide Identification of the Fatty Acid Desaturase Gene Family in Tea and Oil Tea. PLANTS (BASEL, SWITZERLAND) 2024; 13:1444. [PMID: 38891253 PMCID: PMC11174766 DOI: 10.3390/plants13111444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/10/2024] [Revised: 05/07/2024] [Accepted: 05/16/2024] [Indexed: 06/21/2024]
Abstract
Camellia oil is valuable as an edible oil and serves as a base material for a range of high-value products. Camellia plants of significant economic importance, such as Camellia sinensis and Camellia oleifera, have been classified into sect. Thea and sect. Oleifera, respectively. Fatty acid desaturases play a crucial role in catalyzing the formation of double bonds at specific positions of fatty acid chains, leading to the production of unsaturated fatty acids and contributing to lipid synthesis. Comparative genomics results have revealed that expanded gene families in oil tea are enriched in functions related to lipid, fatty acid, and seed processes. To explore the function of the FAD gene family, a total of 82 FAD genes were identified in tea and oil tea. Transcriptome data showed the differential expression of the FAD gene family in mature seeds of tea tree and oil tea tree. Furthermore, the structural analysis and clustering of FAD proteins provided insights for the further exploration of the function of the FAD gene family and its role in lipid synthesis. Overall, these findings shed light on the role of the FAD gene family in Camellia plants and their involvement in lipid metabolism, as well as provide a reference for understanding their function in oil synthesis.
Collapse
Affiliation(s)
- Ziqi Ye
- The Laboratory of Forestry Genetics, Central South University of Forestry and Technology, Changsha 410004, China; (Z.Y.); (H.D.); (X.L.); (T.Z.)
| | - Dan Mao
- National Forest and Seedling Workstation of Hunan Province, The Forestry Department of Hunan Province, Changsha 410004, China; (D.M.); (Y.W.)
| | - Yujian Wang
- National Forest and Seedling Workstation of Hunan Province, The Forestry Department of Hunan Province, Changsha 410004, China; (D.M.); (Y.W.)
| | - Hongda Deng
- The Laboratory of Forestry Genetics, Central South University of Forestry and Technology, Changsha 410004, China; (Z.Y.); (H.D.); (X.L.); (T.Z.)
| | - Xing Liu
- The Laboratory of Forestry Genetics, Central South University of Forestry and Technology, Changsha 410004, China; (Z.Y.); (H.D.); (X.L.); (T.Z.)
| | - Tongyue Zhang
- The Laboratory of Forestry Genetics, Central South University of Forestry and Technology, Changsha 410004, China; (Z.Y.); (H.D.); (X.L.); (T.Z.)
| | - Zhiqiang Han
- The Laboratory of Forestry Genetics, Central South University of Forestry and Technology, Changsha 410004, China; (Z.Y.); (H.D.); (X.L.); (T.Z.)
| | - Xingtan Zhang
- Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518000, China
| |
Collapse
|