1
Zhao Y, Yang Z, Wang L, Zhang Y, Lin H, Wang J. Predicting Protein Functions Based on Heterogeneous Graph Attention Technique. IEEE J Biomed Health Inform 2024; 28:2408-2415. [PMID: 38319781 DOI: 10.1109/jbhi.2024.3357834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/08/2024]
Abstract
In bioinformatics, protein function prediction stands as a fundamental area of research and plays a crucial role in addressing various biological challenges, such as the identification of potential targets for drug discovery and the elucidation of disease mechanisms. However, known functional annotation databases usually provide positive experimental annotations that proteins carry out a given function, and rarely record negative experimental annotations that proteins do not carry out a given function. Therefore, existing computational methods based on deep learning models focus on these positive annotations for prediction and ignore these scarce but informative negative annotations, leading to an underestimation of precision. To address this issue, we introduce a deep learning method that utilizes a heterogeneous graph attention technique. The method first constructs a heterogeneous graph that covers the protein-protein interaction network, ontology structure, and positive and negative annotation information. Then, it learns embedding representations of proteins and ontology terms by using the heterogeneous graph attention technique. Finally, it leverages these learned representations to reconstruct the positive protein-term associations and score unobserved functional annotations. It can enhance the predictive performance by incorporating these known limited negative annotations into the constructed heterogeneous graph. Experimental results on three species (i.e., Human, Mouse, and Arabidopsis) demonstrate that our method can achieve better performance in predicting new protein annotations than state-of-the-art methods.
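The graph-construction step described in this abstract can be pictured with a minimal sketch. It assumes a plain-Python representation with four edge types; the function name `build_hetero_graph`, the edge-type labels, and the protein and term identifiers are all hypothetical, and the attention-based embedding learning itself is not shown.

```python
from collections import defaultdict


def build_hetero_graph(ppi, ontology, positive, negative):
    """Collect typed nodes and edges for a protein/term heterogeneous graph.

    ppi:       (protein, protein) interaction pairs
    ontology:  (child_term, parent_term) pairs
    positive:  (protein, term) pairs: protein carries out the function
    negative:  (protein, term) pairs: protein does not carry out the function
    """
    nodes = {"protein": set(), "term": set()}
    edges = defaultdict(list)

    for a, b in ppi:
        nodes["protein"].update((a, b))
        edges["interacts"].append((a, b))
    for child, parent in ontology:
        nodes["term"].update((child, parent))
        edges["is_a"].append((child, parent))
    # Positive and negative annotations are kept as separate edge types so a
    # downstream model can treat them differently.
    for edge_type, pairs in (("annotated", positive), ("not_annotated", negative)):
        for protein, term in pairs:
            nodes["protein"].add(protein)
            nodes["term"].add(term)
            edges[edge_type].append((protein, term))
    return nodes, dict(edges)


# Toy input; every identifier here is made up for illustration.
nodes, edges = build_hetero_graph(
    ppi=[("P1", "P2")],
    ontology=[("T2", "T1")],
    positive=[("P1", "T2")],
    negative=[("P2", "T2")],
)
print(sorted(nodes["protein"]), sorted(edges))
```

Keeping negative annotations as their own edge type is one possible way to make the scarce negative evidence visible to a graph model; the paper's actual encoding may differ.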
2
Zhang C, Zhang X, Freddolino L, Zhang Y. BioLiP2: an updated structure database for biologically relevant ligand-protein interactions. Nucleic Acids Res 2024; 52:D404-D412. [PMID: 37522378 PMCID: PMC10767969 DOI: 10.1093/nar/gkad630] [Citation(s) in RCA: 31] [Impact Index Per Article: 31.0] [Received: 06/01/2023] [Revised: 07/03/2023] [Accepted: 07/17/2023] [Indexed: 08/01/2023] Open
Abstract
With the progress of structural biology, the Protein Data Bank (PDB) has witnessed rapid accumulation of experimentally solved protein structures. Since many structures are determined with purification and crystallization additives that are unrelated to a protein's in vivo function, it is nontrivial to identify the subset of protein-ligand interactions that are biologically relevant. We developed the BioLiP2 database (https://zhanggroup.org/BioLiP) to extract biologically relevant protein-ligand interactions from the PDB database. BioLiP2 assesses the functional relevance of the ligands by geometric rules and experimental literature validations. The ligand binding information is further enriched with other function annotations, including Enzyme Commission numbers, Gene Ontology terms, catalytic sites, and binding affinities collected from other databases and a manual literature survey. Compared to its predecessor BioLiP, BioLiP2 offers significantly greater coverage of nucleic acid-protein interactions, and interactions involving large complexes that are unavailable in PDB format. BioLiP2 also integrates cutting-edge structural alignment algorithms with state-of-the-art structure prediction techniques, which for the first time enables composite protein structure and sequence-based searching and significantly enhances the usefulness of the database in structure-based function annotations. With these new developments, BioLiP2 will continue to be an important and comprehensive database for docking, virtual screening, and structure-based protein function analyses.
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Xi Zhang
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Computer Science, School of Computing, National University of Singapore, 117417, Singapore
  - Cancer Science Institute of Singapore, National University of Singapore, 117599, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117596, Singapore
3
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A CNN-CBAM-BIGRU model for protein function prediction. Stat Appl Genet Mol Biol 2024; 23:sagmb-2024-0004. [PMID: 38943434 DOI: 10.1515/sagmb-2024-0004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 06/07/2024] [Indexed: 07/01/2024]
Abstract
Understanding a protein's function based solely on its amino acid sequence is a crucial but intricate task in bioinformatics. Traditionally, this challenge has proven difficult. However, recent years have witnessed the rise of deep learning as a powerful tool, achieving significant success in protein function prediction. Their strength lies in their ability to automatically learn informative features from protein sequences, which can then be used to predict the protein's function. This study builds upon these advancements by proposing a novel model: CNN-CBAM+BiGRU. It incorporates a Convolutional Block Attention Module (CBAM) alongside BiGRUs. CBAM acts as a spotlight, guiding the CNN to focus on the most informative parts of the protein data, leading to more accurate feature extraction. BiGRUs, a type of Recurrent Neural Network (RNN), excel at capturing long-range dependencies within the protein sequence, which are essential for accurate function prediction. The proposed model integrates the strengths of both CNN-CBAM and BiGRU. This study's findings, validated through experimentation, showcase the effectiveness of this combined approach. For the human dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +1.0 % for cellular components, +1.1 % for molecular functions, and +0.5 % for biological processes. For the yeast dataset, the suggested method outperforms the CNN-BIGRU+ATT model by +2.4 % for the cellular component, +1.2 % for molecular functions, and +0.6 % for biological processes.
Affiliation(s)
- Lavkush Sharma
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Akshay Deepak
  - Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
- Ashish Ranjan
  - Department of Computer Science and Engineering, C.V. Raman Global University, Bhubaneswar, Odisha, India
4
Wang S, You R, Liu Y, Xiong Y, Zhu S. NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations. Genomics, Proteomics & Bioinformatics 2023; 21:349-358. [PMID: 37075830 PMCID: PMC10626176 DOI: 10.1016/j.gpb.2023.04.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 05/17/2022] [Revised: 02/24/2023] [Accepted: 04/07/2023] [Indexed: 04/21/2023]
Abstract
As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.
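The idea behind LR-ESM, a logistic regression trained on fixed per-protein embeddings, can be illustrated with a small sketch. It is not the NetGO 3.0 implementation: the gradient-descent trainer, the function names, and the toy two-dimensional vectors that stand in for ESM-1b embeddings are all assumptions made for illustration, and only a single GO term is modelled.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Fit a binary logistic regression with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this protein under the current weights.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Toy 2-d vectors stand in for fixed protein embeddings (real ESM-1b
# embeddings are much larger); labels mark whether one GO term applies.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
has_term = [1, 1, 0, 0]
model = train_logistic_regression(embeddings, has_term)
print(predict_proba(model, [1.0, 0.0]) > 0.5)
```

In a multi-label setting one such classifier would typically be trained per GO term, with the embedding model kept frozen.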
Affiliation(s)
- Shaojun Wang
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Ronghui You
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
- Yunjia Liu
  - School of Life Sciences, Fudan University, Shanghai 200433, China
- Yi Xiong
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
- Shanfeng Zhu
  - Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
  - Shanghai Qi Zhi Institute, Shanghai 200030, China
  - MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - Shanghai Key Laboratory of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai 200433, China
  - Zhangjiang Fudan International Innovation Center, Shanghai 200433, China
5
Zheng Y, Young ND, Song J, Chang BC, Gasser RB. An informatic workflow for the enhanced annotation of excretory/secretory proteins of Haemonchus contortus. Comput Struct Biotechnol J 2023; 21:2696-2704. [PMID: 37143762 PMCID: PMC10151223 DOI: 10.1016/j.csbj.2023.03.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2023] [Revised: 03/16/2023] [Accepted: 03/16/2023] [Indexed: 03/19/2023] Open
Abstract
Major advances in genomic and associated technologies have demanded reliable bioinformatic tools and workflows for the annotation of genes and their products via comparative analyses using well-curated reference data sets, accessible in public repositories. However, the accurate in silico annotation of molecules (proteins) encoded in organisms (e.g., multicellular parasites) which are evolutionarily distant from those for which these extensive reference data sets are available, including invertebrate model organisms (e.g., Caenorhabditis elegans - free-living nematode, and Drosophila melanogaster - the vinegar fly) and vertebrate species (e.g., Homo sapiens and Mus musculus), remains a major challenge. Here, we constructed an informatic workflow for the enhanced annotation of biologically-important, excretory/secretory (ES) proteins ("secretome") encoded in the genome of a parasitic roundworm, called Haemonchus contortus (commonly known as the barber's pole worm). We critically evaluated the performance of five distinct methods, refined some of them, and then combined the use of all five methods to comprehensively annotate ES proteins, according to gene ontology, biological pathways and/or metabolic (enzymatic) processes. Then, using optimised parameter settings, we applied this workflow to comprehensively annotate 2591 of all 3353 proteins (77.3%) in the secretome of H. contortus. This result is a substantial improvement (10-25%) over previous annotations using individual, "off-the-shelf" algorithms and default settings, indicating the ready applicability of the present, refined workflow to gene/protein sequence data sets from a wide range of organisms in the Tree-of-Life.
6
Kabir MN, Wong L. EnsembleFam: towards more accurate protein family prediction in the twilight zone. BMC Bioinformatics 2022; 23:90. [PMID: 35287576 PMCID: PMC8919565 DOI: 10.1186/s12859-022-04626-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/11/2021] [Accepted: 03/02/2022] [Indexed: 11/30/2022] Open
Abstract
Background Current protein family modeling methods like profile Hidden Markov Model (pHMM), k-mer based methods, and deep learning-based methods do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions. Results We present a novel method EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides a much more accurate prediction for twilight zone proteins. Conclusions EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam protein functions can be predicted using just sequence information with better accuracy than state-of-the-art methods.
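The abstract states that three classifiers are trained per family and combined into an ensemble prediction, but it does not say how the three outputs are combined. The sketch below assumes a simple majority vote; the threshold functions stand in for trained SVMs, and the feature names and family labels are hypothetical.

```python
def majority_vote(classifiers, features):
    """Return True when more than half of the classifiers accept the protein."""
    votes = sum(1 for clf in classifiers if clf(features))
    return votes * 2 > len(classifiers)


def classify(family_models, features):
    """Return the families whose ensemble accepts the feature vector."""
    return [name for name, clfs in family_models.items()
            if majority_vote(clfs, features)]


# Stand-in classifiers: simple thresholds on a hypothetical feature dict
# replace the trained SVMs, which are not reproduced here.
family_models = {
    "family_A": [
        lambda f: f["similarity"] > 0.5,
        lambda f: f["dissimilarity"] < 0.4,
        lambda f: f["similarity"] - f["dissimilarity"] > 0.0,
    ],
    "family_B": [
        lambda f: f["similarity"] > 0.9,
        lambda f: f["dissimilarity"] < 0.1,
        lambda f: f["similarity"] - f["dissimilarity"] > 0.7,
    ],
}

print(classify(family_models, {"similarity": 0.6, "dissimilarity": 0.45}))
```

With an odd number of classifiers per family a majority vote never ties, which is one practical reason to assume this rule for a three-member ensemble.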
Affiliation(s)
- Mohammad Neamul Kabir
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
- Limsoon Wong
  - Department of Computer Science, National University of Singapore, 13 Computing Drive, 117417, Singapore, Singapore
7
Yan H, Ma G, Teixeira da Silva JA, Qiu L, Xu J, Zhou H, Wei M, Xiong J, Li M, Zhou S, Wu J, Tang X. Genome-Wide Identification and Analysis of NAC Transcription Factor Family in Two Diploid Wild Relatives of Cultivated Sweet Potato Uncovers Potential NAC Genes Related to Drought Tolerance. Front Genet 2021; 12:744220. [PMID: 34899836 PMCID: PMC8653416 DOI: 10.3389/fgene.2021.744220] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/21/2021] [Accepted: 11/05/2021] [Indexed: 11/13/2022] Open
Abstract
NAC (NAM, ATAF1/2, and CUC2) proteins play a pivotal role in modulating plant development and offer protection against biotic and abiotic stresses. Until now, no systematic knowledge of NAC family genes is available for the food security crop, sweet potato. Here, a comprehensive genome-wide survey of NAC domain-containing proteins identified 130 ItbNAC and 144 ItfNAC genes with full length sequences in the genomes of two diploid wild relatives of cultivated sweet potato, Ipomoea triloba and Ipomoea trifida, respectively. These genes were physically mapped onto 15 I. triloba and 16 I. trifida chromosomes, respectively. Phylogenetic analysis divided all 274 NAC proteins into 20 subgroups together with NAC transcription factors (TFs) from Arabidopsis. There were 9 and 15 tandem duplication events in the I. triloba and I. trifida genomes, respectively, indicating an important role of tandem duplication in sweet potato gene expansion and evolution. Moreover, synteny analysis suggested that most NAC genes in the two diploid sweet potato species had a similar origin and evolutionary process. Gene expression patterns based on RNA-Seq data in different tissues and in response to various hormone, biotic or abiotic treatments revealed their possible involvement in organ development and response to various biotic/abiotic stresses. The expression of 36 NAC TFs, which were upregulated in the five tissues and in response to mannitol treatment, was also determined by real-time quantitative polymerase chain reaction (RT-qPCR) in hexaploid cultivated sweet potato exposed to drought stress. Those results largely corroborated the expression profile of mannitol treatment uncovered by the RNA-Seq data. 
Some significantly up-regulated genes related to drought stress, such as ItbNAC110, ItbNAC114, ItfNAC15, ItfNAC28, and especially ItfNAC62, which had a conservative spatial conformation with a closely related paralogous gene, ANAC019, may be potential candidate genes for a sweet potato drought tolerance breeding program. This analysis provides comprehensive and systematic information about NAC family genes in two diploid wild relatives of cultivated sweet potato, and will provide a blueprint for their functional characterization and exploitation to improve the tolerance of sweet potato to abiotic stresses.
Affiliation(s)
- Haifeng Yan
  - Sugarcane Research Institute of Guangxi Academy of Agricultural Sciences, Guangxi Key Laboratory of Sugarcane Genetic Improvement and Key Laboratory of Sugarcane Biotechnology and Genetic Improvement (Guangxi), Ministry of Agriculture, Nanning, China
- Guohua Ma
  - Guangdong Provincial Key Laboratory of Applied Botany, South China Botanical Garden, The Chinese Academy of Sciences, Guangzhou, China
- Lihang Qiu
  - Sugarcane Research Institute of Guangxi Academy of Agricultural Sciences, Guangxi Key Laboratory of Sugarcane Genetic Improvement and Key Laboratory of Sugarcane Biotechnology and Genetic Improvement (Guangxi), Ministry of Agriculture, Nanning, China
- Juan Xu
  - Biological Technology Research Institute, Guangxi Academy of Agricultural Sciences, Nanning, China
- Huiwen Zhou
  - Sugarcane Research Institute of Guangxi Academy of Agricultural Sciences, Guangxi Key Laboratory of Sugarcane Genetic Improvement and Key Laboratory of Sugarcane Biotechnology and Genetic Improvement (Guangxi), Ministry of Agriculture, Nanning, China
- Minzheng Wei
  - Cash Crop Institute of Guangxi Academy of Agricultural Sciences, Nanning, China
- Jun Xiong
  - Cash Crop Institute of Guangxi Academy of Agricultural Sciences, Nanning, China
- Mingzhi Li
  - Biodata Biotechnology Co., Ltd, Hefei, China
- Shaohuan Zhou
  - GuangXi Center for Disease Prevention and Control, Nanning, China
- Jianming Wu
  - Sugarcane Research Institute of Guangxi Academy of Agricultural Sciences, Guangxi Key Laboratory of Sugarcane Genetic Improvement and Key Laboratory of Sugarcane Biotechnology and Genetic Improvement (Guangxi), Ministry of Agriculture, Nanning, China
- Xiuhua Tang
  - Cash Crop Institute of Guangxi Academy of Agricultural Sciences, Nanning, China
Correspondence: Shaohuan Zhou; Jianming Wu; Xiuhua Tang