1
Zheng W, Wuyun Q, Li Y, Liu Q, Zhou X, Peng C, Zhu Y, Freddolino L, Zhang Y. Deep-learning-based single-domain and multidomain protein structure prediction with D-I-TASSER. Nat Biotechnol 2025:10.1038/s41587-025-02654-4. [PMID: 40410405 DOI: 10.1038/s41587-025-02654-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2024] [Accepted: 03/26/2025] [Indexed: 05/25/2025]
Abstract
The dominant success of deep learning techniques on protein structure prediction has challenged the necessity and usefulness of traditional force field-based folding simulations. We proposed a hybrid approach, deep-learning-based iterative threading assembly refinement (D-I-TASSER), which constructs atomic-level protein structural models by integrating multisource deep learning potentials with iterative threading fragment assembly simulations. D-I-TASSER introduces a domain splitting and assembly protocol for the automated modeling of large multidomain protein structures. Benchmark tests and the most recent Critical Assessment of Protein Structure Prediction (CASP15) experiment demonstrate that D-I-TASSER outperforms AlphaFold2 and AlphaFold3 on both single-domain and multidomain proteins. Large-scale folding experiments further show that D-I-TASSER could fold 81% of protein domains and 73% of full-chain sequences in the human proteome, with results highly complementary to the models recently released by AlphaFold2. These results highlight a new avenue to integrate deep learning with classical physics-based folding simulations for high-accuracy protein structure and function predictions that are usable in genome-wide applications.
Affiliation(s)
- Wei Zheng
  - NITFID, School of Statistics and Data Science, AAIS, LPMC and KLMDASR, Nankai University, Tianjin, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Qiqige Wuyun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yang Li
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
- Quancheng Liu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Xiaogen Zhou
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Chunxiang Peng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Yiheng Zhu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Yang Zhang
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
  - Department of Computer Science, School of Computing, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
2
Kawabata T, Kinoshita K. Assessing Structural Classification Using AlphaFold2 Models Through ECOD-Based Comparative Analysis. Proteins 2025. [PMID: 40251890 DOI: 10.1002/prot.26828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 03/27/2025] [Accepted: 03/30/2025] [Indexed: 04/21/2025]
Abstract
Identifying homologous proteins is a fundamental task in structural bioinformatics. While AlphaFold2 has revolutionized protein structure prediction, the extent to which structure comparison of its models can reliably detect homologs remains unclear. In this study, we evaluate the feasibility of homology detection using AlphaFold2-predicted structures through structural comparisons. We took the classification of experimental structures in the ECOD database as the reference standard and obtained the corresponding predicted models from AlphaFoldDB. To ensure blind assessment, we divided the structures into test and train sets according to their release date. Predicted and experimental 3D structures in the test and train sets were compared using 3D structure comparisons (MATRAS, Dali, and Foldseek) and sequence comparisons (BLAST and HHsearch). The results were evaluated based on the homology annotations in the ECOD database. For top-1 accuracy, the performance of structural comparisons was comparable to that of HHsearch. However, for metrics that included all structural pairs, and therefore more remote homology, structural comparisons outperformed HHsearch. No significant differences were observed between comparisons of experimental versus experimental, predicted versus experimental, and predicted versus predicted structures with pLDDT (prediction confidence) values greater than 60. We also demonstrate that predicted structures of proteins whose experimental structures were determined by NMR had lower pLDDT values and contained fewer coils than their experimental counterparts. These findings highlight the potential of AlphaFold2 models in structural classification and suggest that 3D structural searches should be conducted not only against the PDB but also against AlphaFoldDB to identify more potential homologs.
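As an illustration of the top-1 accuracy metric mentioned in this abstract, the following is a minimal sketch, not the authors' evaluation code. It assumes each query has a list of scored hits in which a higher score means greater similarity (as with a Dali Z-score; E-values would need the comparison reversed) and that every structure carries a homology-group label such as an ECOD homology-level identifier. The function name `top1_accuracy` and all identifiers in the example data are hypothetical.

```python
def top1_accuracy(search_results, homology_group):
    """Fraction of queries whose best-scoring hit shares their homology group.

    search_results: dict query_id -> list of (hit_id, score) pairs,
                    where a higher score means a closer match (assumption).
    homology_group: dict structure_id -> group label.
    Queries with no hits, or with no group label, count as incorrect.
    """
    if not search_results:
        return 0.0
    correct = 0
    for query, hits in search_results.items():
        query_group = homology_group.get(query)
        if not hits or query_group is None:
            continue
        # Pick the single best hit for this query.
        best_hit, _ = max(hits, key=lambda hit: hit[1])
        if homology_group.get(best_hit) == query_group:
            correct += 1
    return correct / len(search_results)


# Hypothetical example data: three test queries searched against a train set.
results = {
    "q1": [("t1", 30.0), ("t2", 12.5)],
    "q2": [("t3", 8.0), ("t1", 9.5)],
    "q3": [],
}
groups = {"q1": "A", "q2": "B", "q3": "A", "t1": "A", "t2": "B", "t3": "B"}
accuracy = top1_accuracy(results, groups)
```

Here only `q1` is counted as correct: the best hit for `q2` belongs to a different group, and `q3` has no hits.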
Affiliation(s)
- Takeshi Kawabata
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kengo Kinoshita
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
3
Rangra S, Aggarwal KK. Characterization and kinetics of a cathepsin B-inhibiting protein from Musa acuminata Colla peel. Biochimie 2025; 229:141-150. [PMID: 39461656 DOI: 10.1016/j.biochi.2024.10.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Revised: 10/23/2024] [Accepted: 10/24/2024] [Indexed: 10/29/2024]
Abstract
Hyperexpression of cathepsin B caused by an imbalance of endogenous inhibitors is involved in multiple pathologies, making it a key therapeutic target. Protease inhibitors are effective biomolecules that regulate protease activities and are considered potential therapeutic agents in various diseases. Plant protease inhibitors have been reported as effective complementary alternative drugs. A proteinaceous cathepsin B inhibitor (CBI-BP) has been isolated from Musa acuminata Colla (banana) peel with a molecular weight of 27.9 kDa on SDS-PAGE. The purity of the CBI-BP was confirmed on native-PAGE. The isolated CBI-BP showed an IC50 value of 8.14 μg and a Ki value of 10.59 μg (0.19 μM). Cathepsin B inhibition kinetics indicated that CBI-BP follows a mixed type of cathepsin B inhibition. Its inhibitory activity was also confirmed by reverse zymography. The inhibitor was stable from pH 2.6-10.0 with maximum activity at pH 7.2, was stable at temperatures of 25-100 °C, and exhibited thermostability for 60 min at 70 °C. MALDI/TOF/MS analysis of CBI-BP showed 40% similarity to the GH18 domain-containing protein (A0A4S8JRM9) from Musa balbisiana. Although in silico docking studies showed that binding of A0A4S8JRM9 to cathepsin B affects the binding energy of the substrate to cathepsin B, this protein has not been reported to have any anti-cathepsin B activity. This suggests that the isolated CBI-BP might be a novel protein with anti-cathepsin B activity. Thus, the isolated CBI-BP may be further explored as a possible anti-cathepsin B drug.
Affiliation(s)
- Sabita Rangra
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, New Delhi-110078, India
- Kamal Krishan Aggarwal
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, New Delhi-110078, India
4
Zhang C, Wang Q, Li Y, Teng A, Hu G, Wuyun Q, Zheng W. The Historical Evolution and Significance of Multiple Sequence Alignment in Molecular Structure and Function Prediction. Biomolecules 2024; 14:1531. [PMID: 39766238 PMCID: PMC11673352 DOI: 10.3390/biom14121531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 01/11/2025] Open
Abstract
Multiple sequence alignment (MSA) has evolved into a fundamental tool in the biological sciences, playing a pivotal role in predicting molecular structures and functions. With broad applications in protein and nucleic acid modeling, MSAs continue to underpin advancements across a range of disciplines. MSAs are not only foundational for traditional sequence comparison techniques but also increasingly important in the context of artificial intelligence (AI)-driven advancements. Recent breakthroughs in AI, particularly in protein and nucleic acid structure prediction, rely heavily on the accuracy and efficiency of MSAs to enhance remote homology detection and guide spatial restraints. This review traces the historical evolution of MSA, highlighting its significance in molecular structure and function prediction. We cover the methodologies used for protein monomers, protein complexes, and RNA, while also exploring emerging AI-based alternatives, such as protein language models, as complementary or replacement approaches to traditional MSAs in application tasks. By discussing the strengths, limitations, and applications of these methods, this review aims to provide researchers with valuable insights into MSA's evolving role, equipping them to make informed decisions in structural prediction research.
Affiliation(s)
- Chenyue Zhang
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Qinxin Wang
  - Suzhou New & High-Tech Innovation Service Center, Suzhou 215011, China
- Yiyang Li
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Anqi Teng
  - Bioscience and Biomedical Engineering Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou 511453, China
- Gang Hu
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
- Qiqige Wuyun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Wei Zheng
  - NITFID, School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
5
Thareja P, Chhillar RS, Dalal S, Simaiya S, Lilhore UK, Alroobaea R, Alsafyani M, Baqasah AM, Algarni S. Intelligence model on sequence-based prediction of PPI using AISSO deep concept with hyperparameter tuning process. Sci Rep 2024; 14:21797. [PMID: 39294330 PMCID: PMC11410825 DOI: 10.1038/s41598-024-72558-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2024] [Accepted: 09/09/2024] [Indexed: 09/20/2024] Open
Abstract
Protein-protein interaction (PPI) prediction is vital for interpreting biological activities. Although many kinds of data and machine learning approaches have been employed in PPI prediction, performance still needs to be improved. We therefore adopted an Aquila Influenced Shark Smell (AISSO)-based hybrid prediction technique to construct a sequence-dependent PPI prediction model. The model operates in two stages: feature extraction and prediction. In the feature extraction stage, new features were produced using an improved semantic similarity technique, alongside sequence-based and Gene Ontology features, which may deliver more reliable findings. The extracted features were then passed to the prediction stage, where hybrid neural networks (an Improved Recurrent Neural Network and Deep Belief Networks) were used to predict PPIs with modified score-level fusion. The weights of these neural networks were tuned using the AISSO optimization method. The developed model attained an accuracy of around 88%, which is much better than that of traditional methods, indicating that AISSO-based PPI prediction can provide precise and effective predictions.
Affiliation(s)
- Preeti Thareja
  - DCSA, Maharshi Dayanand University, Rohtak, Haryana, India
- Sandeep Dalal
  - DCSA, Maharshi Dayanand University, Rohtak, Haryana, India
- Sarita Simaiya
  - Arba Minch University, Arba Minch, Ethiopia
  - Department of Computer Science and Engineering, Galgotias University, Greater Noida, UP, India
- Umesh Kumar Lilhore
  - Department of Computer Science and Engineering, Galgotias University, Greater Noida, UP, India
- Roobaea Alroobaea
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P. O. Box 11099, 21944, Taif, Saudi Arabia
- Majed Alsafyani
  - Department of Computer Science, College of Computers and Information Technology, Taif University, P. O. Box 11099, 21944, Taif, Saudi Arabia
- Abdullah M Baqasah
  - Department of Information Technology, College of Computers and Information Technology, Taif University, P. O. Box 11099, Taif, 21944, Saudi Arabia
- Sultan Algarni
  - Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, 21589, Jeddah, Saudi Arabia
6
Zhang C, Freddolino L. FURNA: A database for functional annotations of RNA structures. PLoS Biol 2024; 22:e3002476. [PMID: 39074139 PMCID: PMC11309384 DOI: 10.1371/journal.pbio.3002476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 08/08/2024] [Accepted: 06/24/2024] [Indexed: 07/31/2024] Open
Abstract
Despite the increasing number of 3D RNA structures in the Protein Data Bank, the majority of experimental RNA structures lack thorough functional annotations. As the significance of the functional roles played by noncoding RNAs becomes increasingly apparent, comprehensive annotation of RNA function is becoming a pressing concern. In response to this need, we have developed FURNA (Functions of RNAs), the first database for experimental RNA structures that aims to provide a comprehensive repository of high-quality functional annotations. These include Gene Ontology terms, Enzyme Commission numbers, ligand-binding sites, RNA families, protein-binding motifs, and cross-references to related databases. FURNA is available at https://seq2fun.dcmb.med.umich.edu/furna/ to enable quick discovery of RNA functions from their structures and sequences.
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America
- Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America
7
Zhang C, Freddolino L. A large-scale assessment of sequence database search tools for homology-based protein function prediction. Brief Bioinform 2024; 25:bbae349. [PMID: 39038936 PMCID: PMC11262835 DOI: 10.1093/bib/bbae349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2024] [Revised: 06/03/2024] [Accepted: 07/05/2024] [Indexed: 07/24/2024] Open
Abstract
Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. These searches are also a critical component in most state-of-the-art machine learning and deep learning-based protein function predictors. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND (one of the most popular tools for function prediction), under default search parameters. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. Additionally, we developed a new scoring function to derive GO predictions from homologous hits that consistently outperforms previously proposed scoring functions. These findings enable the improvement of almost all protein function prediction algorithms with a few easily implementable changes in their sequence homolog-based component. This study emphasizes the critical role of search parameter settings in homology-based function transfer and should make an important contribution to the development of future protein function prediction algorithms.
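To make the idea of homology-based function transfer concrete, the following is a minimal sketch of one simple, generic scheme: each GO term is scored by the share of the total bit score contributed by hits annotated with that term. This is an illustrative assumption and is not the new scoring function developed in the paper. The function name `transfer_go_terms`, the hit identifiers, and the bit scores in the example are all hypothetical.

```python
from collections import defaultdict


def transfer_go_terms(hits, annotations):
    """Score GO terms for a query from its homologous hits.

    hits: list of (hit_id, bit_score) pairs from a sequence search.
    annotations: dict hit_id -> set of GO term identifiers.
    Returns a dict GO term -> score in [0, 1], the fraction of the
    total bit score that comes from hits carrying that term.
    """
    total = sum(score for _, score in hits)
    if total <= 0:
        return {}
    weighted = defaultdict(float)
    for hit_id, score in hits:
        # Hits without annotations still count toward the total,
        # which lowers the confidence of every transferred term.
        for term in annotations.get(hit_id, ()):
            weighted[term] += score
    return {term: value / total for term, value in weighted.items()}


# Hypothetical example data: three hits with made-up bit scores.
hits = [("P1", 200.0), ("P2", 100.0), ("P3", 100.0)]
annotations = {
    "P1": {"GO:0003677"},
    "P2": {"GO:0003677", "GO:0005524"},
    "P3": {"GO:0005524"},
}
scores = transfer_go_terms(hits, annotations)
```

In this example `GO:0003677` is supported by 300 of the 400 total bit-score units and `GO:0005524` by 200 of them, so the first term receives the higher score.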
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109, United States
- Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, Department of Biological Chemistry, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI 48109, United States
8
Wang C, Wang Y, Ding P, Li S, Yu X, Yu B. ML-FGAT: Identification of multi-label protein subcellular localization by interpretable graph attention networks and feature-generative adversarial networks. Comput Biol Med 2024; 170:107944. [PMID: 38215617 DOI: 10.1016/j.compbiomed.2024.107944] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Revised: 12/08/2023] [Accepted: 01/01/2024] [Indexed: 01/14/2024]
Abstract
The prediction of multi-label protein subcellular localization (SCL) is a pivotal area in bioinformatics research. Recent advancements in protein structure research have facilitated the application of graph neural networks. This paper introduces a novel approach termed ML-FGAT. The approach begins by extracting node information of proteins from sequence data, physical-chemical properties, evolutionary insights, and structural details. Subsequently, various evolutionary techniques are integrated to consolidate multi-view information. A linear discriminant analysis framework, grounded on entropy weight, is then employed to reduce the dimensionality of the merged features. To enhance the robustness of the model, the training dataset is augmented using feature-generative adversarial networks. For the primary prediction step, graph attention networks are employed to determine multi-label protein SCL, leveraging both node and neighboring information. The interpretability is enhanced by analyzing the attention weight parameters. The training is based on the Gram-positive bacteria dataset, while validation employs newly constructed datasets: human, virus, Gram-negative bacteria, plant, and SARS-CoV-2. Following a leave-one-out cross-validation procedure, ML-FGAT demonstrates noteworthy superiority in this domain.
Affiliation(s)
- Congjing Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Yifei Wang
  - College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao, 266061, China
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
- Pengju Ding
  - College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, 266061, China
- Shan Li
  - School of Mathematics and Statistics, Central South University, Changsha, 410083, China
- Xu Yu
  - Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao, 266580, China
- Bin Yu
  - School of Data Science, Qingdao University of Science and Technology, Qingdao, 266061, China
  - School of Data Science, University of Science and Technology of China, Hefei, 230027, China
9
Fu Y, Gu Z, Luo X, Guo Q, Lai L, Deng M. Learning a generalized graph transformer for protein function prediction in dissimilar sequences. Gigascience 2024; 13:giae093. [PMID: 39657158 PMCID: PMC11734293 DOI: 10.1093/gigascience/giae093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2024] [Revised: 07/04/2024] [Accepted: 10/25/2024] [Indexed: 12/17/2024] Open
Abstract
BACKGROUND: In the face of a growing disparity between high-throughput sequence data and low-throughput experimental studies, the emerging field of deep learning stands as a promising alternative. Generally, many data-driven approaches are capable of facilitating fast and accurate predictions of protein functions. Nevertheless, the inherent statistical nature of deep learning techniques may limit their generalization capabilities when applied to novel nonhomologous proteins that diverge significantly from existing ones. RESULTS: In this work, we propose a novel, generalized approach named Graph Adversarial Learning with Alignment (GALA) for protein function prediction. GALA integrates a graph transformer architecture with an attention pooling module to extract embeddings from both protein sequences and structures, facilitating unified learning of protein representations. Notably, GALA incorporates a domain discriminator conditioned on both learnable representations and predicted probabilities, which undergoes adversarial learning to ensure representation invariance across diverse environments. To optimize the model with abundant label information, we generate label embeddings in the hidden space and explicitly align them with protein representations. Benchmarked on datasets derived from the PDB and Swiss-Prot databases, GALA achieves performance comparable to several state-of-the-art methods. Moreover, GALA demonstrates strong biological interpretability by identifying significant functional residues associated with Gene Ontology terms through class activation mapping. CONCLUSIONS: GALA, which leverages adversarial learning and label embedding alignment to acquire domain-invariant protein representations, exhibits outstanding generalizability in function prediction for proteins from previously unseen sequence space. By incorporating the structures predicted by AlphaFold2, GALA demonstrates significant potential for function annotation in newly discovered sequences. A detailed implementation of GALA is available at https://github.com/fuyw-aisw/GALA.
Affiliation(s)
- Yiwei Fu
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
- Zhonghui Gu
  - Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China
- Xiao Luo
  - Department of Computer Science, University of California, Los Angeles, CA 90024, USA
- Qirui Guo
  - Center for Quantitative Biology, Peking University, Beijing 100871, China
- Luhua Lai
  - Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, China
- Minghua Deng
  - School of Mathematical Sciences, Peking University, Beijing 100871, China
  - Center for Quantitative Biology, Peking University, Beijing 100871, China
  - Center for Statistical Science, Peking University, Beijing 100871, China
10
Gulzar I, Khalil A, Ashfaq UA, Liaquat S, Haque A. Identification of Peptidoglycan Glycosyltransferase FtsI as a Potential Drug Target against Salmonella Enteritidis and Salmonella Typhimurium Serovars Through Subtractive Genomics, Molecular Docking and Molecular Dynamics Simulation Approaches. Curr Pharm Des 2024; 30:2882-2895. [PMID: 39219121 DOI: 10.2174/0113816128332400240827061932] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Revised: 08/13/2024] [Accepted: 08/20/2024] [Indexed: 09/04/2024]
Abstract
INTRODUCTION: Salmonella enterica serovar Enteritidis and Salmonella enterica serovar Typhimurium are among the main causative agents of nontyphoidal Salmonella infections, imposing a significant global health burden. The emergence of antibiotic resistance in these pathogens underscores the need for innovative therapeutic strategies. OBJECTIVE: To identify proteins as potential drug targets against Salmonella Enteritidis and Salmonella Typhimurium serovars using in silico approaches. METHODS: In this study, a subtractive genomics approach was employed to identify potential drug targets. The whole proteomes of Salmonella Enteritidis PT4 and Salmonella Typhimurium (D23580), containing 393 and 478 proteins, respectively, were analyzed through subtractive genomics to identify human homologous proteins of the pathogen and the proteins linked to metabolic pathways shared by the pathogen and its host. RESULTS: Subsequent analysis revealed 19 common essential proteins shared by both strains. To ensure host specificity, we identified 10 non-homologous proteins absent in humans. Among these proteins, peptidoglycan glycosyltransferase FtsI was pivotal, participating in pathogen-specific pathways, which makes it a promising drug target. Molecular docking highlighted two potential compounds, Balsamenonon A and 3,3',4',7-Tetrahydroxyflavylium, with strong binding affinities for FtsI. A 100 ns molecular dynamics simulation with 10,000 frames substantiated the strong binding affinity and demonstrated the enduring stability of the predicted compounds at the docked site. CONCLUSION: The findings of this study provide a foundation for drug development strategies against Salmonella infections, which can contribute to the prospective development of natural and cost-effective drugs targeting Salmonella Enteritidis and Salmonella Typhimurium.
Affiliation(s)
- Imran Gulzar
  - Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad, Pakistan
- Asma Khalil
  - Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad, Pakistan
- Usman Ali Ashfaq
  - Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad, Pakistan
- Sadia Liaquat
  - Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad, Pakistan
- Asma Haque
  - Department of Bioinformatics and Biotechnology, Government College University Faisalabad (GCUF), Faisalabad, Pakistan
11
Zhang C, Freddolino PL. FURNA: a database for function annotations of RNA structures. bioRxiv 2023:2023.12.19.572314. [PMID: 38187637 PMCID: PMC10769261 DOI: 10.1101/2023.12.19.572314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Despite the increasing number of 3D RNA structures in the Protein Data Bank, the majority of experimental RNA structures lack thorough functional annotations. As the significance of the functional roles played by non-coding RNAs becomes increasingly apparent, comprehensive annotation of RNA function is becoming a pressing concern. In response to this need, we have developed FURNA (Functions of RNAs), the first database for experimental RNA structures that aims to provide a comprehensive repository of high-quality functional annotations. These include Gene Ontology terms, Enzyme Commission numbers, ligand binding sites, RNA families, protein binding motifs, and cross-references to related databases. FURNA is available at https://seq2fun.dcmb.med.umich.edu/furna/ to enable quick discovery of RNA functions from their structures and sequences.
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- P. Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
12
Alexander LT, Durairaj J, Kryshtafovych A, Abriata LA, Bayo Y, Bhabha G, Breyton C, Caulton SG, Chen J, Degroux S, Ekiert DC, Erlandsen BS, Freddolino PL, Gilzer D, Greening C, Grimes JM, Grinter R, Gurusaran M, Hartmann MD, Hitchman CJ, Keown JR, Kropp A, Kursula P, Lovering AL, Lemaitre B, Lia A, Liu S, Logotheti M, Lu S, Markússon S, Miller MD, Minasov G, Niemann HH, Opazo F, Phillips GN, Davies OR, Rommelaere S, Rosas‐Lemus M, Roversi P, Satchell K, Smith N, Wilson MA, Wu K, Xia X, Xiao H, Zhang W, Zhou ZH, Fidelis K, Topf M, Moult J, Schwede T. Protein target highlights in CASP15: Analysis of models by structure providers. Proteins 2023; 91:1571-1599. [PMID: 37493353 PMCID: PMC10792529 DOI: 10.1002/prot.26545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 06/15/2023] [Indexed: 07/27/2023]
Abstract
We present an in-depth analysis of selected CASP15 targets, focusing on their biological and functional significance. The authors of the structures identify and discuss key protein features and evaluate how effectively these aspects were captured in the submitted predictions. While the overall ability to predict three-dimensional protein structures continues to impress, reproducing uncommon features not previously observed in experimental structures is still a challenge. Furthermore, instances with conformational flexibility and large multimeric complexes highlight the need for novel scoring strategies to better emphasize biologically relevant structural regions. Looking ahead, closer integration of computational and experimental techniques will play a key role in determining the next challenges to be unraveled in the field of structural molecular biology.
Affiliation(s)
- Leila T. Alexander
- Biozentrum, University of Basel, Basel, Switzerland
- Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | | | - Luciano A. Abriata
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Yusupha Bayo
- Department of Biosciences, University of Milano, Milan, Italy
- IBBA‐CNR Unit of Milano, Institute of Agricultural Biology and Biotechnology, Milan, Italy
| | - Gira Bhabha
- Department of Cell Biology, New York University School of Medicine, New York, New York, USA
| | | | | | - James Chen
- Department of Cell Biology, New York University School of Medicine, New York, New York, USA
| | | | - Damian C. Ekiert
- Department of Cell Biology, New York University School of Medicine, New York, New York, USA
- Department of Microbiology, New York University School of Medicine, New York, New York, USA
| | - Benedikte S. Erlandsen
- Wellcome Centre for Cell Biology, Institute of Cell Biology, University of Edinburgh, Edinburgh, UK
| | - Peter L. Freddolino
- Department of Biological Chemistry, Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
| | - Dominic Gilzer
- Department of Chemistry, Bielefeld University, Bielefeld, Germany
| | - Chris Greening
- Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia
- Securing Antarctica's Environmental Future, Monash University, Clayton, Victoria, Australia
- Centre to Impact AMR, Monash University, Clayton, Victoria, Australia
- ARC Research Hub for Carbon Utilisation and Recycling, Monash University, Clayton, Victoria, Australia
| | - Jonathan M. Grimes
- Division of Structural Biology, Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Rhys Grinter
- Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia
- Centre for Electron Microscopy of Membrane Proteins, Monash Institute of Pharmaceutical Sciences, Parkville, Victoria, Australia
| | - Manickam Gurusaran
- Wellcome Centre for Cell Biology, Institute of Cell Biology, University of Edinburgh, Edinburgh, UK
| | - Marcus D. Hartmann
- Max Planck Institute for Biology, Tübingen, Germany
- Interfaculty Institute of Biochemistry, University of Tübingen, Tübingen, Germany
| | - Charlie J. Hitchman
- Department of Molecular and Cell Biology, Leicester Institute of Structural and Chemical Biology, University of Leicester, Leicester, UK
| | - Jeremy R. Keown
- Division of Structural Biology, Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Ashleigh Kropp
- Department of Microbiology, Biomedicine Discovery Institute, Monash University, Clayton, Victoria, Australia
| | - Petri Kursula
- Department of Biomedicine, University of Bergen, Bergen, Norway
- Faculty of Biochemistry and Molecular Medicine & Biocenter Oulu, University of Oulu, Oulu, Finland
| | | | - Bruno Lemaitre
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Andrea Lia
- Department of Molecular and Cell Biology, Leicester Institute of Structural and Chemical Biology, University of Leicester, Leicester, UK
- ISPA‐CNR Unit of Lecce, Institute of Sciences of Food Production, Lecce, Italy
| | - Shiheng Liu
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, USA
- California NanoSystems Institute, University of California, Los Angeles, California, USA
| | - Maria Logotheti
- Max Planck Institute for Biology, Tübingen, Germany
- Interfaculty Institute of Biochemistry, University of Tübingen, Tübingen, Germany
- Present address: Institute of Biochemistry, University of Greifswald, Greifswald, Germany
| | - Shuze Lu
- Lanzhou University School of Life Sciences, Lanzhou, China
| | | | | | - George Minasov
- Department of Microbiology‐Immunology, Northwestern Feinberg School of Medicine, Chicago, Illinois, USA
| | | | - Felipe Opazo
- NanoTag Biotechnologies GmbH, Göttingen, Germany
- Institute of Neuro‐ and Sensory Physiology, University of Göttingen Medical Center, Göttingen, Germany
- Center for Biostructural Imaging of Neurodegeneration (BIN), University of Göttingen Medical Center, Göttingen, Germany
| | - George N. Phillips
- Department of Biosciences, Rice University, Houston, Texas, USA
- Department of Chemistry, Rice University, Houston, Texas, USA
| | - Owen R. Davies
- Wellcome Centre for Cell Biology, Institute of Cell Biology, University of Edinburgh, Edinburgh, UK
| | - Samuel Rommelaere
- School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
| | - Monica Rosas‐Lemus
- Department of Microbiology‐Immunology, Northwestern Feinberg School of Medicine, Chicago, Illinois, USA
- Present address: Department of Molecular Genetics and Microbiology, University of New Mexico, Albuquerque, New Mexico, USA
| | - Pietro Roversi
- IBBA‐CNR Unit of Milano, Institute of Agricultural Biology and Biotechnology, Milan, Italy
- Department of Molecular and Cell Biology, Leicester Institute of Structural and Chemical Biology, University of Leicester, Leicester, UK
| | - Karla Satchell
- Department of Microbiology‐Immunology, Northwestern Feinberg School of Medicine, Chicago, Illinois, USA
| | - Nathan Smith
- Department of Biochemistry and the Redox Biology Center, University of Nebraska, Lincoln, Nebraska, USA
| | - Mark A. Wilson
- Department of Biochemistry and the Redox Biology Center, University of Nebraska, Lincoln, Nebraska, USA
| | - Kuan‐Lin Wu
- Department of Chemistry, Rice University, Houston, Texas, USA
| | - Xian Xia
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, USA
- California NanoSystems Institute, University of California, Los Angeles, California, USA
| | - Han Xiao
- Department of Biosciences, Rice University, Houston, Texas, USA
- Department of Chemistry, Rice University, Houston, Texas, USA
- Department of Bioengineering, Rice University, Houston, Texas, USA
| | - Wenhua Zhang
- Lanzhou University School of Life Sciences, Lanzhou, China
| | - Z. Hong Zhou
- Department of Microbiology, Immunology, and Molecular Genetics, University of California, Los Angeles, California, USA
- California NanoSystems Institute, University of California, Los Angeles, California, USA
| | | | - Maya Topf
- University Medical Center Hamburg‐Eppendorf (UKE), Hamburg, Germany
- Centre for Structural Systems Biology, Leibniz‐Institut für Virologie (LIV), Hamburg, Germany
| | - John Moult
- Department of Cell Biology and Molecular Genetics, Institute for Bioscience and Biotechnology Research, University of Maryland, Rockville, Maryland, USA
| | - Torsten Schwede
- Biozentrum, University of Basel, Basel, Switzerland
- Computational Structural Biology, SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| |
|
13
|
Zhang C, Freddolino PL. A large-scale assessment of sequence database search tools for homology-based protein function prediction. bioRxiv 2023:2023.11.14.567021. [PMID: 38013998 PMCID: PMC10680702 DOI: 10.1101/2023.11.14.567021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Sequence database searches followed by homology-based function transfer form one of the oldest and most popular approaches for predicting protein functions, such as Gene Ontology (GO) terms. Although sequence search tools are the basis of homology-based protein function prediction, previous studies have scarcely explored how to select the optimal sequence search tools and configure their parameters to achieve the best function prediction. In this paper, we evaluate the effect of using different options from among popular search tools, as well as the impacts of search parameters, on protein function prediction. When predicting GO terms on a large benchmark dataset, we found that BLASTp and MMseqs2 consistently exceed the performance of other tools, including DIAMOND - one of the most popular tools for function prediction - under default search parameters. However, with the correct parameter settings, DIAMOND can perform comparably to BLASTp and MMseqs2 in function prediction. This study emphasizes the critical role of search parameter settings in homology-based function transfer.
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| | - P. Lydia Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
|
14
|
Oliveira GB, Pedrini H, Dias Z. TEMPROT: protein function annotation using transformers embeddings and homology search. BMC Bioinformatics 2023; 24:242. [PMID: 37291492 DOI: 10.1186/s12859-023-05375-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/20/2022] [Accepted: 06/02/2023] [Indexed: 06/10/2023] [Open Access]
Abstract
BACKGROUND Although the development of sequencing technologies has provided a large number of protein sequences, analyzing the function of each one remains difficult because of the effort required by laboratory methods, making computational methods necessary to narrow this gap. As the main source of information available about proteins is their sequences, approaches that can use this information, such as classification based on the patterns of the amino acids and inference based on sequence similarity using alignment tools, can make predictions for a large collection of proteins. The methods available in the literature that use this type of feature can achieve good results; however, they restrict the length of the protein sequences accepted as input to their models. In this work, we present a new method, called TEMPROT, based on the fine-tuning and extraction of embeddings from an available architecture pre-trained on protein sequences. We also describe TEMPROT+, an ensemble of TEMPROT and BLASTp, a local alignment tool that analyzes sequence similarity, which improves the results of our former approach.
RESULTS We evaluated our proposed classifiers against approaches from the literature on our dataset, which was derived from the CAFA3 challenge database. Both TEMPROT and TEMPROT+ achieved competitive results on [Formula: see text], [Formula: see text], AuPRC and IAuPRC metrics on Biological Process (BP), Cellular Component (CC) and Molecular Function (MF) ontologies compared to state-of-the-art models, with the main results equal to 0.581, 0.692 and 0.662 of [Formula: see text] on BP, CC and MF, respectively.
CONCLUSIONS The comparison with the literature showed that our model presented competitive results compared with state-of-the-art approaches considering amino acid sequence pattern recognition and homology analysis. Our model also presented improvements related to the input size that the model can use to train compared to the literature methods.
Affiliation(s)
| | - Helio Pedrini
- Institute of Computing, University of Campinas, Campinas, Brazil
| | - Zanoni Dias
- Institute of Computing, University of Campinas, Campinas, Brazil
| |
|
15
|
Wang S, You R, Liu Y, Xiong Y, Zhu S. NetGO 3.0: Protein Language Model Improves Large-scale Functional Annotations. Genomics, Proteomics & Bioinformatics 2023; 21:349-358. [PMID: 37075830 PMCID: PMC10626176 DOI: 10.1016/j.gpb.2023.04.001] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 05/17/2022] [Revised: 02/24/2023] [Accepted: 04/07/2023] [Indexed: 04/21/2023]
Abstract
As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.
Affiliation(s)
- Shaojun Wang
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
| | - Ronghui You
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China
| | - Yunjia Liu
- School of Life Sciences, Fudan University, Shanghai 200433, China
| | - Yi Xiong
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China; Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
| | - Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence and MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200433, China; Shanghai Qi Zhi Institute, Shanghai 200030, China; MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Shanghai Key Laboratory of Intelligent Information Processing and Shanghai Institute of Artificial Intelligence Algorithm, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Shanghai 200433, China.
| |
|
16
|
Li J, Liu B, Feng X, Zhang M, Ding T, Zhao Y, Wang C. Comparative proteome and volatile metabolome analysis of Aspergillus oryzae 3.042 and Aspergillus sojae 3.495 during koji fermentation. Food Res Int 2023; 165:112527. [PMID: 36869527 DOI: 10.1016/j.foodres.2023.112527] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 09/07/2022] [Revised: 01/09/2023] [Accepted: 01/21/2023] [Indexed: 01/26/2023]
Abstract
Aspergillus oryzae 3.042 and Aspergillus sojae 3.495 are crucial starters for fermented soybean foods because of their abundant secreted enzymes. This study aimed to compare the differences in protein secretion between A. oryzae 3.042 and A. sojae 3.495 during soy sauce koji fermentation, and the effect on volatile metabolites, to better understand the fermentation characteristics of the strains. Label-free proteomics detected 210 differentially expressed proteins (DEPs) enriched in amino acid metabolism and protein folding, sorting and degradation pathways. Subsequently, extracellular enzyme analysis showed that three peptidases, including peptide hydrolase, dipeptidyl aminopeptidase, and peptidase S41, were up-regulated in A. sojae 3.495. Seven carbohydrases, including α-galactosidase, endo-arabinase, β-glucosidase, α-galactosidase, α-glucuronidase, arabinan-endo 1,5-α-l-arabinase, and endo-1,4-β-xylanase were up-regulated in A. oryzae 3.042, contributing to the difference in enzyme activity. Significantly different extracellular enzymes influenced the content of volatile alcohols, aldehydes and esters such as (R, R)-2,3-butanediol, 1-hexanol, hexanal, decanal, ethyl l-lactate and methyl myristate in both strains, which affected the type of aroma of koji. Overall, this study revealed the differences in molecular mechanisms between A. oryzae 3.042 and A. sojae 3.495 under solid-state fermentation, providing a reference for the targeted enhancement of strains.
Affiliation(s)
- Jingyao Li
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China
| | - Bin Liu
- College of Biological and Environmental Engineering, Binzhou University, 391 Huanghe 5th Road, 256603 Binzhou City, Shandong Province, China
| | - Xiaojuan Feng
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China
| | - Mengli Zhang
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China
| | - Tingting Ding
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China
| | - Yue Zhao
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China
| | - Chunling Wang
- "State Key Laboratory of Food Nutrition and Safety", Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Engineering and Biotechnology, Tianjin University of Science and Technology, No.29, 13th Avenue, Tianjin Economy Technological Development Area, Tianjin 300457, People's Republic of China.
| |
|
17
|
Yan TC, Yue ZX, Xu HQ, Liu YH, Hong YF, Chen GX, Tao L, Xie T. A systematic review of state-of-the-art strategies for machine learning-based protein function prediction. Comput Biol Med 2023; 154:106446. [PMID: 36680931 DOI: 10.1016/j.compbiomed.2022.106446] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/22/2022] [Revised: 12/07/2022] [Accepted: 12/19/2022] [Indexed: 12/24/2022]
Abstract
New drug discovery is inseparable from the discovery of drug targets, and the vast majority of the known targets are proteins. At the same time, proteins are essential structural and functional elements of living cells necessary for the maintenance of all forms of life. Therefore, protein functions have become the focus of many pharmacological and biological studies. Traditional experimental techniques are no longer adequate for rapidly growing annotation of protein sequences, and approaches to protein function prediction using computational methods have emerged and flourished. A significant trend has been to use machine learning to achieve this goal. In this review, approaches to protein function prediction based on the sequence, structure, protein-protein interaction (PPI) networks, and fusion of multi-information sources are discussed. The current status of research on protein function prediction using machine learning is considered, and existing challenges and prominent breakthroughs are discussed to provide ideas and methods for future studies.
Affiliation(s)
- Tian-Ci Yan
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Zi-Xuan Yue
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
| | - Tian Xie
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
| |
|
18
|
Singh S, Yadav PK, Singh AK. In-silico structural characterization and phylogenetic analysis of Nucleoside diphosphate kinase: A novel antiapoptotic protein of Porphyromonas gingivalis. J Cell Biochem 2023; 124:545-556. [PMID: 36815439 DOI: 10.1002/jcb.30389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Revised: 02/02/2023] [Accepted: 02/09/2023] [Indexed: 02/24/2023]
Abstract
The Nucleoside diphosphate kinase (NDK) protein of Porphyromonas gingivalis (P. gingivalis) plays a crucial role in immune evasion and inhibition of apoptosis in host cells and has the potential to cause cancer. However, its structure has not yet been characterized. We used an in-silico approach to determine the 3D structure of the P. gingivalis NDK. Furthermore, structural characterization and functional annotation were performed using computational approaches. The 3D structure of NDK was predicted through homology modeling. The structural domains predicted for the model protein belong to the NDK family. Structural alignment of prokaryotic and eukaryotic NDKs with the model protein revealed the conservation of the domain region. Structure-based phylogenetic analysis depicted a significant evolutionary relationship between the model protein and the prokaryotic NDK. Functional annotation of the model confirmed structural homology, exhibiting similar enzymatic functions as NDK, including ATP binding and nucleoside diphosphate kinase activity. Furthermore, molecular dynamics (MD) simulation stabilized the model structure and provided a thermostable protein structure that can be used as a therapeutic target in further studies.
Affiliation(s)
- Suchitra Singh
- Department of Bioinformatics, Central University of South Bihar, Gaya, India
| | - Piyush Kumar Yadav
- Department of Bioinformatics, Central University of South Bihar, Gaya, India
| | - Ajay Kumar Singh
- Department of Bioinformatics, Central University of South Bihar, Gaya, India
| |
|
19
|
Sharma L, Deepak A, Ranjan A, Krishnasamy G. A novel hybrid CNN and BiGRU-Attention based deep learning model for protein function prediction. Stat Appl Genet Mol Biol 2023; 22:sagmb-2022-0057. [PMID: 37658681 DOI: 10.1515/sagmb-2022-0057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2022] [Accepted: 04/20/2023] [Indexed: 09/03/2023]
Abstract
Proteins are the building blocks of all living things. Protein function must be ascertained if the molecular mechanism of life is to be understood. While CNN is good at capturing short-term relationships, GRU and LSTM can capture long-term dependencies. A hybrid approach that combines the complementary benefits of these deep-learning models motivates our work. Protein language models, which use attention networks to gather meaningful data and build representations for proteins, have seen tremendous success in recent years in processing protein sequences. In this paper, we propose a hybrid CNN + BiGRU-Attention based model with protein language model embedding that effectively combines the output of CNN with the output of BiGRU-Attention for predicting protein functions. We evaluated the performance of our proposed hybrid model on human and yeast datasets. Compared with the state-of-the-art model SDN2GO, the proposed hybrid model improves the Fmax value on the human dataset by 1.9 % for cellular component prediction, 3.8 % for molecular function prediction and 0.6 % for biological process prediction, and on the yeast dataset by 2.4 %, 5.2 % and 1.2 % for the same three tasks, respectively.
Affiliation(s)
- Lavkush Sharma
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Akshay Deepak
- Department of Computer Science and Engineering, National Institute of Technology Patna, Patna, Bihar, India
| | - Ashish Ranjan
- Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University (Deemed to be University), Bhubaneswar, Odisha, India
| | | |
|
20
|
Sarker B, Khare N, Devignes MD, Aridhi S. Improving automatic GO annotation with semantic similarity. BMC Bioinformatics 2022; 23:433. [PMID: 36510133 PMCID: PMC9743508 DOI: 10.1186/s12859-022-04958-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 09/18/2022] [Accepted: 09/19/2022] [Indexed: 12/14/2022] [Open Access]
Abstract
BACKGROUND Automatic functional annotation of proteins is an open research problem in bioinformatics. The growing number of protein entries in public databases, for example in UniProtKB, poses challenges in manual functional annotation. Manual annotation requires expert human curators to search and read related research articles, interpret the results, and assign the annotations to the proteins. Thus, it is a time-consuming and expensive process. Therefore, designing computational tools to perform automatic annotation leveraging the high quality manual annotations that already exist in UniProtKB/SwissProt is an important research problem.
RESULTS In this paper, we extend and adapt the GrAPFI (graph-based automatic protein function inference) (Sarker et al. in BMC Bioinform 21, 2020; Sarker et al., in: Proceedings of 7th international conference on complex networks and their applications, Cambridge, 2018) method for automatic annotation of proteins with gene ontology (GO) terms, renaming it GrAPFI-GO. The original GrAPFI method uses label propagation in a similarity graph where proteins are linked through the domains, families, and superfamilies that they share. Here, we also explore various types of similarity measures based on common neighbors in the graph. Moreover, GO terms are arranged in a hierarchical manner according to semantic parent-child relations. Therefore, we propose an efficient pruning and post-processing technique that integrates both semantic similarity and hierarchical relations between the GO terms. We produce experimental results comparing the GrAPFI-GO method with and without considering common neighbors similarity. We also test the performance of GrAPFI-GO and other annotation tools for GO annotation on a benchmark of proteins with and without the proposed pruning and post-processing procedure.
CONCLUSION Our results show that the proposed semantic hierarchical post-processing potentially improves the performance of GrAPFI-GO and of other annotation tools as well. Thus, GrAPFI-GO exposes an original efficient and reusable procedure, to exploit the semantic relations among the GO terms in order to improve the automatic annotation of protein functions.
Affiliation(s)
- Bishnu Sarker
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France; Khulna University of Engineering and Technology, Khulna, Bangladesh; School of Applied Computational Sciences, Meharry Medical College, Nashville, TN, USA
| | - Navya Khare
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France; International Institute of Information Technology, Hyderabad, India
| | | | - Sabeur Aridhi
- CNRS, Inria, LORIA, University of Lorraine, 54000 Nancy, France
| |
|
21
|
Zhu YH, Zhang C, Liu Y, Omenn GS, Freddolino PL, Yu DJ, Zhang Y. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction. Genomics, Proteomics & Bioinformatics 2022; 20:1013-1027. [PMID: 35568117 PMCID: PMC10025770 DOI: 10.1016/j.gpb.2022.03.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/20/2021] [Revised: 03/02/2022] [Accepted: 04/16/2022] [Indexed: 01/13/2023]
Abstract
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
- Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Departments of Internal Medicine and Human Genetics, and School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
- Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
- Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
22
Zhang C, Shine M, Pyle AM, Zhang Y. US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes. Nat Methods 2022; 19:1109-1115. [PMID: 36038728 DOI: 10.1038/s41592-022-01585-1] [Citation(s) in RCA: 164] [Impact Index Per Article: 54.7] [Received: 01/24/2022] [Accepted: 07/19/2022] [Indexed: 11/09/2022]
Abstract
Structure comparison and alignment are of fundamental importance in structural biology studies. We developed the first universal platform, US-align, to uniformly align monomer and complex structures of different macromolecules: proteins, RNAs and DNAs. The pipeline is built on a uniform TM-score objective function coupled with a heuristic alignment searching algorithm. Large-scale benchmarks demonstrated consistent advantages of US-align over state-of-the-art methods in pairwise and multiple structure alignments of different molecules. Detailed analyses showed that the main advantage of US-align lies in the extensive optimization of the unified objective function powered by efficient heuristic search iterations, which substantially improve the accuracy and speed of the structural alignment process. Meanwhile, the universal protocol fusing different molecular and structural types helps facilitate the heterogeneous oligomer structure comparison and template-based protein-protein and protein-RNA/DNA docking.
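As a rough illustration of the TM-score objective mentioned in this abstract, the sketch below scores one fixed alignment from the distances between its aligned residue pairs. The formula and the protein-specific distance scale d0 follow the commonly cited TM-score definition. The function names, the 0.5 Å floor on d0 and the sample inputs are assumptions made for this example and are not taken from the US-align source code, which also uses a different d0 for nucleic acids and searches over many superpositions.

```python
def protein_d0(l_norm: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) for proteins.

    The 0.5 floor for very short chains is an assumption of this sketch.
    """
    if l_norm <= 21:
        return 0.5
    return max(1.24 * (l_norm - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, l_norm: int) -> float:
    """TM-score of one alignment.

    distances: distance (angstroms) between each aligned residue pair
               after superposition; unaligned residues contribute nothing.
    l_norm:    length of the structure used for normalisation.
    """
    d0 = protein_d0(l_norm)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_norm
```

A perfect superposition of all 100 residues of a 100-residue chain, `tm_score([0.0] * 100, 100)`, gives 1.0, and the score falls towards 0 as aligned pairs move apart or as fewer residues are aligned.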
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT, USA; Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Morgan Shine
- Yale Combined Program in the Biological and Biomedical Sciences, Yale University, New Haven, CT, USA
- Anna Marie Pyle
- Howard Hughes Medical Institute, Chevy Chase, MD, USA; Yale Combined Program in the Biological and Biomedical Sciences, Yale University, New Haven, CT, USA; Department of Chemistry, Yale University, New Haven, CT, USA
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
23
Schmidt M, Proctor T, Diao R, Freddolino L. Escherichia coli YigI is a Conserved Gammaproteobacterial Acyl-CoA Thioesterase Permitting Metabolism of Unusual Fatty Acid Substrates. J Bacteriol 2022; 204:e0001422. [PMID: 35876515 PMCID: PMC9380530 DOI: 10.1128/jb.00014-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Accepted: 06/21/2022] [Indexed: 01/27/2023] [Open Access]
Abstract
Thioesterases play a critical role in metabolism, membrane biosynthesis, and overall homeostasis for all domains of life. In the present study, we characterize a putative thioesterase from Escherichia coli MG1655 and define its role as a cytosolic enzyme. Building on structure-guided functional predictions, we show that YigI is a medium- to long-chain acyl-CoA thioesterase that is involved in the degradation of conjugated linoleic acid (CLA) in vivo, showing overlapping specificity with two previously defined E. coli thioesterases, TesB and FadM. We then bioinformatically identify the regulatory relationships that induce YigI expression, which include an acidic environment, high oxygen availability, and exposure to aminoglycosides. Our findings define a role for YigI and shed light on why the E. coli genome harbors numerous thioesterases with closely related functions. IMPORTANCE Previous research has shown that long-chain acyl-CoA thioesterases are needed for E. coli to grow in the presence of carbon sources such as conjugated linoleic acid, but that E. coli must possess at least one such enzyme that had not previously been characterized. Building on structure-guided function predictions, we showed that the poorly annotated protein YigI is indeed the previously unidentified third acyl-CoA thioesterase. We found that the three potentially overlapping acyl-CoA thioesterases appear to be induced by nonoverlapping conditions and use that information as a starting point for identifying the precise reactions catalyzed by each such thioesterase, which is an important prerequisite for their industrial application and for more accurate metabolic modeling of E. coli.
Affiliation(s)
- Michael Schmidt
- Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Theresa Proctor
- Post-baccalaureate Research Education Program (PREP), University of Michigan Medical School, Ann Arbor, Michigan, USA
- Rucheng Diao
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Lydia Freddolino
- Department of Biological Chemistry, University of Michigan Medical School, Ann Arbor, Michigan, USA
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan, USA
24
I-TASSER-MTD: a deep-learning-based platform for multi-domain protein structure and function prediction. Nat Protoc 2022; 17:2326-2353. [PMID: 35931779 DOI: 10.1038/s41596-022-00728-0] [Citation(s) in RCA: 223] [Impact Index Per Article: 74.3] [Received: 02/04/2022] [Accepted: 05/24/2022] [Indexed: 01/17/2023]
Abstract
Most proteins in cells are composed of multiple folding units (or domains) to perform complex functions in a cooperative manner. Relative to the rapid progress in single-domain structure prediction, there are few effective tools available for multi-domain protein structure assembly, mainly due to the complexity of modeling multi-domain proteins, which involves higher degrees of freedom in domain-orientation space and various levels of continuous and discontinuous domain assembly and linker refinement. To meet the challenge and the high demand of the community, we developed I-TASSER-MTD to model the structures and functions of multi-domain proteins through a progressive protocol that combines sequence-based domain parsing, single-domain structure folding, inter-domain structure assembly and structure-based function annotation in a fully automated pipeline. Advanced deep-learning models have been incorporated into each of the steps to enhance both the domain modeling and inter-domain assembly accuracy. The protocol allows for the incorporation of experimental cross-linking data and cryo-electron microscopy density maps to guide the multi-domain structure assembly simulations. I-TASSER-MTD is built on I-TASSER but substantially extends its ability and accuracy in modeling large multi-domain protein structures and provides meaningful functional insights for the targets at both the domain- and full-chain levels from the amino acid sequence alone.
25
Wang S, Atkinson GRS, Hayes WB. SANA: cross-species prediction of Gene Ontology GO annotations via topological network alignment. NPJ Syst Biol Appl 2022; 8:25. [PMID: 35859153 PMCID: PMC9300714 DOI: 10.1038/s41540-022-00232-x] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/13/2019] [Accepted: 05/20/2022] [Indexed: 12/31/2022] [Open Access]
Abstract
Topological network alignment aims to align two networks node-wise in order to maximize the observed common connection (edge) topology between them. The topological alignment of two protein-protein interaction (PPI) networks should thus expose protein pairs with similar interaction partners allowing, for example, the prediction of common Gene Ontology (GO) terms. Unfortunately, no network alignment algorithm based on topology alone has been able to achieve this aim, though those that include sequence similarity have seen some success. We argue that this failure of topology alone is due to the sparsity and incompleteness of the PPI network data of almost all species, which provides the network topology with a small signal-to-noise ratio that is effectively swamped when sequence information is added to the mix. Here we show that the weak signal can be detected using multiple stochastic samples of "good" topological network alignments, which allows us to observe regions of the two networks that are robustly aligned across multiple samples. The resulting network alignment frequency (NAF) strongly correlates with GO-based Resnik semantic similarity and enables the first successful cross-species predictions of GO terms based on topology-only network alignments. Our best predictions have an AUPR of about 0.4, which is competitive with state-of-the-art algorithms, even when there is no observable sequence similarity and no known homology relationship. While our results provide only a "proof of concept" on existing network data, we hypothesize that predicting GO terms from topology-only network alignments will become increasingly practical as the volume and quality of PPI network data increase.
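The idea of measuring how robustly a node pair is aligned across many sampled alignments can be sketched as follows. This is one plausible reading of the network alignment frequency described in this abstract, not the authors' implementation: the function name, the dictionary representation of an alignment and the toy node labels are all assumptions made for illustration.

```python
from collections import Counter


def alignment_frequency(alignments):
    """Fraction of sampled alignments in which each node pair is aligned.

    alignments: list of dicts, each mapping a node of network 1 to the
                node of network 2 it is aligned with in that sample.
    Returns a dict keyed by (node1, node2) with values in (0, 1].
    """
    counts = Counter()
    for alignment in alignments:
        counts.update(alignment.items())  # one count per aligned pair
    n_samples = len(alignments)
    return {pair: count / n_samples for pair, count in counts.items()}


# Four hypothetical sampled alignments of a two-node network.
samples = [
    {"a": "x", "b": "y"},
    {"a": "x", "b": "z"},
    {"a": "x", "b": "y"},
    {"a": "w", "b": "y"},
]
naf = alignment_frequency(samples)
```

Here the pairs `("a", "x")` and `("b", "y")` each appear in three of the four samples and so receive a frequency of 0.75, while pairs seen only once receive 0.25. Pairs with a high frequency are the ones a downstream step would treat as robustly aligned.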
Affiliation(s)
- Siyue Wang
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
- Giles R S Atkinson
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
- Wayne B Hayes
- Department of Computer Science, University of California, Irvine, CA, 92697-3435, USA
26
Tiley AMM, Lawless C, Pilo P, Karki SJ, Lu J, Long Z, Gibriel H, Bailey AM, Feechan A. The Zymoseptoria tritici white collar-1 gene, ZtWco-1, is required for development and virulence on wheat. Fungal Genet Biol 2022; 161:103715. [PMID: 35709910 DOI: 10.1016/j.fgb.2022.103715] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/14/2021] [Revised: 06/02/2022] [Accepted: 06/06/2022] [Indexed: 11/04/2022]
Abstract
The fungus Zymoseptoria tritici causes Septoria Tritici Blotch (STB), which is one of the most devastating diseases of wheat in Europe. There are currently no fully durable methods of control against Z. tritici, so novel strategies are urgently required. One of the ways in which fungi are able to respond to their surrounding environment is through the use of photoreceptor proteins which detect light signals. Although previous evidence suggests that Z. tritici can detect light, no photoreceptor genes have been characterised in this pathogen. This study characterises ZtWco-1, a predicted photoreceptor gene in Z. tritici. The ZtWco-1 gene is a putative homolog of the blue light photoreceptor from Neurospora crassa, wc-1. Z. tritici mutants with deletions in ZtWco-1 have defects in hyphal branching, melanisation and virulence on wheat. In addition, we identify the putative circadian clock gene ZtFrq in Z. tritici. This study provides evidence for the genetic regulation of light detection in Z. tritici and it opens avenues for future research into whether this pathogen has a circadian clock.
Affiliation(s)
- Anna M M Tiley
- Agri-Food Biosciences Institute, 18a Newforge Ln, Belfast BT9 5PX, United Kingdom; School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
- Colleen Lawless
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland; School of Biology and Environmental Science, University College Dublin, Dublin 4, Republic of Ireland
- Paola Pilo
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
- Sujit J Karki
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
- Jijun Lu
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
- Zhuowei Long
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
- Hesham Gibriel
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland; Royal College of Surgeons in Ireland, Dublin 2, Ireland
- Andy M Bailey
- School of Biological Sciences, University of Bristol, 24 Tyndall Avenue, Bristol BS8 1TQ, United Kingdom
- Angela Feechan
- School of Agriculture and Food Science, University College Dublin, Dublin 4, Republic of Ireland
27
Bennett APS, de la Torre-Escudero E, Dermott SSE, Threadgold LT, Hanna REB, Robinson MW. Fasciola hepatica Gastrodermal Cells Selectively Release Extracellular Vesicles via a Novel Atypical Secretory Mechanism. Int J Mol Sci 2022; 23:ijms23105525. [PMID: 35628335 PMCID: PMC9143473 DOI: 10.3390/ijms23105525] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 04/11/2022] [Revised: 05/09/2022] [Accepted: 05/12/2022] [Indexed: 02/01/2023] [Open Access]
Abstract
The liver fluke, Fasciola hepatica, is an obligate blood-feeder, and the gastrodermal cells of the parasite form the interface with the host’s blood. Despite their importance in the host–parasite interaction, in-depth proteomic analysis of the gastrodermal cells is lacking. Here, we used laser microdissection of F. hepatica tissue sections to generate unique and biologically exclusive tissue fractions of the gastrodermal cells and tegument for analysis by mass spectrometry. A total of 226 gastrodermal cell proteins were identified, with proteases that degrade haemoglobin being the most abundant. Other detected proteins included proton pumps and anticoagulants, which maintain a microenvironment that facilitates digestion. By comparing the gastrodermal cell proteome and the 102 proteins identified in the laser microdissected tegument with previously published tegument proteomic datasets, we showed that one-quarter of proteins (removed by freeze–thaw extraction) or one-third of proteins (removed by detergent extraction) previously identified as tegumental were instead derived from the gastrodermal cells. Comparative analysis of the laser microdissected gastrodermal cells, tegument, and F. hepatica secretome revealed that the gastrodermal cells are the principal source of secreted proteins and showed that both the gastrodermal cells and the tegument are likely to release subpopulations of extracellular vesicles (EVs). Microscopical examination of the gut caeca from flukes fixed immediately after their removal from the host bile ducts showed that selected gastrodermal cells underwent a progressive thinning of the apical plasma membrane, which ruptured to release secretory vesicles en masse into the gut lumen. Our findings suggest that gut-derived EVs are released via a novel atypical secretory route and highlight the importance of the gastrodermal cells in nutrient acquisition and possible immunomodulation by the parasite.
Affiliation(s)
- Adam P. S. Bennett
- School of Biological Sciences, The Queen’s University of Belfast, Belfast BT9 5DL, UK
- Eduardo de la Torre-Escudero
- School of Biological Sciences, The Queen’s University of Belfast, Belfast BT9 5DL, UK
- Susan S. E. Dermott
- School of Biological Sciences, The Queen’s University of Belfast, Belfast BT9 5DL, UK
- Lawrence T. Threadgold
- School of Biological Sciences, The Queen’s University of Belfast, Belfast BT9 5DL, UK
- Robert E. B. Hanna
- Veterinary Sciences Division, Agri-Food and Biosciences Institute (AFBI), Stormont, Belfast BT4 3SD, UK
- Mark W. Robinson
- School of Biological Sciences, The Queen’s University of Belfast, Belfast BT9 5DL, UK
- Correspondence: Tel.: +44-(0)28-9097-2120
28
Xia W, Zheng L, Fang J, Li F, Zhou Y, Zeng Z, Zhang B, Li Z, Li H, Zhu F. PFmulDL: a novel strategy enabling multi-class and multi-label protein function annotation by integrating diverse deep learning methods. Comput Biol Med 2022; 145:105465. [PMID: 35366467 DOI: 10.1016/j.compbiomed.2022.105465] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 03/04/2022] [Revised: 03/22/2022] [Accepted: 03/25/2022] [Indexed: 02/06/2023]
Abstract
Bioinformatic annotation of protein function is essential but extremely sophisticated, which calls for extensive efforts to develop effective prediction methods. However, the existing methods tend to amplify the representativeness of the families with a large number of proteins by misclassifying the proteins in the families with a small number of proteins. That is to say, the ability of the existing methods to annotate proteins in the 'rare classes' remains limited. Herein, a new protein function annotation strategy, PFmulDL, integrating multiple deep learning methods, was thus constructed. First, the recurrent neural network was integrated, for the first time, with the convolutional neural network to facilitate the function annotation. Second, a transfer learning method was introduced to the model construction to further improve the prediction performance. Third, based on the latest data of Gene Ontology, the newly constructed model could annotate the largest number of protein families compared with the existing methods. Finally, this newly constructed model was found capable of significantly elevating the prediction performance for the 'rare classes' without sacrificing that for the 'major classes'. All in all, due to the emerging requirements on improving the prediction performance for the proteins in 'rare classes', this new strategy would become an essential complement to the existing methods for protein function prediction. All the models and source codes are freely available and open to all users at: https://github.com/idrblab/PFmulDL.
Affiliation(s)
- Weiqi Xia
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Lingyan Zheng
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Jiebin Fang
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Fengcheng Li
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Ying Zhou
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China
- Zhenyu Zeng
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Bing Zhang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Zhaorong Li
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
- Honglin Li
- School of Pharmacy, East China University of Science and Technology, Shanghai, 200237, China
- Feng Zhu
- College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, 310058, China; Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, 330110, China
29
Yang P, Ning K. How much metagenome data is needed for protein structure prediction: The advantages of targeted approach from the ecological and evolutionary perspectives. IMETA 2022; 1:e9. [PMID: 38867727 PMCID: PMC10989767 DOI: 10.1002/imt2.9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Revised: 12/23/2021] [Accepted: 01/04/2022] [Indexed: 06/14/2024]
Abstract
It has been proven that three-dimensional protein structures could be modeled by supplementing homologous sequences with metagenome sequences. Even though a large volume of metagenome data is utilized for such purposes, a significant proportion of proteins remain unsolved. In this review, we focus on identifying ecological and evolutionary patterns in metagenome data, decoding the complicated relationships of these patterns with protein structures, and investigating how these patterns can be effectively used to improve protein structure prediction. First, we proposed the metagenome utilization efficiency and marginal effect model to quantify the divergent distribution of homologous sequences for the protein family. Second, we proposed that the targeted approach effectively identifies homologous sequences from specified biomes compared with the untargeted approach's blind search. Finally, we determined the lower bound for metagenome data required for predicting all the protein structures in the Pfam database and showed that the present metagenome data is insufficient for this purpose. In summary, we discovered ecological and evolutionary patterns in the metagenome data that may be used to predict protein structures effectively. The targeted approach is promising in terms of effectively extracting homologous sequences and predicting protein structures using these patterns.
Affiliation(s)
- Pengshuo Yang
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Kang Ning
- Key Laboratory of Molecular Biophysics of the Ministry of Education, Hubei Key Laboratory of Bioinformatics and Molecular-Imaging, Department of Bioinformatics and Systems Biology Center of AI Biology, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan, Hubei, China
30
Liu Y, Jin S, Gao H, Wang X, Wang C, Zhou W, Yu B. Predicting the multi-label protein subcellular localization through multi-information fusion and MLSI dimensionality reduction based on MLFE classifier. Bioinformatics 2021; 38:1223-1230. [PMID: 34864897 PMCID: PMC8690230 DOI: 10.1093/bioinformatics/btab811] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/19/2021] [Revised: 11/17/2021] [Accepted: 11/30/2021] [Indexed: 01/05/2023] [Open Access]
Abstract
MOTIVATION Multi-label (ML) protein subcellular localization (SCL) is an indispensable way to study protein function. It can locate a certain protein (such as the human transmembrane protein that promotes the invasion of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)) or expression product at a specific location in a cell, which can provide a reference for clinical treatment of diseases such as coronavirus disease 2019 (COVID-19). RESULTS The article proposes a novel method named ML-locMLFE. First, six feature extraction methods are adopted to obtain effective protein information. These methods include pseudo amino acid composition, encoding based on grouped weight, gene ontology, multi-scale continuous and discontinuous, residue probing transformation and evolutionary distance transformation. Next, we utilize the ML information latent semantic index method to avoid the interference of redundant information. Finally, ML learning with feature-induced labeling information enrichment is adopted to predict the ML protein SCL. The Gram-positive bacteria dataset is chosen as the training set, while the Gram-negative bacteria dataset, virus dataset, newPlant dataset and SARS-CoV-2 dataset serve as the test sets. The overall actual accuracies on the first four datasets are 99.23%, 93.82%, 93.24% and 96.72% by leave-one-out cross validation. It is worth mentioning that the overall actual accuracy of our predictor on the SARS-CoV-2 dataset is 72.73%. The results indicate that the ML-locMLFE method has obvious advantages in predicting the SCL of ML proteins, which provides new ideas for further research on the SCL of ML proteins. AVAILABILITY AND IMPLEMENTATION The source codes and datasets are publicly available at https://github.com/QUST-AIBBDRC/ML-locMLFE/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yushuang Liu
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Shuping Jin
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Hongli Gao
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Xue Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Congjing Wang
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Weifeng Zhou
- College of Mathematics and Physics, Qingdao University of Science and Technology, Qingdao 266061, China; Artificial Intelligence and Biomedical Big Data Research Center, Qingdao University of Science and Technology, Qingdao 266061, China
- Bin Yu
- School of Data Science, Qingdao University of Science and Technology, Qingdao 266061, China; College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
31
A Deep Learning Approach with Data Augmentation to Predict Novel Spider Neurotoxic Peptides. Int J Mol Sci 2021; 22:ijms222212291. [PMID: 34830173 PMCID: PMC8619404 DOI: 10.3390/ijms222212291] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 10/20/2021] [Revised: 11/09/2021] [Accepted: 11/11/2021] [Indexed: 11/17/2022] [Open Access]
Abstract
As major components of spider venoms, neurotoxic peptides exhibit structural diversity and target specificity, and have great pharmaceutical potential. Deep learning may be an alternative to the laborious and time-consuming methods for identifying these peptides. However, the major hurdle in developing a deep learning model is the limited data on neurotoxic peptides. Here, we present a peptide data augmentation method that improves the recognition of neurotoxic peptides via a convolutional neural network model. The neurotoxic peptides were augmented with the known neurotoxic peptides from the UniProt database, and the models were trained using a training set with or without the generated sequences to verify the augmented data. The model trained with the augmented dataset outperformed the one trained with the unaugmented dataset, achieving an accuracy of 0.9953, precision of 0.9922, recall of 0.9984, and F1 score of 0.9953 on the simulation dataset. From the set of all RNA transcripts of the spider Callobius koreanus, we discovered neurotoxic peptides via the model, resulting in 275 putative peptides, of which 252 were novel sequences and only 23 showed homology with known peptides by Basic Local Alignment Search Tool. Among these 275 peptides, four were selected and shown to have neuromodulatory effects on the human neuroblastoma cell line SH-SY5Y. The augmentation method presented here may be applied to the identification of other functional peptides from biological resources with insufficient data.
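The reported F1 score can be checked against the reported precision and recall using the standard definition of F1 as their harmonic mean. The short sketch below does only that arithmetic. The function name is an assumption of this example and the code is unrelated to the authors' model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
print(round(f1_score(0.9922, 0.9984), 4))  # prints 0.9953
```

The result agrees with the F1 score of 0.9953 quoted in the abstract to four decimal places.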
32
Ramsey J, McIntosh B, Renfro D, Aleksander SA, LaBonte S, Ross C, Zweifel AE, Liles N, Farrar S, Gill JJ, Erill I, Ades S, Berardini TZ, Bennett JA, Brady S, Britton R, Carbon S, Caruso SM, Clements D, Dalia R, Defelice M, Doyle EL, Friedberg I, Gurney SMR, Hughes L, Johnson A, Kowalski JM, Li D, Lovering RC, Mans TL, McCarthy F, Moore SD, Murphy R, Paustian TD, Perdue S, Peterson CN, Prüß BM, Saha MS, Sheehy RR, Tansey JT, Temple L, Thorman AW, Trevino S, Vollmer AC, Walbot V, Willey J, Siegele DA, Hu JC. Crowdsourcing biocuration: The Community Assessment of Community Annotation with Ontologies (CACAO). PLoS Comput Biol 2021; 17:e1009463. [PMID: 34710081 PMCID: PMC8553046 DOI: 10.1371/journal.pcbi.1009463] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022] [Open Access]
Abstract
Experimental data about gene functions curated from the primary literature have enormous value for research scientists in understanding biology. Using the Gene Ontology (GO), manual curation by experts has provided an important resource for studying gene function, especially within model organisms. Unprecedented expansion of the scientific literature and validation of the predicted proteins have increased both data value and the challenges of keeping pace. Capturing literature-based functional annotations is limited by the ability of biocurators to handle the massive and rapidly growing scientific literature. Within the community-oriented wiki framework for GO annotation called the Gene Ontology Normal Usage Tracking System (GONUTS), we describe an approach to expand biocuration through crowdsourcing with undergraduates. This multiplies the number of high-quality annotations in international databases, enriches our coverage of the literature on normal gene function, and pushes the field in new directions. From an intercollegiate competition judged by experienced biocurators, Community Assessment of Community Annotation with Ontologies (CACAO), we have contributed nearly 5,000 literature-based annotations. Many of those annotations are to organisms not currently well-represented within GO. Over a 10-year history, our community contributors have spurred changes to the ontology not traditionally covered by professional biocurators. The CACAO principle of relying on community members to participate in and shape the future of biocuration in GO is a powerful and scalable model used to promote the scientific enterprise. It also provides undergraduate students with a unique and enriching introduction to critical reading of primary literature and acquisition of marketable skills.
Affiliation(s)
- Jolene Ramsey
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Brenley McIntosh
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Daniel Renfro
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Suzanne A. Aleksander
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Sandra LaBonte
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Curtis Ross
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
| | - Adrienne E. Zweifel
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Nathan Liles
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Shabnam Farrar
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
| | - Jason J. Gill
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
- Department of Animal Science, Texas A&M University, College Station, Texas, United States of America
| | - Ivan Erill
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
- Department of Computer Science and Electrical Engineering, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Sarah Ades
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Tanya Z. Berardini
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Jennifer A. Bennett
- Department of Biology and Earth Science, Otterbein University, Westerville, Ohio, United States of America
| | - Siobhan Brady
- Department of Plant Biology and Genome Center, University of California Davis, Davis, California, United States of America
| | - Robert Britton
- Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, United States of America
| | - Seth Carbon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
| | - Steven M. Caruso
- Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, Maryland, United States of America
| | - Dave Clements
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Ritu Dalia
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Meredith Defelice
- Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, United States of America
| | - Erin L. Doyle
- Biology Department, Doane University, Crete, Nebraska, United States of America
| | - Iddo Friedberg
- Department of Microbiology, Miami University, Oxford, Ohio, United States of America
| | - Susan M. R. Gurney
- Department of Biology, Drexel University, Philadelphia, Pennsylvania, United States of America
| | - Lee Hughes
- Department of Biological Sciences, University of North Texas, Denton, Texas, United States of America
| | - Allison Johnson
- Center for the Study of Biological Complexity, Virginia Commonwealth University, Richmond, Virginia, United States of America
| | - Jason M. Kowalski
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Donghui Li
- The Arabidopsis Information Resource, Phoenix Bioinformatics, Newark, California, United States of America
| | - Ruth C. Lovering
- Institute of Cardiovascular Science, University College London, London, United Kingdom
| | - Tamara L. Mans
- Department of Biochemistry and Biotechnology, Minnesota State University Moorhead, Brooklyn Park, Minnesota, United States of America
| | - Fiona McCarthy
- Department of Basic Science, College of Veterinary Medicine, Mississippi State University, Starkville, Mississippi, United States of America
| | - Sean D. Moore
- Burnett School of Biomedical Sciences, University of Central Florida, Orlando, Florida, United States of America
| | - Rebecca Murphy
- Department of Biology, Centenary College of Louisiana, Shreveport, Louisiana, United States of America
| | - Timothy D. Paustian
- Department of Bacteriology, University of Wisconsin, Madison, Wisconsin, United States of America
| | - Sarah Perdue
- Biological Sciences Department, University of Wisconsin-Parkside, Kenosha, Wisconsin, United States of America
| | - Celeste N. Peterson
- Biology Department, Suffolk University, Boston, Massachusetts, United States of America
| | - Birgit M. Prüß
- Microbiological Sciences Department, North Dakota State University, Fargo, North Dakota, United States of America
| | - Margaret S. Saha
- Department of Biology, College of William & Mary, Williamsburg, Virginia, United States of America
| | - Robert R. Sheehy
- Biology Department, Radford University, Radford, Virginia, United States of America
| | - John T. Tansey
- Department of Biochemistry and Molecular Biology, Otterbein University, Westerville, Ohio, United States of America
| | - Louise Temple
- School of Integrated Sciences, James Madison University, Harrisonburg, Virginia, United States of America
| | - Alexander William Thorman
- Department of Environmental and Public Health Sciences, University of Cincinnati, Cincinnati, Ohio, United States of America
| | - Saul Trevino
- Department of Chemistry, Math, and Physics, Houston Baptist University, Houston, Texas, United States of America
| | - Amy Cheng Vollmer
- Department of Biology, Swarthmore College, Swarthmore, Pennsylvania, United States of America
| | - Virginia Walbot
- Department of Biology, Stanford University, Stanford, California, United States of America
| | - Joanne Willey
- Department of Science Education, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States of America
| | - Deborah A. Siegele
- Department of Biology, Texas A&M University, College Station, Texas, United States of America
| | - James C. Hu
- Department of Biochemistry & Biophysics, Texas A&M University, College Station, Texas, United States of America
- Center for Phage Technology, Texas A&M University, College Station, Texas, United States of America
|
33
|
Gong W, Guerler A, Zhang C, Warner E, Li C, Zhang Y. Integrating Multimeric Threading With High-throughput Experiments for Structural Interactome of Escherichia coli. J Mol Biol 2021; 433:166944. [PMID: 33741411 DOI: 10.1016/j.jmb.2021.166944] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/20/2020] [Revised: 03/06/2021] [Accepted: 03/09/2021] [Indexed: 10/21/2022]
Abstract
Genome-wide protein-protein interaction (PPI) determination remains a significant unsolved problem in structural biology. The difficulty is twofold since high-throughput experiments (HTEs) often have a relatively high false-positive rate in assigning PPIs, and PPI quaternary structures are more difficult to solve than tertiary structures using traditional structural biology techniques. We proposed a uniform pipeline, Threpp, to address both problems. Starting from a pair of monomer sequences, Threpp first threads both sequences through a complex structure library, where the alignment score is combined with HTE data using a naïve Bayesian classifier model to predict the likelihood of two chains to interact with each other. Next, quaternary complex structures of the identified PPIs are constructed by reassembling monomeric alignments with dimeric threading frameworks through interface-specific structural alignments. The pipeline was applied to the Escherichia coli genome and created 35,125 confident PPIs, which is 4.5-fold higher than HTE alone. Graphic analyses of the PPI networks show a scale-free cluster size distribution, consistent with previous studies, which was found critical to the robustness of genome evolution and the centrality of functionally important proteins that are essential to E. coli survival. Furthermore, complex structure models were constructed for all predicted E. coli PPIs based on the quaternary threading alignments, where 6,771 of them were found to have a high confidence score that corresponds to the correct fold of the complexes with a TM-score >0.5, and 39 showed a close consistency with the later released experimental structures with an average TM-score = 0.73. These results demonstrated the significant usefulness of threading-based homologous modeling in both genome-wide PPI network detection and complex structural construction.
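The naïve Bayesian combination of threading and HTE evidence mentioned in this abstract can be illustrated with a minimal sketch. It is not the Threpp implementation: the function name, the prior, and the likelihood ratios below are hypothetical values chosen only to show how independent pieces of evidence multiply into posterior odds.

```python
from math import prod

def naive_bayes_posterior(prior, likelihood_ratios):
    """Return P(interaction | evidence) under a naive Bayes model.

    prior: assumed prior probability that two chains interact.
    likelihood_ratios: one ratio P(evidence | interact) / P(evidence | not interact)
    per evidence source, treated as conditionally independent.
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)

# Hypothetical numbers, for illustration only.
prior = 0.01          # assumed fraction of protein pairs that interact
lr_threading = 20.0   # assumed ratio for a strong complex-threading alignment score
lr_hte = 5.0          # assumed ratio for a positive high-throughput experiment

posterior = naive_bayes_posterior(prior, [lr_threading, lr_hte])
print(f"posterior probability of interaction: {posterior:.3f}")
```

With no evidence the function returns the prior unchanged, which is a quick sanity check on the odds arithmetic.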
Affiliation(s)
- Weikang Gong
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing 100124, China
| | - Aysam Guerler
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Elisa Warner
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chunhua Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Faculty of Environmental and Life Sciences, Beijing University of Technology, Beijing 100124, China.
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA.
|
34
|
Seyyedsalehi SF, Soleymani M, Rabiee HR, Mofrad MRK. PFP-WGAN: Protein function prediction by discovering Gene Ontology term correlations with generative adversarial networks. PLoS One 2021; 16:e0244430. [PMID: 33630862 PMCID: PMC7906332 DOI: 10.1371/journal.pone.0244430] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/25/2020] [Accepted: 12/09/2020] [Indexed: 12/12/2022]
Abstract
Understanding the functionality of proteins has emerged as a critical problem in recent years due to significant roles of these macro-molecules in biological mechanisms. However, in-laboratory techniques for protein function prediction are not as efficient as methods developed and processed for protein sequencing. While more than 70 million protein sequences are available today, the functionality of only around one percent of them is known. These facts have encouraged researchers to develop computational methods to infer protein functionalities from their sequences. Gene Ontology is the most well-known database for protein functions which has a hierarchical structure, where deeper terms are more determinative and specific. However, the lack of experimentally approved annotations for these specific terms limits the performance of computational methods applied on them. In this work, we propose a method to improve protein function prediction using their sequences by deeply extracting relationships between Gene Ontology terms. To this end, we construct a conditional generative adversarial network which helps to effectively discover and incorporate term correlations in the annotation process. In addition to the baseline algorithms, we compare our method with two recently proposed deep techniques that attempt to utilize Gene Ontology term correlations. Our results confirm the superiority of the proposed method compared to the previous works. Moreover, we demonstrate how our model can effectively help to assign more specific terms to sequences.
Affiliation(s)
- Seyyede Fatemeh Seyyedsalehi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
| | - Mahdieh Soleymani
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Hamid R. Rabiee
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
| | - Mohammad R. K. Mofrad
- Department of Mechanical Engineering, University of California Berkeley, Berkeley, California, United States of America
|
35
|
Zohra Smaili F, Tian S, Roy A, Alazmi M, Arold ST, Mukherjee S, Scott Hefty P, Chen W, Gao X. QAUST: Protein Function Prediction Using Structure Similarity, Protein Interaction, and Functional Motifs. Genomics Proteomics Bioinformatics 2021; 19:998-1011. [PMID: 33631427 PMCID: PMC9403031 DOI: 10.1016/j.gpb.2021.02.001] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/11/2018] [Revised: 04/03/2019] [Accepted: 05/17/2019] [Indexed: 11/25/2022]
Abstract
The number of available protein sequences in public databases is increasing exponentially. However, a significant percentage of these sequences lack functional annotation, which is essential for the understanding of how biological systems operate. Here, we propose a novel method, Quantitative Annotation of Unknown STructure (QAUST), to infer protein functions, specifically Gene Ontology (GO) terms and Enzyme Commission (EC) numbers. QAUST uses three sources of information: structure information encoded by global and local structure similarity search, biological network information inferred by protein–protein interaction data, and sequence information extracted from functionally discriminative sequence motifs. These three pieces of information are combined by consensus averaging to make the final prediction. Our approach has been tested on 500 protein targets from the Critical Assessment of Functional Annotation (CAFA) benchmark set. The results show that our method provides accurate functional annotation and outperforms other prediction methods based on sequence similarity search or threading. We further demonstrate that a previously unknown function of human tripartite motif-containing 22 (TRIM22) protein predicted by QAUST can be experimentally validated.
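The consensus-averaging step described in this abstract can be sketched as follows. This is an illustrative assumption about how per-term scores from several sources might be averaged, not QAUST's actual code: the function name, the treatment of a missing term as a score of 0, and all scores are hypothetical.

```python
def consensus_average(*source_scores):
    """Average per-term confidence scores across evidence sources.

    Each argument maps a term identifier to a score in [0, 1].
    A term missing from a source is counted as 0.0 (an assumption of this sketch).
    """
    terms = set().union(*source_scores)
    n_sources = len(source_scores)
    return {
        term: sum(scores.get(term, 0.0) for scores in source_scores) / n_sources
        for term in terms
    }

# Hypothetical scores from the three kinds of evidence named in the abstract.
structure = {"GO:0003824": 0.9, "GO:0005515": 0.3}
network = {"GO:0005515": 0.6}
motif = {"GO:0003824": 0.6, "GO:0005515": 0.3}

consensus = consensus_average(structure, network, motif)
for term in sorted(consensus):
    print(term, round(consensus[term], 3))
```

Counting a missing term as zero penalizes terms supported by only one source; a method could instead average over the sources that report the term, which would rank single-source terms higher.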
Affiliation(s)
- Fatima Zohra Smaili
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Shuye Tian
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China
| | - Ambrish Roy
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Meshari Alazmi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia; College of Computer Science and Engineering, University of Hail, Hail 55476, Saudi Arabia
| | - Stefan T Arold
- Biological and Environmental Sciences and Engineering (BESE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia
| | - Srayanta Mukherjee
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - P Scott Hefty
- Department of Molecular Bioscience, University of Kansas, Lawrence, KS 66047, USA
| | - Wei Chen
- Department of Biology, Southern University of Science and Technology of China (SUSTC), Shenzhen 518055, China.
| | - Xin Gao
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia.
|
36
|
Barot M, Gligorijević V, Cho K, Bonneau R. NetQuilt: Deep Multispecies Network-based Protein Function Prediction using Homology-informed Network Similarity. Bioinformatics 2021; 37:2414-2422. [PMID: 33576802 PMCID: PMC8388039 DOI: 10.1093/bioinformatics/btab098] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/18/2020] [Revised: 02/04/2021] [Accepted: 02/09/2021] [Indexed: 02/02/2023]
Abstract
Motivation: Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results: In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out we can use our multi-species method to make predictions for the left-out organism with good performance. Availability and implementation: The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations are available at https://string-db.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meet Barot
- Center for Data Science, New York University, New York, 10011, USA
| | | | - Kyunghyun Cho
- Center for Data Science, New York University, New York, 10011, USA
| | - Richard Bonneau
- Center for Data Science, New York University, New York, 10011, USA.,Center for Computational Biology, Flatiron Institute, New York, 10010, USA
|
37
|
Zhang C, Zheng W, Cheng M, Omenn GS, Freddolino PL, Zhang Y. Functions of Essential Genes and a Scale-Free Protein Interaction Network Revealed by Structure-Based Function and Interaction Prediction for a Minimal Genome. J Proteome Res 2021; 20:1178-1189. [PMID: 33393786 PMCID: PMC7867644 DOI: 10.1021/acs.jproteome.0c00359] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 12/11/2022]
Abstract
When the JCVI-syn3.0 genome was designed and implemented in 2016 as the minimal genome of a free-living organism, approximately one-third of the 438 protein-coding genes had no known function. Subsequent refinement into JCVI-syn3A led to inclusion of 16 additional protein-coding genes, including several unknown functions, resulting in an improved growth phenotype. Here, we seek to unveil the biological roles and protein-protein interaction (PPI) networks for these poorly characterized proteins using state-of-the-art deep learning contact-assisted structure prediction, followed by structure-based annotation of functions and PPI predictions. Our pipeline is able to confidently assign functions for many previously unannotated proteins such as putative vitamin transporters, which suggest the importance of nutrient uptake even in a minimized genome. Remarkably, despite the artificial selection of genes in the minimal syn3 genome, our reconstructed PPI network still shows a power law distribution of node degrees typical of naturally evolved bacterial PPI networks. Making use of our framework for combined structure/function/interaction modeling, we are able to identify both fundamental aspects of network biology that are retained in a minimal proteome and additional essential functions not yet recognized among the poorly annotated components of the syn3.0 and syn3A proteomes.
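A power-law distribution of node degrees, as reported in this abstract, appears as a roughly straight line when the degree distribution is plotted on log-log axes. The sketch below is a crude check under that assumption and is not the authors' analysis: the function names and the toy edge list are hypothetical, and an ordinary least-squares fit on log-log counts is only a rough diagnostic, not a rigorous power-law test.

```python
from collections import Counter
from math import log

def degree_distribution(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def loglog_slope(distribution):
    """Least-squares slope of log(count) against log(degree)."""
    xs = [log(k) for k in distribution]
    ys = [log(distribution[k]) for k in distribution]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy network (hypothetical): one hub protein bound to four partners.
edges = [("hub", "p1"), ("hub", "p2"), ("hub", "p3"), ("hub", "p4")]
distribution = degree_distribution(edges)
slope = loglog_slope(distribution)
print(dict(distribution), slope)
```

The fit needs at least two distinct degrees; a real network would supply many more points, and the negative of the fitted slope would serve as an estimate of the power-law exponent.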
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Wei Zheng
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Micah Cheng
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Departments of Internal Medicine and Human Genetics and School of Public Health, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
|
38
|
Zhang Q, Zhang Y, Li S, Han Y, Jin S, Gu H, Yu B. Accurate prediction of multi-label protein subcellular localization through multi-view feature learning with RBRL classifier. Brief Bioinform 2021; 22:6127451. [PMID: 33537726 DOI: 10.1093/bib/bbab012] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/20/2020] [Revised: 12/12/2020] [Accepted: 01/06/2021] [Indexed: 01/27/2023]
Abstract
Multi-label proteins can participate in carrier transportation, enzyme catalysis, hormone regulation and other life activities. Meanwhile, they play a key role in the fields of biopharmaceuticals, gene and cell therapy. This article proposes a prediction method called Mps-mvRBRL to predict the subcellular localization (SCL) of multi-label protein. Firstly, pseudo position-specific scoring matrix, dipeptide composition, position specific scoring matrix-transition probability composition, gene ontology and pseudo amino acid composition algorithms are used to obtain numerical information from different views. Based on the contribution of five individual feature extraction methods, differential evolution is used for the first time to learn the weight of single feature, and then these original features use a weighted combination method to fuse multi-view information. Secondly, the fused high-dimensional features use a weighted linear discriminant analysis framework based on binary weight form to eliminate irrelevant information. Finally, the best feature vector is input into the joint ranking support vector machine and binary relevance with robust low-rank learning classifier to predict the SCL. After applying leave-one-out cross-validation, the overall actual accuracy (OAA) and overall location accuracy (OLA) of Mps-mvRBRL on the training set of Gram-positive bacteria are both 99.81%. The OAA on the test sets of plant, virus and Gram-negative bacteria datasets are 97.24%, 98.55% and 98.20%, respectively, and the OLA are 97.16%, 97.62% and 98.28%, respectively. The results show that the model achieves good prediction performance for predicting the SCL of multi-label protein.
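The two headline metrics in this abstract can be made concrete with a short sketch. It assumes the definitions commonly used in the multi-label localization literature (OAA as the fraction of proteins whose predicted label set matches the true set exactly, OLA as the fraction of true locations that are recovered); the paper's own definitions should be checked before relying on this, and the function names and toy labels are hypothetical.

```python
def overall_actual_accuracy(true_sets, predicted_sets):
    """Fraction of proteins whose predicted locations exactly match the true ones."""
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if t == p)
    return exact / len(true_sets)

def overall_location_accuracy(true_sets, predicted_sets):
    """Fraction of all true locations that appear in the matching prediction."""
    recovered = sum(len(t & p) for t, p in zip(true_sets, predicted_sets))
    total = sum(len(t) for t in true_sets)
    return recovered / total

# Toy labels (hypothetical): three proteins, one with two locations.
true_sets = [{"membrane"}, {"membrane", "cytoplasm"}, {"wall"}]
predicted_sets = [{"membrane"}, {"membrane"}, {"cytoplasm"}]

oaa = overall_actual_accuracy(true_sets, predicted_sets)
ola = overall_location_accuracy(true_sets, predicted_sets)
print(f"OAA = {oaa:.3f}, OLA = {ola:.3f}")
```

OAA is the stricter of the two under these definitions, because a partially correct label set earns no credit.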
Affiliation(s)
- Qi Zhang
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Yandan Zhang
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Shan Li
- School of Mathematics and Statistics, Central South University, China
| | - Yu Han
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Shuping Jin
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Haiming Gu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
| | - Bin Yu
- College of Mathematics and Physics, Qingdao University of Science and Technology, China
|
39
|
Wang X, Shi D, Zhao D, Hu D. Aberrant Methylation and Differential Expression of SLC2A1, TNS4, GAPDH, ATP8A2, and CASZ1 Are Associated with the Prognosis of Lung Adenocarcinoma. Biomed Res Int 2020; 2020:1807089. [PMID: 33029490 PMCID: PMC7532994 DOI: 10.1155/2020/1807089] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/24/2020] [Revised: 08/31/2020] [Accepted: 09/03/2020] [Indexed: 02/06/2023]
Abstract
Lung cancer is one of the leading triggers for cancer death worldwide. In this study, the relationship of the aberrantly methylated and differentially expressed genes in lung adenocarcinoma (LUAD) with cancer prognosis was investigated, and 5 feature genes were identified eventually. Specifically, we firstly downloaded the LUAD-related mRNA expression profile (including 57 normal tissue samples and 464 LUAD tissue samples) and Methy450 expression data (including 32 normal tissue samples and 373 LUAD tissue samples) from the TCGA database. The package "limma" was used to screen differentially expressed genes and aberrantly methylated genes, which were intersected for identifying the hypermethylated downregulated genes (DGs Hyper) and the hypomethylated upregulated genes (UGs Hypo). GO annotation and KEGG pathway enrichment analysis were further performed, and it was found that these DGs Hyper and UGs Hypo were predominantly activated in the biological processes and signaling pathways such as the regulation of vasculature development, DNA-binding transcription activator activity, and Ras signaling pathway, indicating that these genes play a vital role in the initiation and progression of LUAD. Additionally, univariate and multivariate Cox regression analyses were conducted to find the genes significantly associated with LUAD prognosis. Five genes including SLC2A1, TNS4, GAPDH, ATP8A2, and CASZ1 were identified, with the former three highly expressed and the latter two poorly expressed in LUAD, indicating poor prognosis of LUAD patients as judged by survival analysis.
Affiliation(s)
- Xia Wang
- Department of Pneumology, The First People's Hospital of Fuyang, Fuyang, China
| | - Dongming Shi
- Department of Pneumology, The First People's Hospital of Fuyang, Fuyang, China
| | - Dejun Zhao
- Department of Pneumology, The First People's Hospital of Fuyang, Fuyang, China
| | - Danping Hu
- Department of Pneumology, The First People's Hospital of Fuyang, Fuyang, China
|
40
|
A Lactococcal Phage Protein Promotes Viral Propagation and Alters the Host Proteomic Response During Infection. Viruses 2020; 12:v12080797. [PMID: 32722163 PMCID: PMC7472136 DOI: 10.3390/v12080797] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 06/17/2020] [Revised: 07/21/2020] [Accepted: 07/22/2020] [Indexed: 12/13/2022]
Abstract
The lactococcal virulent phage p2 is a model for studying the Skunavirus genus, the most prevalent group of phages causing milk fermentation failures in cheese factories worldwide. This siphophage infects Lactococcus lactis MG1363, a model strain used to study Gram-positive lactic acid bacteria. The structural proteins of phage p2 have been thoroughly described, while most of its non-structural proteins remain uncharacterized. Here, we developed an integrative approach, making use of structural biology, genomics, physiology, and proteomics to provide insights into the function of ORF47, the most conserved non-structural protein of unknown function among the Skunavirus genus. This small phage protein, which is composed of three α-helices, was found to have a major impact on the bacterial proteome during phage infection and to significantly reduce the emergence of bacteriophage-insensitive mutants.
|
41
|
Hu G, Wu Z, Oldfield CJ, Wang C, Kurgan L. Quality assessment for the putative intrinsic disorder in proteins. Bioinformatics 2020; 35:1692-1700. [PMID: 30329008 DOI: 10.1093/bioinformatics/bty881] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/07/2018] [Revised: 09/19/2018] [Accepted: 10/15/2018] [Indexed: 11/15/2022]
Abstract
MOTIVATION: While putative intrinsic disorder is widely used, none of the predictors provides quality assessment (QA) scores. QA scores estimate the likelihood that predictions are correct at a residue level and have been applied in other bioinformatics areas. We recently reported that QA scores derived from putative disorder propensities perform relatively poorly for native disordered residues. Here we design and validate a general approach to construct QA predictors for disorder predictions. RESULTS: The QUARTER (QUality Assessment for pRotein inTrinsic disordEr pRedictions) toolbox of methods accommodates a diverse set of ten disorder predictors. It builds upon several innovative design elements including use and scaling of selected physicochemical properties of the input sequence, post-processing of disorder propensity scores, and a feature selection that optimizes the predictive models to a specific disorder predictor. We empirically establish that each one of these elements contributes to the overall predictive performance of our tool and that QUARTER's outputs significantly outperform QA scores derived from the outputs generated by the disorder predictors. The best performing QA scores for a single disorder predictor identify 13% of residues that are predicted with 98% precision. QA scores computed by combining results of the ten disorder predictors cover 40% of residues with 95% precision. Case studies are used to show how to interpret the QA scores. QA scores based on the high precision combined predictions are applied to analyze disorder in the human proteome. AVAILABILITY AND IMPLEMENTATION: http://biomine.cs.vcu.edu/servers/QUARTER/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
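Statements such as "13% of residues that are predicted with 98% precision" describe a coverage-at-precision trade-off. The sketch below shows one way such a figure could be computed from residue-level QA scores; it is an illustration, not QUARTER's procedure. The function name and toy data are hypothetical, and tied scores are broken by input order for simplicity.

```python
def coverage_at_precision(qa_scores, correct, target_precision):
    """Largest fraction of residues, taken in descending QA-score order,
    whose disorder predictions reach at least the target precision."""
    ranked = sorted(zip(qa_scores, correct), key=lambda pair: pair[0], reverse=True)
    best_coverage = 0.0
    hits = 0
    for rank, (_, is_correct) in enumerate(ranked, start=1):
        hits += is_correct
        if hits / rank >= target_precision:
            best_coverage = rank / len(ranked)
    return best_coverage

# Toy data (hypothetical): QA score per residue and whether the prediction was right.
qa_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
correct = [True, True, False, True, False]

coverage = coverage_at_precision(qa_scores, correct, target_precision=0.75)
print(f"coverage at 75% precision: {coverage:.0%}")
```

Raising the target precision can only shrink or preserve the covered fraction, which matches the pattern in the abstract where the stricter 98% target covers fewer residues than the 95% target.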
Affiliation(s)
- Gang Hu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China
- Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, People's Republic of China
- Chen Wang
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA

42
You R, Yao S, Xiong Y, Huang X, Sun F, Mamitsuka H, Zhu S. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res 2020; 47:W379-W387. [PMID: 31106361 PMCID: PMC6602452 DOI: 10.1093/nar/gkz388] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 03/04/2019] [Revised: 04/24/2019] [Accepted: 05/01/2019] [Indexed: 01/19/2023]
Abstract
Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a problem of the large-scale multi-label classification where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler—a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of the large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (other than only some specific species) and (iii) NetGO still can use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings of CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.
Affiliation(s)
- Ronghui You
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
- Shuwei Yao
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China
- Yi Xiong
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
- Xiaodi Huang
- School of Computing and Mathematics, Charles Sturt University, Albury, NSW 2640, Australia
- Fengzhu Sun
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China; Quantitative and Computational Biology, Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Hiroshi Mamitsuka
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan; Department of Computer Science, Aalto University, Espoo and Helsinki, Finland
- Shanfeng Zhu
- School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China

43
Cai Y, Wang J, Deng L. SDN2GO: An Integrated Deep Learning Model for Protein Function Prediction. Front Bioeng Biotechnol 2020; 8:391. [PMID: 32411695 PMCID: PMC7201018 DOI: 10.3389/fbioe.2020.00391] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 03/13/2020] [Accepted: 04/07/2020] [Indexed: 02/01/2023]
Abstract
The assignment of function to proteins at a large scale is essential for understanding the molecular mechanism of life. However, only a very small percentage of the more than 179 million proteins in UniProtKB have Gene Ontology (GO) annotations supported by experimental evidence. In this paper, we proposed an integrated deep-learning-based classification model, named SDN2GO, to predict protein functions. SDN2GO applies convolutional neural networks to learn and extract features from sequences, protein domains, and known PPI networks, and then utilizes a weight classifier to integrate these features and achieve accurate predictions of GO terms. We constructed the training set and the independent test set according to the time-delayed principle of the Critical Assessment of Function Annotation (CAFA) and compared our method with two highly competitive methods and the classic BLAST method on the independent test set. The results show that our method outperforms the others on each sub-ontology of GO. We also investigated the performance of using protein domain information. Borrowing from Natural Language Processing (NLP) to process domain information, we pre-trained a deep learning sub-model to extract the comprehensive features of domains. The experimental results demonstrate that the domain features we obtained substantially improved the performance of our model. Our deep learning models together with the data pre-processing scripts are publicly available as open source software at https://github.com/Charrick/SDN2GO.
Affiliation(s)
- Yideng Cai
- School of Computer Science and Engineering, Central South University, Changsha, China
- Jiacheng Wang
- School of Computer Science and Engineering, Central South University, Changsha, China
- Lei Deng
- School of Computer Science and Engineering, Central South University, Changsha, China
- School of Software, Xinjiang University, Urumqi, China

44
Zhao G, Liu C, Li S, Wang X, Yao Y. Exploring the flavor formation mechanism under osmotic conditions during soy sauce fermentation in Aspergillus oryzae by proteomic analysis. Food Funct 2020; 11:640-648. [PMID: 31895399 DOI: 10.1039/c9fo02314c] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 12/17/2023]
Abstract
Aspergillus oryzae is a common starter in the soy sauce industry and struggles to grow under complex fermentation conditions. However, little is known about the flavor formation mechanism under osmotic conditions (low-temperature and high-salt) in A. oryzae. This work investigated the flavors and the relative protein expression patterns by gas chromatography-mass spectrometry (GC-MS) and proteomic analysis. Low-temperature and a high-salt content are unfavorable to the secretion of hydrolases and the formation of fragrant aldehydes. The aldehyde contents under osmotic conditions were reduced to 1.4-3.7 times lower than that of the control. Besides, copper amine oxidases which decreased under low-temperature stress and salt stress were shown to be important in catalyzing the oxidative deamination of several amine substrates to fragrant aldehydes. Furthermore, alcohol dehydrogenase and polyketide synthase are beneficial to the formation of alcohols and aromatic flavors under low-temperature stress and salt stress. Particularly, the ethanol content under 16 °C stress was 3.5 times higher than that under 28 °C.
Affiliation(s)
- Guozhong Zhao
- State Key Laboratory of Food Nutrition and Safety, Key Laboratory of Food Nutrition and Safety, Ministry of Education, College of Food Science and Engineering, Tianjin University of Science & Technology, Tianjin 300457, China.

45
Yi D, Wang K, Zhu B, Li S, Liu X. Identification of neuropathic pain-associated genes and pathways via random walk with restart algorithm. J Neurosurg Sci 2020; 65:414-420. [PMID: 32536116 DOI: 10.23736/s0390-5616.20.04920-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/08/2022]
Abstract
BACKGROUND Neuropathic pain (NP) develops from neuropathic lesions or diseases affecting the nervous system, and has become a serious public health issue due to its complex symptoms, high incidence and long duration. At present, the exact pathogenesis of NP is still unclear. In this study, we sought to identify the genes as well as the related molecular mechanisms associated with NP occurrence and development. METHODS We first identified the differentially expressed genes between NP spinal nerve ligation (SNL) rats and control sham rats and then projected them onto a STRING network for functional association analysis. Then, Random Walk with Restart (RWR) was conducted to find new NP-related genes, and their potential functions were subsequently analyzed by GO annotation and KEGG pathway analysis. RESULTS Some new NP-related genes, like Gng13, C3 and Cxcl2, were identified by RWR analysis. Meanwhile, some biological functions like inflammatory responses, chemotaxis and immune responses, as well as some signaling pathways, such as those involved in neuroactive ligand-receptor interactions, complement and blood coagulation cascade reactions, and cytokine-receptor interactions, in which the new NP-related genes were most activated, were found to be associated with NP occurrence and development. CONCLUSIONS This study extends our knowledge of NP occurrence and development and provides new therapeutic targets for future NP treatment.
Affiliation(s)
- Duan Yi
- Department of Pain Medicine Center, Peking University Third Hospital, Beijing, China
- Kai Wang
- Department of Pain Medicine Center, Peking University Third Hospital, Beijing, China
- Bin Zhu
- Department of Pain Medicine Center, Peking University Third Hospital, Beijing, China
- Shuiqing Li
- Department of Pain Medicine Center, Peking University Third Hospital, Beijing, China
- Xiaoguang Liu
- Department of Orthopedic, Peking University Third Hospital, Beijing, China

46
Zhang C, Lane L, Omenn GS, Zhang Y. Blinded Testing of Function Annotation for uPE1 Proteins by I-TASSER/COFACTOR Pipeline Using the 2018-2019 Additions to neXtProt and the CAFA3 Challenge. J Proteome Res 2019; 18:4154-4166. [PMID: 31581775 PMCID: PMC6900986 DOI: 10.1021/acs.jproteome.9b00537] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 12/19/2022]
Abstract
In 2018, we reported a hybrid pipeline that predicts protein structures with I-TASSER and function with COFACTOR. I-TASSER/COFACTOR achieved high Gene Ontology (GO) prediction accuracies of Fmax = 0.69 and 0.57 for molecular function (MF) and biological process (BP), respectively, on 100 comprehensively annotated proteins. Now we report blinded analyses of newly annotated proteins in the third Critical Assessment of Function Annotation (CAFA3) function prediction challenge and in neXtProt. For CAFA3 results released in May 2019, our predictions on 267 and 912 human proteins with newly annotated MF and BP terms achieved Fmax = 0.50 and 0.42, respectively, on "No Knowledge" proteins, and 0.51 and 0.74, respectively, on "Limited Knowledge" proteins. While COFACTOR consistently outperforms simple homology-based analysis, its accuracy still depends on template availability. Meanwhile, in neXtProt 2019-01, 25 proteins acquired new function annotation through literature curation at UniProt/Swiss-Prot. Before the release of these curated results, we submitted to neXtProt blinded predictions of free-text function annotation based on predicted GO terms. For 10 of the 25, a good match of free-text or GO term annotation was obtained. These blind tests represent rigorous assessments of I-TASSER/COFACTOR. neXtProt now provides links to precomputed I-TASSER/COFACTOR predictions for proteins without function annotation to facilitate experimental planning on "dark proteins".
Affiliation(s)
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109-2218, United States
- Lydie Lane
- CALIPHO Group, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Department of Microbiology and Molecular Medicine, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Gilbert S. Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109-2218, United States
- Departments of Internal Medicine and Human Genetics and School of Public Health, University of Michigan, Ann Arbor, Michigan 48109-2218, United States
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109-2218, United States
- Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109-2218, United States

47
Wang Y, Shi Q, Yang P, Zhang C, Mortuza SM, Xue Z, Ning K, Zhang Y. Fueling ab initio folding with marine metagenomics enables structure and function predictions of new protein families. Genome Biol 2019; 20:229. [PMID: 31676016 PMCID: PMC6825341 DOI: 10.1186/s13059-019-1823-z] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 01/07/2019] [Accepted: 09/13/2019] [Indexed: 02/01/2023]
Abstract
INTRODUCTION The ocean microbiome represents one of the largest microbiomes and produces nearly half of the primary energy on the planet through photosynthesis or chemosynthesis. Using recent advances in marine genomics, we explore new applications of oceanic metagenomes for protein structure and function prediction. RESULTS By processing 1.3 TB of high-quality reads from the Tara Oceans data, we obtain 97 million non-redundant genes. Of the 5721 Pfam families that lack experimental structures, 2801 have at least one member associated with the oceanic metagenomics dataset. We apply C-QUARK, a deep-learning contact-guided ab initio structure prediction pipeline, to model 27 families, where 20 are predicted to have a reliable fold with estimated template modeling score (TM-score) at least 0.5. Detailed analyses reveal that the abundance of microbial genera in the ocean is highly correlated to the frequency of occurrence in the modeled Pfam families, suggesting the significant role of the Tara Oceans genomes in the contact-map prediction and subsequent ab initio folding simulations. Of interesting note, PF15461, which has a majority of members coming from ocean-related bacteria, is identified as an important photosynthetic protein by structure-based function annotations. The pipeline is extended to a set of 417 Pfam families, built on the combination of Tara with other metagenomics datasets, which results in 235 families with an estimated TM-score over 0.5. CONCLUSIONS These results demonstrate a new avenue to improve the capacity of protein structure and function modeling through marine metagenomics, especially for difficult proteins with few homologous sequences.
Affiliation(s)
- Yan Wang
- College of Life Science and Technology and College of Software, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Qiang Shi
- College of Life Science and Technology and College of Software, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
- Pengshuo Yang
- College of Life Science and Technology and College of Software, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
- Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- S M Mortuza
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Zhidong Xue
- College of Life Science and Technology and College of Software, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
- Kang Ning
- College of Life Science and Technology and College of Software, Huazhong University of Science and Technology, Wuhan, 430074, Hubei, China
- Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, 48109, USA
- Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, 48109, USA

48
Lloyd Evans D, Hlongwane TT, Joshi SV, Riaño Pachón DM. The sugarcane mitochondrial genome: assembly, phylogenetics and transcriptomics. PeerJ 2019; 7:e7558. [PMID: 31579570 PMCID: PMC6764373 DOI: 10.7717/peerj.7558] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 03/12/2019] [Accepted: 07/26/2019] [Indexed: 12/14/2022]
Abstract
BACKGROUND Chloroplast genomes provide insufficient phylogenetic information to distinguish between closely related sugarcane cultivars, due to the recent origin of many cultivars and the conserved sequence of the chloroplast. In comparison, the mitochondrial genome of plants is much larger and more plastic and could contain increased phylogenetic signals. We assembled a consensus reference mitochondrion with Illumina TruSeq synthetic long reads and Oxford Nanopore Technologies MinION long reads. Based on this assembly we also analyzed the mitochondrial transcriptomes of sugarcane and sorghum and improved the annotation of the sugarcane mitochondrion as compared with other species. METHODS Mitochondrial genomes were assembled from genomic read pools using a bait and assemble methodology. The mitogenome was exhaustively annotated using BLAST and transcript datasets were mapped with HISAT2 prior to analysis with the Integrated Genome Viewer. RESULTS The sugarcane mitochondrion is comprised of two independent chromosomes, for which there is no evidence of recombination. Based on the reference assembly from the sugarcane cultivar SP80-3280 the mitogenomes of four additional cultivars (R570, LCP85-384, RB72343 and SP70-1143) were assembled (with the SP70-1143 assembly utilizing both genomic and transcriptomic data). We demonstrate that the sugarcane plastome is completely transcribed and we assembled the chloroplast genome of SP80-3280 using transcriptomic data only. Phylogenomic analysis using mitogenomes allow closely related sugarcane cultivars to be distinguished and supports the discrimination between Saccharum officinarum and Saccharum cultum as modern sugarcane's female parent. From whole chloroplast comparisons, we demonstrate that modern sugarcane arose from a limited number of Saccharum cultum female founders. 
Transcriptomic and spliceosomal analyses reveal that the two chromosomes of the sugarcane mitochondrion are combined at the transcript level and that splice sites occur more frequently within gene coding regions than without. We reveal one confirmed and one potential cytoplasmic male sterility (CMS) factor in the sugarcane mitochondrion, both of which are transcribed. CONCLUSION Transcript processing in the sugarcane mitochondrion is highly complex with diverse splice events, the majority of which span the two chromosomes. PolyA baited transcripts are consistent with the use of polyadenylation for transcript degradation. For the first time we annotate two CMS factors within the sugarcane mitochondrion and demonstrate that sugarcane possesses all the molecular machinery required for CMS and rescue. A mechanism of cross-chromosomal splicing based on guide RNAs is proposed. We also demonstrate that mitogenomes can be used to perform phylogenomic studies on sugarcane cultivars.
Affiliation(s)
- Dyfed Lloyd Evans
- Plant Breeding, South African Sugarcane Research Institute, Durban, KwaZulu-Natal, South Africa
- Cambridge Sequence Services (CSS), Waterbeach, Cambridgeshire, UK
- Department of Computer Sciences, Université Cheikh Anta Diop de Dakar, Dakar, Sénégal
- Shailesh V. Joshi
- Plant Breeding, South African Sugarcane Research Institute, Durban, KwaZulu-Natal, South Africa
- School of Life Sciences, College of Agriculture Engineering and Science, University of KwaZulu-Natal, Durban, KwaZulu-Natal, South Africa
- Diego M. Riaño Pachón
- Computational, Evolutionary and Systems Biology Laboratory, Center for Nuclear Energy in Agriculture, University of São Paulo, Piracicaba, São Paulo, Brazil

49
Wan C, Cozzetto D, Fa R, Jones DT. Using deep maxout neural networks to improve the accuracy of function prediction from protein interaction networks. PLoS One 2019; 14:e0209958. [PMID: 31335894 PMCID: PMC6650051 DOI: 10.1371/journal.pone.0209958] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/13/2018] [Accepted: 07/01/2019] [Indexed: 12/02/2022]
Abstract
Protein-protein interaction network data provides valuable information that infers direct links between genes and their biological roles. This information brings a fundamental hypothesis for protein function prediction that interacting proteins tend to have similar functions. With the help of recently-developed network embedding feature generation methods and deep maxout neural networks, it is possible to extract functional representations that encode direct links between protein-protein interactions information and protein function. Our novel method, STRING2GO, successfully adopts deep maxout neural networks to learn functional representations simultaneously encoding both protein-protein interactions and functional predictive information. The experimental results show that STRING2GO outperforms other protein-protein interaction network-based prediction methods and one benchmark method adopted in a recent large scale protein function prediction competition.
Affiliation(s)
- Cen Wan
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, United Kingdom
- Domenico Cozzetto
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, United Kingdom
- Rui Fa
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, United Kingdom
- David T. Jones
- Bioinformatics Group, Department of Computer Science, University College London, London, United Kingdom
- Biomedical Data Science Laboratory, The Francis Crick Institute, London, United Kingdom

50
Gandhi Muruganandhan S, Manian R. Computational and artificial neural network based study of functional SNPs of human LEPR protein associated with reproductive function. J Cell Biochem 2019; 120:18910-18926. [PMID: 31237021 DOI: 10.1002/jcb.29212] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 02/14/2019] [Accepted: 05/31/2019] [Indexed: 01/22/2023]
Abstract
Genetic polymorphisms are mostly associated with inherited diseases. Detecting and analyzing the biological significance of functional single-nucleotide polymorphisms (SNPs) using wet laboratory experiments is an arduous task; hence, the computational analysis of putative SNPs is essential before conducting a study on a large population. SNPs in the leptin receptor (LEPR) could result in the retention of intracellular signalling due to the structural and functional instability of the receptor, causing abnormal reproductive function in humans. In this first comprehensive computational analysis of LEPR gene mutation, we have identified and analyzed the functional consequence and structural significance of the SNPs in LEPR using several recently developed computational algorithms. Thirteen deleterious mutations (W13C, S93G, I232R, Q307H, Y354C, E497A, Q571H, R612H, K656N, T690A, T699M, V741M, and L760R) were identified in the LEPR gene coding region. A backpropagation algorithm has been developed to forecast the deleterious nature of SNPs and to validate the outcome of the tested computational tools. From the ConSurf prediction, three SNPs (Q571H, R612H, and T699M) were highly conserved in the LEPR protein, and the most deleterious variant, R612H, had one hydrogen bond abolished and severely reduced protein stability. Molecular docking suggested that the mutant (R612H) LEPR had lower binding energy than native LEPR with the ligand molecule. Thus the energetically destructive changeover of ARG to HIS in R612H could possibly affect the LEPR protein structural stability and functional constancy due to interruption in the amino acid interactions, and could result in reproductive disorders in humans and increased complications in obstetric and pregnancy outcomes.
Affiliation(s)
- Rameshpathy Manian
- Department of Industrial Biotechnology, Vellore Institute of Technology, Vellore, Tamil Nadu, India