1
|
Yin J, Waman VP, Sen N, Firdaus-Raih M, Lam SD, Orengo C. Understanding the structural and functional diversity of ATP-PPases using protein domains and functional families in the CATH database. Structure 2025; 33:613-631.e6. [PMID: 39826548 DOI: 10.1016/j.str.2024.12.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/18/2024] [Accepted: 12/19/2024] [Indexed: 01/22/2025]
Abstract
ATP-pyrophosphatases (ATP-PPases) are the most primordial lineage of the large and diverse HUP (high-motif proteins, universal stress proteins, ATP-pyrophosphatase) superfamily. There are four different ATP-PPase substrate-specificity groups (SSGs), and members of each group show considerable sequence variation across the domains of life despite sharing the same catalytic function. Owing to the expansion in the number of ATP-PPase domain structures from advances in protein structure prediction by AlphaFold2 (AF2), we have characterized the two most populated ATP-PPase SSGs, the nicotinamide adenine dinucleotide synthases (NADSs) and guanosine monophosphate synthases (GMPSs). Local structural and sequence comparisons of NADS and GMPS identified taxonomic-group-specific functional motifs. As GMPS and NADS are potential drug targets of pathogenic microorganisms including Mycobacterium tuberculosis, bacterial GMPS and NADS specific functional motifs reported in this study, may contribute to antibacterial-drug development.
Collapse
Affiliation(s)
- Jialin Yin
- Department of Structural and Molecular Biology, University College London, London, UK
| | - Vaishali P Waman
- Department of Structural and Molecular Biology, University College London, London, UK
| | - Neeladri Sen
- Department of Structural and Molecular Biology, University College London, London, UK
| | - Mohd Firdaus-Raih
- Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Malaysia
| | - Su Datt Lam
- Department of Structural and Molecular Biology, University College London, London, UK; Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Malaysia
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, London, UK.
| |
Collapse
|
2
|
Li J, Chen X, Huang H, Zeng M, Yu J, Gong X, Ye Q. $\mathcal{S}$ able: bridging the gap in protein structure understanding with an empowering and versatile pre-training paradigm. Brief Bioinform 2025; 26:bbaf120. [PMID: 40163822 PMCID: PMC11957296 DOI: 10.1093/bib/bbaf120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2024] [Revised: 01/23/2025] [Accepted: 02/23/2025] [Indexed: 04/02/2025] Open
Abstract
Protein pre-training has emerged as a transformative approach for solving diverse biological tasks. While many contemporary methods focus on sequence-based language models, recent findings highlight that protein sequences alone are insufficient to capture the extensive information inherent in protein structures. Recognizing the crucial role of protein structure in defining function and interactions, we introduce $\mathcal{S}$able, a versatile pre-training model designed to comprehensively understand protein structures. $\mathcal{S}$able incorporates a novel structural encoding mechanism that enhances inter-atomic information exchange and spatial awareness, combined with robust pre-training strategies and lightweight decoders optimized for specific downstream tasks. This approach enables $\mathcal{S}$able to consistently outperform existing methods in tasks such as generation, classification, and regression, demonstrating its superior capability in protein structure representation. The code and models can be accessed via GitHub repository at https://github.com/baaihealth/Sable.
Collapse
Affiliation(s)
- Jiashan Li
- Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
| | - Xi Chen
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - He Huang
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Mingliang Zeng
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Jingcheng Yu
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| | - Xinqi Gong
- Institute for Mathematical Sciences, Renmin University of China, 59 Zhongguancun Street, Beijing 100872, China
| | - Qiwei Ye
- Bio Computing Center, Beijing Academy of Artificial Intelligence, 150 Chengfu Road, Beijing 100084, China
| |
Collapse
|
3
|
Asghar R, Wu N, Ali N, Wang Y, Akkaya M. Computational studies reveal structural characterization and novel families of Puccinia striiformis f. sp. tritici effectors. PLoS Comput Biol 2025; 21:e1012503. [PMID: 40153705 PMCID: PMC11952758 DOI: 10.1371/journal.pcbi.1012503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2024] [Accepted: 02/24/2025] [Indexed: 03/30/2025] Open
Abstract
Understanding the biological functions of Puccinia striiformis f. sp. tritici (Pst) effectors is fundamental for uncovering the mechanisms of pathogenicity and variability, thereby paving the way for developing durable and effective control strategies for stripe rust. However, due to the lack of an efficient genetic transformation system in Pst, progress in effector function studies has been slow. Here, we modeled the structures of 15,201 effectors from twelve Pst races or isolates, a Puccinia striiformis isolate, and one Puccinia striiformis f. sp. hordei isolate using AlphaFold2. Of these, 8,102 folds were successfully predicted, and we performed sequence- and structure-based annotations of these effectors. These effectors were classified into 410 structure clusters and 1,005 sequence clusters. Sequence lengths varied widely, with a concentration between 101-250 amino acids, and motif analysis revealed that 47% and 5.81% of the predicted effectors contain known effector motifs [Y/F/W]xC and RxLR, respectively highlighting the structural conservation across a substantial portion of the effectors. Subcellular localization predictions indicated a predominant cytoplasmic localization, with notable chloroplast and nuclear presence. Structure-guided analysis significantly enhances effector prediction efficiency as demonstrated by the 75% among 8,102 have structural annotation. The clustering and annotation prediction both based on the sequence and structure homologies allowed us to determine the adopted folding or fold families of the effectors. A common feature observed was the formation of structural homologies from different sequences. In our study, one of the comparative structural analyses revealed a new structure family with a core structure of four helices, including Pst27791, PstGSRE4, and PstSIE1, which target key wheat immune pathway proteins, impacting the host immune functions. Further comparative structural analysis showed similarities between Pst effectors and effectors from other pathogens, such as AvrSr35, AvrSr50, Zt-KP4-1, and MoHrip2, highlighting a possibility of convergent evolutionary strategies, yet to be supported by further data encompassing on some evolutionarily distant species. Currently, our initial analysis is the most one on Pst effectors' sequence, structural and annotation relationships providing a novel foundation to advance our future understanding of Pst pathogenicity and evolution.
Collapse
Affiliation(s)
- Raheel Asghar
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Nan Wu
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Noman Ali
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Yulei Wang
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
| | - Mahinur Akkaya
- School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
| |
Collapse
|
4
|
Lee Y, Gao P, Xu Y, Wang Z, Li S, Chen J. MEGA-GO: functions prediction of diverse protein sequence length using Multi-scalE Graph Adaptive neural network. Bioinformatics 2025; 41:btaf032. [PMID: 39847542 PMCID: PMC11810639 DOI: 10.1093/bioinformatics/btaf032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2024] [Revised: 01/13/2025] [Accepted: 01/21/2025] [Indexed: 01/25/2025] Open
Abstract
MOTIVATION The increasing accessibility of large-scale protein sequences through advanced sequencing technologies has necessitated the development of efficient and accurate methods for predicting protein function. Computational prediction models have emerged as a promising solution to expedite the annotation process. However, despite making significant progress in protein research, graph neural networks face challenges in capturing long-range structural correlations and identifying critical residues in protein graphs. Furthermore, existing models have limitations in effectively predicting the function of newly sequenced proteins that are not included in protein interaction networks. This highlights the need for novel approaches integrating protein structure and sequence data. RESULTS We introduce Multi-scalE Graph Adaptive neural network (MEGA-GO), highlighting the capability of capturing diverse protein sequence length features from multiple scales. The unique graph adaptive neural network architecture of MEGA-GO enables a more nuanced extraction of graph structure features, effectively capturing intricate relationships within biological data. Experimental results demonstrate that MEGA-GO outperforms mainstream protein function prediction models in the accuracy of Gene Ontology term classification, yielding 33.4%, 68.9%, and 44.6% of area under the precision-recall curve on biological process, molecular function, and cellular component domains, respectively. The rest of the experimental results reveal that our model consistently surpasses the state-of-the-art methods. AVAILABILITY AND IMPLEMENTATION The source code and data of MEGA-GO are available at https://github.com/Cheliosoops/MEGA-GO.
Collapse
Affiliation(s)
- Yujian Lee
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Peng Gao
- Department of Computer Science, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| | - Yongqi Xu
- Department of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510520, China
| | - Ziyang Wang
- Department of Science of Chinese Materia Medica, Guangdong Medical University, Dongguan 524023, China
| | - Shuaicheng Li
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Jiaxing Chen
- Guangdong Provincial Key Laboratory IRADS, Beijing Normal University-Hong Kong Baptist University United International College, Zhuhai 519087, China
| |
Collapse
|
5
|
Blum M, Andreeva A, Florentino L, Chuguransky S, Grego T, Hobbs E, Pinto B, Orr A, Paysan-Lafosse T, Ponamareva I, Salazar G, Bordin N, Bork P, Bridge A, Colwell L, Gough J, Haft D, Letunic I, Llinares-López F, Marchler-Bauer A, Meng-Papaxanthos L, Mi H, Natale D, Orengo C, Pandurangan A, Piovesan D, Rivoire C, Sigrist CA, Thanki N, Thibaud-Nissen F, Thomas P, Tosatto SE, Wu C, Bateman A. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res 2025; 53:D444-D456. [PMID: 39565202 PMCID: PMC11701551 DOI: 10.1093/nar/gkae1082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Revised: 10/11/2024] [Accepted: 10/23/2024] [Indexed: 11/21/2024] Open
Abstract
InterPro (https://www.ebi.ac.uk/interpro) is a freely accessible resource for the classification of protein sequences into families. It integrates predictive models, known as signatures, from multiple member databases to classify sequences into families and predict the presence of domains and significant sites. The InterPro database provides annotations for over 200 million sequences, ensuring extensive coverage of UniProtKB, the standard repository of protein sequences, and includes mappings to several other major resources, such as Gene Ontology (GO), Protein Data Bank in Europe (PDBe) and the AlphaFold Protein Structure Database. In this publication, we report on the status of InterPro (version 101.0), detailing new developments in the database, associated web interface and software. Notable updates include the increased integration of structures predicted by AlphaFold and the enhanced description of protein families using artificial intelligence. Over the past two years, more than 5000 new InterPro entries have been created. The InterPro website now offers access to 85 000 protein families and domains from its member databases and serves as a long-term archive for retired databases. InterPro data, software and tools are freely available.
Collapse
Affiliation(s)
- Matthias Blum
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Antonina Andreeva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Laise Cavalcanti Florentino
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Sara Rocio Chuguransky
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Tiago Grego
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Emma Hobbs
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Beatriz Lazaro Pinto
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Ailsa Orr
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Typhaine Paysan-Lafosse
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Irina Ponamareva
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Gustavo A Salazar
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Nicola Bordin
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Peer Bork
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Meyerhofstraße 1, 69117 Heidelberg, Germany
| | - Alan Bridge
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva, Switzerland
| | | | - Julian Gough
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Daniel H Haft
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Ivica Letunic
- Biobyte Solutions GmbH, Bothestr 142, 69126 Heidelberg, Germany
| | | | - Aron Marchler-Bauer
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | | | - Huaiyu Mi
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
| | - Darren A Natale
- Protein Information Resource, Georgetown University Medical Center, WA, DC 20007, USA
| | - Christine A Orengo
- Department of Structural and Molecular Biology, University College London, Gower St, Bloomsbury, London WC1E 6BT, UK
| | - Arun P Pandurangan
- Medical Research Council Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge CB2 0QH, UK
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
| | - Catherine Rivoire
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva, Switzerland
| | - Christian J A Sigrist
- Swiss-Prot Group, Swiss Institute of Bioinformatics, CMU, 1 rue Michel Servet, CH-1211, Geneva, Switzerland
| | - Narmada Thanki
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Françoise Thibaud-Nissen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Paul D Thomas
- Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padova 35121, Italy
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari 70126, Italy
| | - Cathy H Wu
- Protein Information Resource, Georgetown University Medical Center, WA, DC 20007, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| |
Collapse
|
6
|
Waman V, Bordin N, Lau A, Kandathil S, Wells J, Miller D, Velankar S, Jones D, Sillitoe I, Orengo C. CATH v4.4: major expansion of CATH by experimental and predicted structural data. Nucleic Acids Res 2025; 53:D348-D355. [PMID: 39565206 PMCID: PMC11701635 DOI: 10.1093/nar/gkae1087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2024] [Revised: 10/18/2024] [Accepted: 10/24/2024] [Indexed: 11/21/2024] Open
Abstract
CATH (https://www.cathdb.info) is a structural classification database that assigns domains to the structures in the Protein Data Bank (PDB) and AlphaFold Protein Structure Database (AFDB) and adds layers of biological information, including homology and functional annotation. This article covers developments in the CATH classification since 2021. We report the significant expansion of structural information (180-fold) for CATH superfamilies through classification of PDB domains and predicted domain structures from the Encyclopedia of Domains (TED) resource. TED provides information on predicted domains in AFDB. CATH v4.4 represents an expansion of ∼64 844 experimentally determined domain structures from PDB. We also present a mapping of ∼90 million predicted domains from TED to CATH superfamilies. New PDB and TED data increases the number of superfamilies from 5841 to 6573, folds from 1349 to 2078 and architectures from 41 to 77. TED data comprises predicted structures, so these new folds and architectures remain hypothetical until experimentally confirmed. CATH also classifies domains into functional families (FunFams) within a superfamily. We have updated sequences in FunFams by scanning FunFam-HMMs against UniProt release 2024_02, giving a 276% increase in FunFams coverage. The mapping of TED structural domains has resulted in a 4-fold increase in FunFams with structural information.
Collapse
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Andy Lau
- Department of Computer Science, University College London, London WC1E 6BT, UK
- InstaDeep Ltd, 5 Merchant Square, London W2 1AY, UK
| | - Shaun Kandathil
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Jude Wells
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
| | - David Miller
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
| | - David T Jones
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| |
Collapse
|
7
|
de Crécy-Lagard V, Dias R, Friedberg I, Yuan Y, Swairjo MA. Limitations of Current Machine-Learning Models in Predicting Enzymatic Functions for Uncharacterized Proteins. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.01.601547. [PMID: 39005379 PMCID: PMC11244979 DOI: 10.1101/2024.07.01.601547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/16/2024]
Abstract
Thirty to seventy percent of proteins in any given genome have no assigned function and have been labeled as the protein "unknome". This large knowledge gap prevents the biological community from fully leveraging the plethora of genomic data that is now available. Machine-learning approaches are showing some promise in propagating functional knowledge from experimentally characterized proteins to the correct set of isofunctional orthologs. However, they largely fail to predict enzymatic functions unseen in the training set, as shown by dissecting the predictions made for over 450 enzymes of unknown function from the model bacteria Escherichia coli uxgsing the DeepECTransformer platform. Lessons from these failures can help the community develop machine-learning methods that assist domain experts in making testable functional predictions for more members of the uncharacterized proteome. Article Summary Many proteins in any genome, ranging from 30 to 70%, lack an assigned function. This knowledge gap limits the full use of the vast available genomic data. Machine learning has shown promise in transferring functional knowledge from proteins of known functions to similar ones, but largely fails to predict novel functions not seen in its training data. Understanding these failures can guide the development of better machine-learning methods to help experts make accurate functional predictions for uncharacterized proteins.
Collapse
|
8
|
Meng L, Wang X. TAWFN: a deep learning framework for protein function prediction. Bioinformatics 2024; 40:btae571. [PMID: 39312678 PMCID: PMC11639667 DOI: 10.1093/bioinformatics/btae571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Revised: 08/27/2024] [Accepted: 09/19/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION Proteins play pivotal roles in biological systems, and precise prediction of their functions is indispensable for practical applications. Despite the surge in protein sequence data facilitated by high-throughput techniques, unraveling the exact functionalities of proteins still demands considerable time and resources. Currently, numerous methods rely on protein sequences for prediction, while methods targeting protein structures are scarce, often employing convolutional neural networks (CNN) or graph convolutional networks (GCNs) individually. RESULTS To address these challenges, our approach starts from protein structures and proposes a method that combines CNN and GCN into a unified framework called the two-model adaptive weight fusion network (TAWFN) for protein function prediction. First, amino acid contact maps and sequences are extracted from the protein structure. Then, the sequence is used to generate one-hot encoded features and deep semantic features. These features, along with the constructed graph, are fed into the adaptive graph convolutional networks (AGCN) module and the multi-layer convolutional neural network (MCNN) module as needed, resulting in preliminary classification outcomes. Finally, the preliminary classification results are inputted into the adaptive weight computation network, where adaptive weights are calculated to fuse the initial predictions from both networks, yielding the final prediction result. To evaluate the effectiveness of our method, experiments were conducted on the PDBset and AFset datasets. For molecular function, biological process, and cellular component tasks, TAWFN achieved area under the precision-recall curve (AUPR) values of 0.718, 0.385, and 0.488 respectively, with corresponding Fmax scores of 0.762, 0.628, and 0.693, and Smin scores of 0.326, 0.483, and 0.454. The experimental results demonstrate that TAWFN exhibits promising performance, outperforming existing methods. AVAILABILITY AND IMPLEMENTATION The TAWFN source code can be found at: https://github.com/ss0830/TAWFN.
Collapse
Affiliation(s)
- Lu Meng
- College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
| | - Xiaoran Wang
- College of Information Science and Engineering, Northeastern University, Shenyang, Liaoning, 110000, China
| |
Collapse
|
9
|
Bordin N, Scholes H, Rauer C, Roca-Martínez J, Sillitoe I, Orengo C. Clustering protein functional families at large scale with hierarchical approaches. Protein Sci 2024; 33:e5140. [PMID: 39145441 PMCID: PMC11325189 DOI: 10.1002/pro.5140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/16/2024]
Abstract
Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
Collapse
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Harry Scholes
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, UK
- Universidad Autonoma de Madrid, Ciudad Universitaria de Cantoblanco, Madrid, Spain
| | - Joel Roca-Martínez
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, UK
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, UK
| |
Collapse
|
10
|
Waman VP, Bordin N, Alcraft R, Vickerstaff R, Rauer C, Chan Q, Sillitoe I, Yamamori H, Orengo C. CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds. J Mol Biol 2024; 436:168551. [PMID: 38548261 DOI: 10.1016/j.jmb.2024.168551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 03/20/2024] [Accepted: 03/22/2024] [Indexed: 04/07/2024]
Abstract
CATH (https://www.cathdb.info) classifies domain structures from experimental protein structures in the PDB and predicted structures in the AlphaFold Database (AFDB). To cope with the scale of the predicted data a new NextFlow workflow (CATH-AlphaFlow), has been developed to classify high-quality domains into CATH superfamilies and identify novel fold groups and superfamilies. CATH-AlphaFlow uses a novel state-of-the-art structure-based domain boundary prediction method (ChainSaw) for identifying domains in multi-domain proteins. We applied CATH-AlphaFlow to process PDB structures not classified in CATH and AFDB structures from 21 model organisms, expanding CATH by over 100%. Domains not classified in existing CATH superfamilies or fold groups were used to seed novel folds, giving 253 new folds from PDB structures (September 2023 release) and 96 from AFDB structures of proteomes of 21 model organisms. Where possible, functional annotations were obtained using (i) predictions from publicly available methods (ii) annotations from structural relatives in AFDB/UniProt50. We also predicted functional sites and highly conserved residues. Some folds are associated with important functions such as photosynthetic acclimation (in flowering plants), iron permease activity (in fungi) and post-natal spermatogenesis (in mice). CATH-AlphaFlow will allow us to identify many more CATH relatives in the AFDB, further characterising the protein structure landscape.
Collapse
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Rachel Alcraft
- Advanced Research Computing Centre, University College London, London, United Kingdom
| | - Robert Vickerstaff
- Advanced Research Computing Centre, University College London, London, United Kingdom
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Qian Chan
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Hazuki Yamamori
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom.
| |
Collapse
|
11
|
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024] Open
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present a method that utilizes statistics-informed graph networks to predict protein functions solely from its sequence. Our method inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Collapse
Affiliation(s)
- Yaan J Jang
- Department of Biochemistry, University of Oxford, Oxford, UK.
- AmoAi Technologies, Oxford, UK.
| | - Qi-Qi Qin
- AmoAi Technologies, Oxford, UK
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
| | - Si-Yu Huang
- AmoAi Technologies, Oxford, UK
- Oxford Martin School, University of Oxford, Oxford, UK
- School of Systems Science, Beijing Normal University, Beijing, China
| | | | - Xue-Ming Ding
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
| | - Benoît Kornmann
- Department of Biochemistry, University of Oxford, Oxford, UK.
| |
Collapse
|
12
|
Waman VP, Ashford P, Lam SD, Sen N, Abbasian M, Woodridge L, Goldtzvik Y, Bordin N, Wu J, Sillitoe I, Orengo CA. Predicting human and viral protein variants affecting COVID-19 susceptibility and repurposing therapeutics. Sci Rep 2024; 14:14208. [PMID: 38902252 PMCID: PMC11190248 DOI: 10.1038/s41598-024-61541-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Accepted: 05/07/2024] [Indexed: 06/22/2024] Open
Abstract
The COVID-19 disease is an ongoing global health concern. Although vaccination provides some protection, people are still susceptible to re-infection. Ostensibly, certain populations or clinical groups may be more vulnerable. Factors causing these differences are unclear and whilst socioeconomic and cultural differences are likely to be important, human genetic factors could influence susceptibility. Experimental studies indicate SARS-CoV-2 uses innate immune suppression as a strategy to speed-up entry and replication into the host cell. Therefore, it is necessary to understand the impact of variants in immunity-associated human proteins on susceptibility to COVID-19. In this work, we analysed missense coding variants in several SARS-CoV-2 proteins and their human protein interactors that could enhance binding affinity to SARS-CoV-2. We curated a dataset of 19 SARS-CoV-2: human protein 3D-complexes, from the experimentally determined structures in the Protein Data Bank and models built using AlphaFold2-multimer, and analysed the impact of missense variants occurring in the protein-protein interface region. We analysed 468 missense variants from human proteins and 212 variants from SARS-CoV-2 proteins and computationally predicted their impacts on binding affinities for the human viral protein complexes. We predicted a total of 26 affinity-enhancing variants from 13 human proteins implicated in increased binding affinity to SARS-CoV-2. These include key-immunity associated genes (TOMM70, ISG15, IFIH1, IFIT2, RPS3, PALS1, NUP98, AXL, ARF6, TRIMM, TRIM25) as well as important spike receptors (KREMEN1, AXL and ACE2). We report both common (e.g., Y13N in IFIH1) and rare variants in these proteins and discuss their likely structural and functional impact, using information on known and predicted functional sites. Potential mechanisms associated with immune suppression implicated by these variants are discussed. Occurrence of certain predicted affinity-enhancing variants should be monitored as they could lead to increased susceptibility and reduced immune response to SARS-CoV-2 infection in individuals/populations carrying them. Our analyses aid in understanding the potential impact of genetic variation in immunity-associated proteins on COVID-19 susceptibility and help guide drug-repurposing strategies.
Collapse
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Paul Ashford
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Su Datt Lam
- Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Laurel Woodridge
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Yonathan Goldtzvik
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Jiaxin Wu
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
| |
Collapse
|
13
|
Ulusoy E, Doğan T. Mutual annotation-based prediction of protein domain functions with Domain2GO. Protein Sci 2024; 33:e4988. [PMID: 38757367 PMCID: PMC11099699 DOI: 10.1002/pro.4988] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 02/25/2024] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Identifying unknown functional properties of proteins is essential for understanding their roles in both health and disease states. The domain composition of a protein can reveal critical information in this context, as domains are structural and functional units that dictate how the protein should act at the molecular level. The expensive and time-consuming nature of wet-lab experimental approaches prompted researchers to develop computational strategies for predicting the functions of proteins. In this study, we proposed a new method called Domain2GO that infers associations between protein domains and function-defining gene ontology (GO) terms, thus redefining the problem as domain function prediction. Domain2GO uses documented protein-level GO annotations together with proteins' domain annotations. Co-annotation patterns of domains and GO terms in the same proteins are examined using statistical resampling to obtain reliable associations. As a use-case study, we evaluated the biological relevance of examples selected from the Domain2GO-generated domain-GO term mappings via literature review. Then, we applied Domain2GO to predict unknown protein functions by propagating domain-associated GO terms to proteins annotated with these domains. For function prediction performance evaluation and comparison against other methods, we employed Critical Assessment of Function Annotation 3 (CAFA3) challenge datasets. The results demonstrated the high potential of Domain2GO, particularly for predicting molecular function and biological process terms, along with advantages such as producing interpretable results and having an exceptionally low computational cost. The approach presented here can be extended to other ontologies and biological entities to investigate unknown relationships in complex and large-scale biological data. The source code, datasets, results, and user instructions for Domain2GO are available at https://github.com/HUBioDataLab/Domain2GO. Additionally, we offer a user-friendly online tool at https://huggingface.co/spaces/HUBioDataLab/Domain2GO, which simplifies the prediction of functions of previously unannotated proteins solely using amino acid sequences.
Collapse
Affiliation(s)
- Erva Ulusoy
- Biological Data Science Lab, Department of Computer EngineeringHacettepe UniversityAnkaraTurkey
- Department of BioinformaticsGraduate School of Health Sciences, Hacettepe UniversityAnkaraTurkey
| | - Tunca Doğan
- Biological Data Science Lab, Department of Computer EngineeringHacettepe UniversityAnkaraTurkey
- Department of BioinformaticsGraduate School of Health Sciences, Hacettepe UniversityAnkaraTurkey
| |
Collapse
|
14
|
Bonello J, Orengo C. FunPredCATH: An ensemble method for predicting protein function using CATH. BIOCHIMICA ET BIOPHYSICA ACTA. PROTEINS AND PROTEOMICS 2024; 1872:140985. [PMID: 38122964 DOI: 10.1016/j.bbapap.2023.140985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/07/2023] [Revised: 12/05/2023] [Accepted: 12/06/2023] [Indexed: 12/23/2023]
Abstract
MOTIVATION The growth of unannotated proteins in UniProt increases at a very high rate every year due to more efficient sequencing methods. However, the experimental annotation of proteins is a lengthy and expensive process. Using computational techniques to narrow the search can speed up the process by providing highly specific Gene Ontology (GO) terms. METHODOLOGY We propose an ensemble approach that combines three generic base predictors that predict Gene Ontology (BP, CC and MF) terms from sequences across different species. We train our models on UniProtGOA annotation data and use the CATH domain resources to identify the protein families. We then calculate a score based on the prevalence of individual GO terms in the functional families that is then used as an indicator of confidence when assigning the GO term to an uncharacterised protein. METHODS In the ensemble, we use a statistics-based method that scores the occurrence of GO terms in a CATH FunFam against a background set of proteins annotated by the same GO term. We also developed a set-based method that uses Set Intersection and Set Union to score the occurrence of GO terms within the same CATH FunFam. Finally, we also use FunFams-Plus, a predictor method developed by the Orengo Group at UCL to predict GO terms for uncharacterised proteins in the CAFA3 challenge. EVALUATION We evaluated the methods against the CAFA3 benchmark and DomFun. We used the Precision, Recall and Fmax metrics and the benchmark datasets that are used in CAFA3 to evaluate our models and compare them to the CAFA3 results. Our results show that FunPredCATH compares well with top CAFA methods in the different ontologies and benchmarks. CONTRIBUTIONS FunPredCATH compares well with other prediction methods on CAFA3, and the ensemble approach outperforms the base methods. We show that non-IEA models obtain higher Fmax scores than the IEA counterparts, while the models including IEA annotations have higher coverage at the expense of a lower Fmax score.
Collapse
Affiliation(s)
- Joseph Bonello
- Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom; Department of Computer Information Systems, University of Malta, Faculty of ICT, Msida, MSD 2080, Malta.
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, United Kingdom
| |
Collapse
|
15
|
Fu Y, Gu Z, Luo X, Guo Q, Lai L, Deng M. Learning a generalized graph transformer for protein function prediction in dissimilar sequences. Gigascience 2024; 13:giae093. [PMID: 39657158 PMCID: PMC11734293 DOI: 10.1093/gigascience/giae093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2024] [Revised: 07/04/2024] [Accepted: 10/25/2024] [Indexed: 12/17/2024] Open
Abstract
BACKGROUND In the face of a growing disparity between high-throughput sequence data and low-throughput experimental studies, the emerging field of deep learning stands as a promising alternative. Generally, many data-driven approaches are capable of facilitating fast and accurate predictions of protein functions. Nevertheless, the inherent statistical nature of deep learning techniques may limit their generalization capabilities when applied to novel nonhomologous proteins that diverge significantly from existing ones. RESULTS In this work, we herein propose a novel, generalized approach named Graph Adversarial Learning with Alignment (GALA) for protein function prediction. Our GALA method integrates a graph transformer architecture with an attention pooling module to extract embeddings from both protein sequences and structures, facilitating unified learning of protein representations. Particularly noteworthy, GALA incorporates a domain discriminator conditioned on both learnable representations and predicted probabilities, which undergoes adversarial learning to ensure representation invariance across diverse environments. To optimize the model with abundant label information, we generate label embeddings in the hidden space, explicitly aligning them with protein representations. Benchmarked on datasets derived from the PDB database and Swiss-Prot database, our GALA achieves considerable performance comparable to several state-of-the-art methods. Even more, GALA demonstrates wonderful biological interpretability by identifying significant functional residues associated with Gene Ontology terms through class activation mapping. CONCLUSIONS GALA, which leverages adversarial learning and label embedding alignment to acquire domain-invariant protein representations, exhibits outstanding generalizability in function prediction for proteins from previously unseen sequence space. By incorporating the structures predicted by AlphaFold2, GALA demonstrates significant potential for function annotation in newly discovered sequences. A detailed implementation of our GALA is available at https://github.com/fuyw-aisw/GALA.
Collapse
Affiliation(s)
- Yiwei Fu
- School of Mathematical Sciences, Peking University, Beijing 100871, China
| | - Zhonghui Gu
- Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China
| | - Xiao Luo
- Department of Computer Science, University of California, Los Angeles, CA 90024, USA
| | - Qirui Guo
- Center for Quantitative Biology, Peking University, Beijing 100871, China
| | - Luhua Lai
- Peking-Tsinghua Center for Life Sciences, Peking University, Beijing 100871, China
- Center for Quantitative Biology, Peking University, Beijing 100871, China
| | - Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing 100871, China
- Center for Quantitative Biology, Peking University, Beijing 100871, China
- Center for Statistical Science, Peking University, Beijing 100871, China
| |
Collapse
|
16
|
Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023; 28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
In recent years, the widespread application of artificial intelligence algorithms in protein structure, function prediction, and de novo protein design has significantly accelerated the process of intelligent protein design and led to many noteworthy achievements. This advancement in protein intelligent design holds great potential to accelerate the development of new drugs, enhance the efficiency of biocatalysts, and even create entirely new biomaterials. Protein characterization is the key to the performance of intelligent protein design. However, there is no consensus on the most suitable characterization method for intelligent protein design tasks. This review describes the methods, characteristics, and representative applications of traditional descriptors, sequence-based and structure-based protein characterization. It discusses their advantages, disadvantages, and scope of application. It is hoped that this could help researchers to better understand the limitations and application scenarios of these methods, and provide valuable references for choosing appropriate protein characterization techniques for related research in the field, so as to better carry out protein research.
Collapse
Affiliation(s)
| | | | | | - Junjie Ding
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Liangliang Wang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Hui Jiang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| |
Collapse
|
17
|
Rodríguez-López M, Bordin N, Lees J, Scholes H, Hassan S, Saintain Q, Kamrad S, Orengo C, Bähler J. Broad functional profiling of fission yeast proteins using phenomics and machine learning. eLife 2023; 12:RP88229. [PMID: 37787768 PMCID: PMC10547477 DOI: 10.7554/elife.88229] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/04/2023] Open
Abstract
Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe for broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of 'priority unstudied' proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through 'guilt by association' with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.
Collapse
Affiliation(s)
- María Rodríguez-López
- University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & EnvironmentLondonUnited Kingdom
| | - Nicola Bordin
- University College London, Institute of Structural and Molecular BiologyLondonUnited Kingdom
| | - Jon Lees
- University College London, Institute of Structural and Molecular BiologyLondonUnited Kingdom
- University of BristolBristolUnited Kingdom
| | - Harry Scholes
- University College London, Institute of Structural and Molecular BiologyLondonUnited Kingdom
| | - Shaimaa Hassan
- University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & EnvironmentLondonUnited Kingdom
- Helwan University, Faculty of PharmacyCairoEgypt
| | - Quentin Saintain
- University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & EnvironmentLondonUnited Kingdom
| | - Stephan Kamrad
- University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & EnvironmentLondonUnited Kingdom
| | - Christine Orengo
- University College London, Institute of Structural and Molecular BiologyLondonUnited Kingdom
| | - Jürg Bähler
- University College London, Institute of Healthy Ageing and Department of Genetics, Evolution & EnvironmentLondonUnited Kingdom
| |
Collapse
|
18
|
Sieg J, Rarey M. Searching similar local 3D micro-environments in protein structure databases with MicroMiner. Brief Bioinform 2023; 24:bbad357. [PMID: 37833838 DOI: 10.1093/bib/bbad357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 08/28/2023] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
The available protein structure data are rapidly increasing. Within these structures, numerous local structural sites depict the details characterizing structure and function. However, searching and analyzing these sites extensively and at scale poses a challenge. We present a new method to search local sites in protein structure databases using residue-defined local 3D micro-environments. We implemented the method in a new tool called MicroMiner and demonstrate the capabilities of residue micro-environment search on the example of structural mutation analysis. Usually, experimental structures for both the wild-type and the mutant are unavailable for comparison. With MicroMiner, we extracted $>255 \times 10^{6}$ amino acid pairs in protein structures from the PDB, exemplifying single mutations' local structural changes for single chains and $>45 \times 10^{6}$ pairs for protein-protein interfaces. We further annotate existing data sets of experimentally measured mutation effects, like $\Delta \Delta G$ measurements, with the extracted structure pairs to combine the mutation effect measurement with the structural change upon mutation. In addition, we show how MicroMiner can bridge the gap between mutation analysis and structure-based drug design tools. MicroMiner is available as a command line tool and interactively on the https://proteins.plus/ webserver.
Collapse
Affiliation(s)
- Jochen Sieg
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| | - Matthias Rarey
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
| |
Collapse
|
19
|
Li L, Zhou L, Jiang C, Liu Z, Meng D, Luo F, He Q, Yin H. AI-driven pan-proteome analyses reveal insights into the biohydrometallurgical properties of Acidithiobacillia. Front Microbiol 2023; 14:1243987. [PMID: 37744906 PMCID: PMC10512742 DOI: 10.3389/fmicb.2023.1243987] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Accepted: 08/21/2023] [Indexed: 09/26/2023] Open
Abstract
Microorganism-mediated biohydrometallurgy, a sustainable approach for metal recovery from ores, relies on the metabolic activity of acidophilic bacteria. Acidithiobacillia with sulfur/iron-oxidizing capacities are extensively studied and applied in biohydrometallurgy-related processes. However, only 14 distinct proteins from Acidithiobacillia have experimentally determined structures currently available. This significantly hampers in-depth investigations of Acidithiobacillia's structure-based biological mechanisms pertaining to its relevant biohydrometallurgical processes. To address this issue, we employed a state-of-the-art artificial intelligence (AI)-driven approach, with a median model confidence of 0.80, to perform high-quality full-chain structure predictions on the pan-proteome (10,458 proteins) of the type strain Acidithiobacillia. Additionally, we conducted various case studies on de novo protein structural prediction, including sulfate transporter and iron oxidase, to demonstrate how accurate structure predictions and gene co-occurrence networks can contribute to the development of mechanistic insights and hypotheses regarding sulfur and iron utilization proteins. Furthermore, for the unannotated proteins that constitute 35.8% of the Acidithiobacillia proteome, we employed the deep-learning algorithm DeepFRI to make structure-based functional predictions. As a result, we successfully obtained gene ontology (GO) terms for 93.6% of these previously unknown proteins. This study has a significant impact on improving protein structure and function predictions, as well as developing state-of-the-art techniques for high-throughput analysis of large proteomic data.
Collapse
Affiliation(s)
- Liangzhi Li
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
| | - Lei Zhou
- Beijing Research Institute of Chemical Engineering and Metallurgy, Beijing, China
| | - Chengying Jiang
- State Key Laboratory of Microbial Resources, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zhenghua Liu
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
| | - Delong Meng
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
| | - Feng Luo
- School of Computing, Clemson University, Clemson, SC, United States
| | - Qiang He
- Department of Civil and Environmental Engineering, University of Tennessee, Knoxville, Knoxville, TN, United States
| | - Huaqun Yin
- School of Minerals Processing and Bioengineering, Central South University, Changsha, China
- Key Laboratory of Biometallurgy of Ministry of Education, Central South University, Changsha, China
| |
Collapse
|
20
|
Pan T, Li C, Bi Y, Wang Z, Gasser RB, Purcell AW, Akutsu T, Webb GI, Imoto S, Song J. PFresGO: an attention mechanism-based deep-learning approach for protein annotation by integrating gene ontology inter-relationships. Bioinformatics 2023; 39:7043095. [PMID: 36794913 PMCID: PMC9978587 DOI: 10.1093/bioinformatics/btad094] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2022] [Revised: 02/10/2023] [Accepted: 02/15/2023] [Indexed: 02/17/2023] Open
Abstract
MOTIVATION The rapid accumulation of high-throughput sequence data demands the development of effective and efficient data-driven computational methods to functionally annotate proteins. However, most current approaches used for functional annotation simply focus on the use of protein-level information but ignore inter-relationships among annotations. RESULTS Here, we established PFresGO, an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins. PFresGO employs a self-attention operation to capture the inter-relationships of GO terms, updates its embedding accordingly and uses a cross-attention operation to project protein representations and GO embedding into a common latent space to identify global protein sequence patterns and local functional residues. We demonstrate that PFresGO consistently achieves superior performance across GO categories when compared with 'state-of-the-art' methods. Importantly, we show that PFresGO can identify functionally important residues in protein sequences by assessing the distribution of attention weightings. PFresGO should serve as an effective tool for the accurate functional annotation of proteins and functional domains within proteins. AVAILABILITY AND IMPLEMENTATION PFresGO is available for academic purposes at https://github.com/BioColLab/PFresGO. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Tong Pan
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Chen Li
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Yue Bi
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Zhikang Wang
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Robin B Gasser
- Department of Veterinary Biosciences, Melbourne Veterinary School, The University of Melbourne, Parkville, VIC 3010, Australia
| | - Anthony W Purcell
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan
| | - Geoffrey I Webb
- Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| | - Seiya Imoto
- Division of Health Medical Intelligence, Human Genome Center, Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo 108-8639, Japan.,Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo-ku, Tokyo 113-8657, Japan
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Biomedicine Discovery Institute, Monash University, Melbourne, VIC 3800, Australia.,Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji 611-0011, Japan.,Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
21
|
AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun Biol 2023; 6:160. [PMID: 36755055 PMCID: PMC9908985 DOI: 10.1038/s42003-023-04488-9] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2022] [Accepted: 01/16/2023] [Indexed: 02/10/2023] Open
Abstract
Deep-learning (DL) methods like DeepMind's AlphaFold2 (AF2) have led to substantial improvements in protein structure prediction. We analyse confident AF2 models from 21 model organisms using a new classification protocol (CATH-Assign) which exploits novel DL methods for structural comparison and classification. Of ~370,000 confident models, 92% can be assigned to 3253 superfamilies in our CATH domain superfamily classification. The remaining cluster into 2367 putative novel superfamilies. Detailed manual analysis on 618 of these, having at least one human relative, reveal extremely remote homologies and further unusual features. Only 25 novel superfamilies could be confirmed. Although most models map to existing superfamilies, AF2 domains expand CATH by 67% and increases the number of unique 'global' folds by 36% and will provide valuable insights on structure function relationships. CATH-Assign will harness the huge expansion in structural data provided by DeepMind to rationalise evolutionary changes driving functional divergence.
Collapse
|
22
|
Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, Orengo CA. KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units. Biomolecules 2023; 13:277. [PMID: 36830646 PMCID: PMC9953599 DOI: 10.3390/biom13020277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2022] [Revised: 01/24/2023] [Accepted: 01/26/2023] [Indexed: 02/05/2023] Open
Abstract
Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 are provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.
Collapse
Affiliation(s)
- Tolulope Adeyelu
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Department of Comparative Biomedical Science, Louisiana State University, Baton Rouge, LA 70803, USA
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Vaishali P. Waman
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Marta Sadlej
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Aurelio A. Moya-Garcia
- Departamento de Biología Molecular y Bioquímica, Universidad de Málaga, 29071 Málaga, Spain
- Laboratorio de Biología Molecular del Cáncer, Centro de Investigaciones Médico-Sanitarias (CIMES), Universidad de Málaga, 29071 Málaga, Spain
| | - Christine A. Orengo
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| |
Collapse
|
23
|
Rai KK, Singh S, Rai R, Rai LC. Functional characterization of two WD40 family proteins, Alr0671 and All2352, from Anabaena PCC 7120 and deciphering their role in abiotic stress management. PLANT MOLECULAR BIOLOGY 2022; 110:545-563. [PMID: 35997919 DOI: 10.1007/s11103-022-01306-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/17/2022] [Accepted: 08/01/2022] [Indexed: 06/15/2023]
Abstract
WD40 domain-containing proteins are one of the eukaryotes' most ancient and ubiquitous protein families. Little is known about the presence and function of these proteins in cyanobacteria in general and Anabaena in particular. In silico analysis confirmed the presence of WD40 repeats. Gene expression analysis indicated that the transcript levels of both the target proteins were up-regulated up to 4 fold in Cd and drought and 2-3 fold in heat, salt, and UV-B stress. Using a fluorescent oxidative stress indicator, we showed that the recombinant proteins were scavenging reactive oxygen species (ROS) (4-5 fold) more efficiently than empty vectors. Chromatin immunoprecipitation analysis (ChIP) and electrophoretic mobility shift assay (EMSA) revealed that the target proteins function as transcription factors after binding to the promoter sequences. The presence of kinase activity (2-4 fold) in the selected proteins indicated that these proteins could modulate the functions of other cellular proteins under stress conditions by inducing phosphorylation of specific amino acids. The chosen proteins also demonstrated interaction with Zn, Cd, and Cu (1.4-2.5 fold), which might stabilize the proteins' structure and biophysical functions under multiple abiotic stresses. The functionally characterized Alr0671 and All2352 proteins act as transcription factors and offer tolerance to agriculturally relevant abiotic stresses.
Collapse
Affiliation(s)
- Krishna Kumar Rai
- Molecular Biology Section, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, 221005, Varanasi, India
| | - Shilpi Singh
- Molecular Biology Section, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, 221005, Varanasi, India
| | - Ruchi Rai
- Molecular Biology Section, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, 221005, Varanasi, India
| | - L C Rai
- Molecular Biology Section, Centre of Advanced Study in Botany, Institute of Science, Banaras Hindu University, 221005, Varanasi, India.
| |
Collapse
|
24
|
Zhu YH, Zhang C, Yu DJ, Zhang Y. Integrating unsupervised language model with triplet neural networks for protein gene ontology prediction. PLoS Comput Biol 2022; 18:e1010793. [PMID: 36548439 PMCID: PMC9822105 DOI: 10.1371/journal.pcbi.1010793] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2022] [Revised: 01/06/2023] [Accepted: 12/05/2022] [Indexed: 12/24/2022] Open
Abstract
Accurate identification of protein function is critical to elucidate life mechanisms and design new drugs. We proposed a novel deep-learning method, ATGO, to predict Gene Ontology (GO) attributes of proteins through a triplet neural-network architecture embedded with pre-trained language models from protein sequences. The method was systematically tested on 1068 non-redundant benchmarking proteins and 3328 targets from the third Critical Assessment of Protein Function Annotation (CAFA) challenge. Experimental results showed that ATGO achieved a significant increase of the GO prediction accuracy compared to the state-of-the-art approaches in all aspects of molecular function, biological process, and cellular component. Detailed data analyses showed that the major advantage of ATGO lies in the utilization of pre-trained transformer language models which can extract discriminative functional pattern from the feature embeddings. Meanwhile, the proposed triplet network helps enhance the association of functional similarity with feature similarity in the sequence embedding space. In addition, it was found that the combination of the network scores with the complementary homology-based inferences could further improve the accuracy of the predicted models. These results demonstrated a new avenue for high-accuracy deep-learning function prediction that is applicable to large-scale protein function annotations from sequence alone.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, People’s Republic of China
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, People’s Republic of China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, United States of America
| |
Collapse
|
25
|
Rai N, Rai KK, Singh MK, Singh J, Kaushik P. Investigating NAC Transcription Factor Role in Redox Homeostasis in Solanum lycopersicum L.: Bioinformatics, Physiological and Expression Analysis under Drought Stress. PLANTS (BASEL, SWITZERLAND) 2022; 11:2930. [PMID: 36365384 PMCID: PMC9654907 DOI: 10.3390/plants11212930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/17/2022] [Revised: 10/24/2022] [Accepted: 10/25/2022] [Indexed: 06/16/2023]
Abstract
NAC transcription factors regulate stress-defence pathways and developmental processes in crop plants. However, their detailed functional characterization in tomatoes needs to be investigated comprehensively. In the present study, tomato hybrids subjected to 60 and 80 days of drought stress conditions showed a significant increase in membrane damage and reduced relative water, chlorophyll and proline content. However, hybrids viz., VRTH-16-3 and VRTH-17-68 showed superior growth under drought stress, as they were marked with low electrolytic leakage, enhanced relative water content, proline content and an enhanced activity of enzymatic antioxidants, along with the upregulation of NAC and other stress-defence pathway genes. Candidate gene(s) exhibiting maximum expression in all the hybrids under drought stress were subjected to detailed in silico characterization to provide significant insight into its structural and functional classification. The homology modelling and superimposition analysis of predicted tomato NAC protein showed that similar amino acid residues were involved in forming the conserved WKAT domain. DNA docking discovered that the SlNAC1 protein becomes activated and exerts a stress-defence response after the possible interaction of conserved DNA elements using Pro72, Asn73, Trp81, Lys82, Ala83, Thr84, Gly85, Thr86 and Asp87 residues. A protein-protein interaction analysis identified ten functional partners involved in the induction of stress-defence tolerance.
Collapse
Affiliation(s)
- Nagendra Rai
- Indian Institute of Vegetable Research (IIVR), Varanasi 221305, UP, India
| | - Krishna Kumar Rai
- Indian Institute of Vegetable Research (IIVR), Varanasi 221305, UP, India
- Department of Botany, Institute of Science, Banaras Hindu University, Varanasi 221005, UP, India
| | - Manish Kumar Singh
- Indian Institute of Vegetable Research (IIVR), Varanasi 221305, UP, India
| | - Jagdish Singh
- Indian Institute of Vegetable Research (IIVR), Varanasi 221305, UP, India
| | - Prashant Kaushik
- Instituto de Conservación y Mejora de la Agrodiversidad Valenciana, Universitat Politècnica de València, 46022 Valencia, Spain
| |
Collapse
|
26
|
Berger J, Legendre F, Zelosko KM, Harrison MC, Grandcolas P, Bornberg-Bauer E, Fouks B. Eusocial Transition in Blattodea: Transposable Elements and Shifts of Gene Expression. Genes (Basel) 2022; 13:1948. [PMID: 36360186 PMCID: PMC9689775 DOI: 10.3390/genes13111948] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2022] [Revised: 10/19/2022] [Accepted: 10/20/2022] [Indexed: 11/07/2023] Open
Abstract
(1) Unravelling the molecular basis underlying major evolutionary transitions can shed light on how complex phenotypes arise. The evolution of eusociality, a major evolutionary transition, has been demonstrated to be accompanied by enhanced gene regulation. Numerous pieces of evidence suggest the major impact of transposon insertion on gene regulation and its role in adaptive evolution. Transposons have been shown to be play a role in gene duplication involved in the eusocial transition in termites. However, evidence of the molecular basis underlying the eusocial transition in Blattodea remains scarce. Could transposons have facilitated the eusocial transition in termites through shifts of gene expression? (2) Using available cockroach and termite genomes and transcriptomes, we investigated if transposons insert more frequently in genes with differential expression in queens and workers and if those genes could be linked to specific functions essential for eusocial transition. (3) The insertion rate of transposons differs among differentially expressed genes and displays opposite trends between termites and cockroaches. The functions of termite transposon-rich queen- and worker-biased genes are related to reproduction and ageing and behaviour and gene expression, respectively. (4) Our study provides further evidence on the role of transposons in the evolution of eusociality, potentially through shifts in gene expression.
Collapse
Affiliation(s)
- Juliette Berger
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Frédéric Legendre
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Kevin-Markus Zelosko
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| | - Mark C. Harrison
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| | - Philippe Grandcolas
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d’Histoire Naturelle, CNRS, Sorbonne Université, EPHE, Université des Antilles, CP50, 57 rue Cuvier, 75005 Paris, France
| | - Erich Bornberg-Bauer
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
- Department of Protein Evolution, Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, 72076 Tübingen, Germany
| | - Bertrand Fouks
- Institute for Evolution and Biodiversity, Molecular Evolution and Bioinformatics, Westfälische Wilhelms-Universität, Hüfferstrasse 1, 48149 Münster, Germany
| |
Collapse
|
27
|
Zhu YH, Zhang C, Liu Y, Omenn GS, Freddolino PL, Yu DJ, Zhang Y. TripletGO: Integrating Transcript Expression Profiles with Protein Homology Inferences for Gene Function Prediction. GENOMICS, PROTEOMICS & BIOINFORMATICS 2022; 20:1013-1027. [PMID: 35568117 PMCID: PMC10025770 DOI: 10.1016/j.gpb.2022.03.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/20/2021] [Revised: 03/02/2022] [Accepted: 04/16/2022] [Indexed: 01/13/2023]
Abstract
Gene Ontology (GO) has been widely used to annotate functions of genes and gene products. Here, we proposed a new method, TripletGO, to deduce GO terms of protein-coding and non-coding genes, through the integration of four complementary pipelines built on transcript expression profile, genetic sequence alignment, protein sequence alignment, and naïve probability. TripletGO was tested on a large set of 5754 genes from 8 species (human, mouse, Arabidopsis, rat, fly, budding yeast, fission yeast, and nematoda) and 2433 proteins with available expression data from the third Critical Assessment of Protein Function Annotation challenge (CAFA3). Experimental results show that TripletGO achieves function annotation accuracy significantly beyond the current state-of-the-art approaches. Detailed analyses show that the major advantage of TripletGO lies in the coupling of a new triplet network-based profiling method with the feature space mapping technique, which can accurately recognize function patterns from transcript expression profiles. Meanwhile, the combination of multiple complementary models, especially those from transcript expression and protein-level alignments, improves the coverage and accuracy of the final GO annotation results. The standalone package and an online server of TripletGO are freely available at https://zhanggroup.org/TripletGO/.
Collapse
Affiliation(s)
- Yi-Heng Zhu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Gilbert S Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Departments of Internal Medicine and Human Genetics, and School of Public Health, University of Michigan, Ann Arbor, MI 48109, USA
| | - Peter L Freddolino
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China.
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA.
| |
Collapse
|
28
|
Mansoor M, Nauman M, Rehman HU, Omar M. Gene Ontology Capsule GAN: an improved architecture for protein function prediction. PeerJ Comput Sci 2022; 8:e1014. [PMID: 36092003 PMCID: PMC9454774 DOI: 10.7717/peerj-cs.1014] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Accepted: 05/31/2022] [Indexed: 06/15/2023]
Abstract
Proteins are the core of all functions pertaining to living things. They consist of an extended amino acid chain folding into a three-dimensional shape that dictates their behavior. Currently, convolutional neural networks (CNNs) have been pivotal in predicting protein functions based on protein sequences. While it is a technology crucial to the niche, the computation cost and translational invariance associated with CNN make it impossible to detect spatial hierarchies between complex and simpler objects. Therefore, this research utilizes capsule networks to capture spatial information as opposed to CNNs. Since capsule networks focus on hierarchical links, they have a lot of potential for solving structural biology challenges. In comparison to the standard CNNs, our results exhibit an improvement in accuracy. Gene Ontology Capsule GAN (GOCAPGAN) achieved an F1 score of 82.6%, a precision score of 90.4% and recall score of 76.1%.
Collapse
Affiliation(s)
- Musadaq Mansoor
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Mohammad Nauman
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Hafeez Ur Rehman
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| | - Maryam Omar
- National University of Computer and Emerging Sciences, Islamabad, Peshawar, KPK, Pakistan
| |
Collapse
|
29
|
de Crécy-lagard V, Amorin de Hegedus R, Arighi C, Babor J, Bateman A, Blaby I, Blaby-Haas C, Bridge AJ, Burley SK, Cleveland S, Colwell LJ, Conesa A, Dallago C, Danchin A, de Waard A, Deutschbauer A, Dias R, Ding Y, Fang G, Friedberg I, Gerlt J, Goldford J, Gorelik M, Gyori BM, Henry C, Hutinet G, Jaroch M, Karp PD, Kondratova L, Lu Z, Marchler-Bauer A, Martin MJ, McWhite C, Moghe GD, Monaghan P, Morgat A, Mungall CJ, Natale DA, Nelson WC, O’Donoghue S, Orengo C, O’Toole KH, Radivojac P, Reed C, Roberts RJ, Rodionov D, Rodionova IA, Rudolf JD, Saleh L, Sheynkman G, Thibaud-Nissen F, Thomas PD, Uetz P, Vallenet D, Carter EW, Weigele PR, Wood V, Wood-Charlson EM, Xu J. A roadmap for the functional annotation of protein families: a community perspective. Database (Oxford) 2022; 2022:baac062. [PMID: 35961013 PMCID: PMC9374478 DOI: 10.1093/database/baac062] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 06/28/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022]
Abstract
Over the last 25 years, biology has entered the genomic era and is becoming a science of 'big data'. Most interpretations of genomic analyses rely on accurate functional annotations of the proteins encoded by more than 500 000 genomes sequenced to date. By different estimates, only half the predicted sequenced proteins carry an accurate functional annotation, and this percentage varies drastically between different organismal lineages. Such a large gap in knowledge hampers all aspects of biological enterprise and, thereby, is standing in the way of genomic biology reaching its full potential. A brainstorming meeting to address this issue funded by the National Science Foundation was held during 3-4 February 2022. Bringing together data scientists, biocurators, computational biologists and experimentalists within the same venue allowed for a comprehensive assessment of the current state of functional annotations of protein families. Further, major issues that were obstructing the field were identified and discussed, which ultimately allowed for the proposal of solutions on how to move forward.
Collapse
Affiliation(s)
- Valérie de Crécy-lagard
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | | | - Cecilia Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, DE 19713, USA
| | - Jill Babor
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Ian Blaby
- US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Crysten Blaby-Haas
- Biology Department, Brookhaven National Laboratory, Upton, NY 11973, USA
| | - Alan J Bridge
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
| | - Stephen K Burley
- RCSB Protein Data Bank, Institute for Quantitative Biomedicine, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
| | - Stacey Cleveland
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Lucy J Colwell
- Departmenf of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
| | - Ana Conesa
- Spanish National Research Council, Institute for Integrative Systems Biology, Paterna, Valencia 46980, Spain
| | - Christian Dallago
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology, i12, Boltzmannstr. 3, Garching/Munich 85748, Germany
| | - Antoine Danchin
- School of Biomedical Sciences, Li KaShing Faculty of Medicine, The University of Hong Kong, 21 Sassoon Road, Pokfulam, SAR Hong Kong 999077, China
| | - Anita de Waard
- Research Collaboration Unit, Elsevier, Jericho, VT 05465, USA
| | - Adam Deutschbauer
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Raquel Dias
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Yousong Ding
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, University of Florida, Gainesville, FL 32610, USA
| | - Gang Fang
- NYU-Shanghai, Shanghai 200120, China
| | - Iddo Friedberg
- Department of Veterinary Microbiology and Preventive Medicine, Iowa State University, Ames, IA 50011, USA
| | - John Gerlt
- Institute for Genomic Biology and Departments of Biochemistry and Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Joshua Goldford
- Physics of Living Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
| | - Mark Gorelik
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
| | - Christopher Henry
- Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439, USA
| | - Geoffrey Hutinet
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Marshall Jaroch
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | - Peter D Karp
- Bioinformatics Research Group, SRI International, Menlo Park, CA 94025, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Aron Marchler-Bauer
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Maria-Jesus Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
| | - Claire McWhite
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
| | - Gaurav D Moghe
- Plant Biology Section, School of Integrative Plant Science, Cornell University, Ithaca, NY 14853, USA
| | - Paul Monaghan
- Department of Agricultural Education and Communication, University of Florida, Gainesville, FL 32611, USA
| | - Anne Morgat
- Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva 4 CH-1211, Switzerland
| | - Christopher J Mungall
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Darren A Natale
- Georgetown University Medical Center, Washington, DC 20007, USA
| | - William C Nelson
- Biological Sciences Division, Pacific Northwest National Laboratories, Richland, WA 99354, USA
| | - Seán O’Donoghue
- School of Biotechnology and Biomolecular Sciences, University of NSW, Sydney, NSW 2052, Australia
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA
| | - Colbie Reed
- Department of Microbiology and Cell Sciences, University of Florida, Gainesville, FL 32611, USA
| | | | - Dmitri Rodionov
- Sanford Burnham Prebys Medical Discovery Institute, La Jolla, CA 92037, USA
| | - Irina A Rodionova
- Department of Bioengineering, Division of Engineering, University of California at San Diego, La Jolla, CA 92093-0412, USA
| | - Jeffrey D Rudolf
- Department of Chemistry, University of Florida, Gainesville, FL 32611, USA
| | - Lana Saleh
- New England Biolabs, Ipswich, MA 01938, USA
| | - Gloria Sheynkman
- Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA, USA
| | - Francoise Thibaud-Nissen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20817, USA
| | - Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90033, USA
| | - Peter Uetz
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - David Vallenet
- LABGeM, Génomique Métabolique, CEA, Genoscope, Institut François Jacob, Université d’Évry, Université Paris-Saclay, CNRS, Evry 91057, France
| | - Erica Watson Carter
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| | | | - Valerie Wood
- Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK
| | - Elisha M Wood-Charlson
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Jin Xu
- Department of Plant Pathology, University of Florida Citrus Research and Education Center, 700 Experiment Station Rd., Lake Alfred, FL 33850, USA
| |
Collapse
|
30
|
Sen N, Anishchenko I, Bordin N, Sillitoe I, Velankar S, Baker D, Orengo C. Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs. Brief Bioinform 2022; 23:bbac187. [PMID: 35641150 PMCID: PMC9294430 DOI: 10.1093/bib/bbac187] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2021] [Revised: 04/23/2022] [Accepted: 04/27/2022] [Indexed: 12/12/2022] Open
Abstract
Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the Root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein-protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provide better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
Collapse
Affiliation(s)
- Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| |
Collapse
|
31
|
Exploiting protein family and protein network data to identify novel drug targets for bladder cancer. Oncotarget 2022; 13:105-117. [PMID: 35035776 PMCID: PMC8758182 DOI: 10.18632/oncotarget.28175] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2021] [Accepted: 12/08/2021] [Indexed: 12/11/2022] Open
Abstract
Bladder cancer remains one of the most common forms of cancer and yet there are limited small molecule targeted therapies. Here, we present a computational platform to identify new potential targets for bladder cancer therapy. Our method initially exploited a set of known driver genes for bladder cancer combined with predicted bladder cancer genes from mutationally enriched protein domain families. We enriched this initial set of genes using protein network data to identify a comprehensive set of 323 putative bladder cancer targets. Pathway and cancer hallmarks analyses highlighted putative mechanisms in agreement with those previously reported for this cancer and revealed protein network modules highly enriched in potential drivers likely to be good targets for targeted therapies. 21 of our potential drug targets are targeted by FDA approved drugs for other diseases — some of them are known drivers or are already being targeted for bladder cancer (FGFR3, ERBB3, HDAC3, EGFR). A further 4 potential drug targets were identified by inheriting drug mappings across our in-house CATH domain functional families (FunFams). Our FunFam data also allowed us to identify drug targets in families that are less prone to side effects i.e., where structurally similar protein domain relatives are less dispersed across the human protein network. We provide information on our novel potential cancer driver genes, together with information on pathways, network modules and hallmarks associated with the predicted and known bladder cancer drivers and we highlight those drivers we predict to be likely drug targets.
Collapse
|
32
|
Rojano E, Jabato FM, Perkins JR, Córdoba-Caballero J, García-Criado F, Sillitoe I, Orengo C, Ranea JAG, Seoane-Zonjic P. Assigning protein function from domain-function associations using DomFun. BMC Bioinformatics 2022; 23:43. [PMID: 35033002 PMCID: PMC8761305 DOI: 10.1186/s12859-022-04565-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2020] [Accepted: 01/05/2022] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Protein function prediction remains a key challenge. Domain composition affects protein function. Here we present DomFun, a Ruby gem that uses associations between protein domains and functions, calculated using multiple indices based on tripartite network analysis. These domain-function associations are combined at the protein level, to generate protein-function predictions. RESULTS We analysed 16 tripartite networks connecting homologous superfamily and FunFam domains from CATH-Gene3D with functional annotations from the three Gene Ontology (GO) sub-ontologies, KEGG, and Reactome. We validated the results using the CAFA 3 benchmark platform for GO annotation, finding that out of the multiple association metrics and domain datasets tested, Simpson index for FunFam domain-function associations combined with Stouffer's method leads to the best performance in almost all scenarios. We also found that using FunFams led to better performance than superfamilies, and better results were found for GO molecular function compared to GO biological process terms. DomFun performed as well as the highest-performing method in certain CAFA 3 evaluation procedures in terms of [Formula: see text] and [Formula: see text] We also implemented our own benchmark procedure, Pathway Prediction Performance (PPP), which can be used to validate function prediction for additional annotations sources, such as KEGG and Reactome. Using PPP, we found similar results to those found with CAFA 3 for GO, moreover we found good performance for the other annotation sources. As with CAFA 3, Simpson index with Stouffer's method led to the top performance in almost all scenarios. CONCLUSIONS DomFun shows competitive performance with other methods evaluated in CAFA 3 when predicting proteins function with GO, although results vary depending on the evaluation procedure. Through our own benchmark procedure, PPP, we have shown it can also make accurate predictions for KEGG and Reactome. It performs best when using FunFams, combining Simpson index derived domain-function associations using Stouffer's method. The tool has been implemented so that it can be easily adapted to incorporate other protein features, such as domain data from other sources, amino acid k-mers and motifs. The DomFun Ruby gem is available from https://rubygems.org/gems/DomFun . Code maintained at https://github.com/ElenaRojano/DomFun . Validation procedure scripts can be found at https://github.com/ElenaRojano/DomFun_project .
Collapse
Affiliation(s)
- Elena Rojano
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
- Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010 Malaga, Spain
| | - Fernando M. Jabato
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
- Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010 Malaga, Spain
| | - James R. Perkins
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
- CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029 Madrid, Spain
- Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010 Malaga, Spain
| | - José Córdoba-Caballero
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
| | - Federico García-Criado
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
| | - Ian Sillitoe
- Department of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT UK
| | - Christine Orengo
- Department of Structural and Molecular Biology, University College London, Gower Street, London, WC1E 6BT UK
| | - Juan A. G. Ranea
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
- CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029 Madrid, Spain
- Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010 Malaga, Spain
| | - Pedro Seoane-Zonjic
- Department of Molecular Biology and Biochemistry, University of Malaga, Bulevar Louis Pasteur, 31, 29010 Malaga, Spain
- CIBER of Rare Diseases, Av. Monforte de Lemos, 3-5. Pabellon 11. Planta 0, 28029 Madrid, Spain
- Institute of Biomedical Research in Malaga (IBIMA), Dr. Miguel Díaz Recio, 28, 29010 Malaga, Spain
| |
Collapse
|
33
|
Alfaiz FA. Molecular studies of immunological enzyme clumping factor B for the inhibition of Staphylococcus aureus with essential oils of Nigella sativa. J Mol Recognit 2021; 34:e2941. [PMID: 34626016 DOI: 10.1002/jmr.2941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Revised: 09/14/2021] [Accepted: 09/26/2021] [Indexed: 11/09/2022]
Abstract
Essential oils from black cumin seeds (Nigella sativa) have largely been used in the manufacturing of nutraceuticals and functional food products due to the presence of a wide variety of bioactive compounds. However, their applications in the pharmaceutical sector have recently attracted interest and started blooming. The present research elucidates the in silico and in vitro efficacies of active leads from essential oil of N sativa against the human pathogenic bacterium Staphylococcus aureus. Biofilm development has become an inevitable situation in the health care sector. Lowering the efficacies of antimicrobial drugs is one of the vital ramifications that resulted in the emergence of multidrug resistance. Clumping factor B (clfB) of S aureus plays a key role in the human immune functions during pathogenesis. Through STRING analysis, the interacting protein partners of clfB were found to regulate biofilm pathway. Therefore, eight ligands from essential oil are docked with the critical clfB protein, which revealed p-cymene, thymoquinone and carvacrol as the robust ligands with highest binding affinity. Therefore, antibiofilm potential of N sativa essential oil at in vitro states was evaluated against S aureus. Further, real time PCR analysis showed that the expression of clfB and intercellular adhesion gene (icaA and icaD) was significantly altered upon treatment with essential oil. Altogether, the findings confirmed the antibiofilm efficacy of N sativa essential oil against S aureus. Hence, the essential oil from N. sativa was envisaged to be promising candidate to treat S aureus biofilm mediated infection.
Collapse
Affiliation(s)
- Faiz Abdulaziz Alfaiz
- Department of Biology, College of Science in Zulfi, Majmaah University, Majmaah, Saudi Arabia
| |
Collapse
|
34
|
Rauer C, Sen N, Waman VP, Abbasian M, Orengo CA. Computational approaches to predict protein functional families and functional sites. Curr Opin Struct Biol 2021; 70:108-122. [PMID: 34225010 DOI: 10.1016/j.sbi.2021.05.012] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2021] [Revised: 05/13/2021] [Accepted: 05/25/2021] [Indexed: 01/06/2023]
Abstract
Understanding the mechanisms of protein function is indispensable for many biological applications, such as protein engineering and drug design. However, experimental annotations are sparse, and therefore, theoretical strategies are needed to fill the gap. Here, we present the latest developments in building functional subclassifications of protein superfamilies and using evolutionary conservation to detect functional determinants, for example, catalytic-, binding- and specificity-determining residues important for delineating the functional families. We also briefly review other features exploited for functional site detection and new machine learning strategies for combining multiple features.
Collapse
Affiliation(s)
- Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
| |
Collapse
|
35
|
Foster CA, Silversmith RE, Immormino RM, Vass LR, Kennedy EN, Pazy Y, Collins EJ, Bourret RB. Role of Position K+4 in the Phosphorylation and Dephosphorylation Reaction Kinetics of the CheY Response Regulator. Biochemistry 2021; 60:2130-2151. [PMID: 34167303 DOI: 10.1021/acs.biochem.1c00246] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Two-component signaling is a primary method by which microorganisms interact with their environments. A kinase detects stimuli and modulates autophosphorylation activity. The signal propagates by phosphotransfer from the kinase to a response regulator, eliciting a response. Response regulators operate over a range of time scales, corresponding to their related biological processes. Response regulator active site chemistry is highly conserved, but certain variable residues can influence phosphorylation kinetics. An Ala-to-Pro substitution (K+4, residue 113) in the Escherichia coli response regulator CheY triggers a constitutively active phenotype; however, the A113P substitution is too far from the active site to directly affect phosphochemistry. To better understand the activating mechanism(s) of the substitution, we analyzed receiver domain sequences to characterize the evolutionary role of the K+4 position. Although most featured Pro, Leu, Ile, and Val residues, chemotaxis-related proteins exhibited atypical Ala, Gly, Asp, and Glu residues at K+4. Structural and in silico analyses revealed that CheY A113P adopted a partially active configuration. Biochemical data showed that A113P shifted CheY toward a more activated state, enhancing autophosphorylation. By characterizing CheY variants, we determined that this functionality was transmitted through a hydrophobic network bounded by the β5α5 loop and the α1 helix of CheY. This region also interacts with the phosphodonor CheAP1, suggesting that binding generates an activating perturbation similar to the A113P substitution. Atypical residues like Ala at the K+4 position likely serve two purposes. First, restricting autophosphorylation may minimize background noise generated by intracellular phosphodonors such as acetyl phosphate. Second, optimizing interactions with upstream partners may help prime the receiver domain for phosphorylation.
Collapse
Affiliation(s)
- Clay A Foster
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Ruth E Silversmith
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Robert M Immormino
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Luke R Vass
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Emily N Kennedy
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Yael Pazy
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Edward J Collins
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| | - Robert B Bourret
- Department of Microbiology and Immunology, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, United States
| |
Collapse
|
36
|
Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021; 12:3168. [PMID: 34039967 PMCID: PMC8155034 DOI: 10.1038/s41467-021-23303-9] [Citation(s) in RCA: 344] [Impact Index Per Article: 86.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2020] [Accepted: 04/22/2021] [Indexed: 02/04/2023] Open
Abstract
The rapid increase in the number of proteins in sequence databases and the diversity of their functions challenge computational approaches for automated function prediction. Here, we introduce DeepFRI, a Graph Convolutional Network for predicting protein functions by leveraging sequence features extracted from a protein language model and protein structures. It outperforms current leading methods and sequence-based Convolutional Neural Networks and scales to the size of current sequence repositories. Augmenting the training set of experimental structures with homology models allows us to significantly expand the number of predictable functions. DeepFRI has significant de-noising capability, with only a minor drop in performance when experimental structures are replaced by protein models. Class activation mapping allows function predictions at an unprecedented resolution, allowing site-specific annotations at the residue-level in an automated manner. We show the utility and high performance of our method by annotating structures from the PDB and SWISS-MODEL, making several new confident function predictions. DeepFRI is available as a webserver at https://beta.deepfri.flatironinstitute.org/ .
Collapse
|
37
|
Bordin N, Sillitoe I, Lees JG, Orengo C. Tracing Evolution Through Protein Structures: Nature Captured in a Few Thousand Folds. Front Mol Biosci 2021; 8:668184. [PMID: 34041266 PMCID: PMC8141709 DOI: 10.3389/fmolb.2021.668184] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 04/27/2021] [Indexed: 11/13/2022] Open
Abstract
This article is dedicated to the memory of Cyrus Chothia, who was a leading light in the world of protein structure evolution. His elegant analyses of protein families and their mechanisms of structural and functional evolution provided important evolutionary and biological insights and firmly established the value of structural perspectives. He was a mentor and supervisor to many other leading scientists who continued his quest to characterise structure and function space. He was also a generous and supportive colleague to those applying different approaches. In this article we review some of his accomplishments and the history of protein structure classifications, particularly SCOP and CATH. We also highlight some of the evolutionary insights these two classifications have brought. Finally, we discuss how the expansion and integration of protein sequence data into these structural families helps reveal the dark matter of function space and can inform the emergence of novel functions in Metazoa. Since we cover 25 years of structural classification, it has not been feasible to review all structure based evolutionary studies and hence we focus mainly on those undertaken by the SCOP and CATH groups and their collaborators.
Collapse
Affiliation(s)
- Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| | - Jonathan G Lees
- Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London, United Kingdom
| |
Collapse
|
38
|
Heavy metal pollution: Insights into chromium eco-toxicity and recent advancement in its remediation. ACTA ACUST UNITED AC 2021. [DOI: 10.1016/j.enmm.2020.100388] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
|
39
|
Lam SD, Babu MM, Lees J, Orengo CA. Biological impact of mutually exclusive exon switching. PLoS Comput Biol 2021; 17:e1008708. [PMID: 33651795 PMCID: PMC7954323 DOI: 10.1371/journal.pcbi.1008708] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2020] [Revised: 03/12/2021] [Accepted: 01/14/2021] [Indexed: 12/27/2022] Open
Abstract
Alternative splicing can expand the diversity of proteomes. Homologous mutually exclusive exons (MXEs) originate from the same ancestral exon and result in polypeptides with similar structural properties but altered sequence. Why would some genes switch homologous exons and what are their biological impact? Here, we analyse the extent of sequence, structural and functional variability in MXEs and report the first large scale, structure-based analysis of the biological impact of MXE events from different genomes. MXE-specific residues tend to map to single domains, are highly enriched in surface exposed residues and cluster at or near protein functional sites. Thus, MXE events are likely to maintain the protein fold, but alter specificity and selectivity of protein function. This comprehensive resource of MXE events and their annotations is available at: http://gene3d.biochem.ucl.ac.uk/mxemod/. These findings highlight how small, but significant changes at critical positions on a protein surface are exploited in evolution to alter function.
Collapse
Affiliation(s)
- Su Datt Lam
- Institute of Structural and Molecular Biology, University College London, Darwin Building, Gower Street, London, United Kingdom
- Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
- * E-mail: (SDL); (JL); (CO)
| | - M. Madan Babu
- MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, United Kingdom
- Department of Structural Biology and Center for Data Driven Discovery, St Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Jonathan Lees
- Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, United Kingdom
- * E-mail: (SDL); (JL); (CO)
| | - Christine A. Orengo
- Institute of Structural and Molecular Biology, University College London, Darwin Building, Gower Street, London, United Kingdom
- * E-mail: (SDL); (JL); (CO)
| |
Collapse
|
40
|
Bojkova D, McGreig JE, McLaughlin KM, Masterson SG, Antczak M, Widera M, Krähling V, Ciesek S, Wass MN, Michaelis M, Cinatl J. Differentially conserved amino acid positions may reflect differences in SARS-CoV-2 and SARS-CoV behaviour. Bioinformatics 2021; 37:2282-2288. [PMID: 33560365 PMCID: PMC7929367 DOI: 10.1093/bioinformatics/btab094] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Revised: 12/23/2020] [Accepted: 02/05/2021] [Indexed: 12/18/2022] Open
Abstract
Motivation SARS-CoV-2 is a novel coronavirus currently causing a pandemic. Here, we performed a combined in-silico and cell culture comparison of SARS-CoV-2 and the closely related SARS-CoV. Results Many amino acid positions are differentially conserved between SARS-CoV-2 and SARS-CoV, which reflects the discrepancies in virus behaviour, i.e. more effective human-to-human transmission of SARS-CoV-2 and higher mortality associated with SARS-CoV. Variations in the S protein (mediates virus entry) were associated with differences in its interaction with ACE2 (cellular S receptor) and sensitivity to TMPRSS2 (enables virus entry via S cleavage) inhibition. Anti-ACE2 antibodies more strongly inhibited SARS-CoV than SARS-CoV-2 infection, probably due to a stronger SARS-CoV-2 S-ACE2 affinity relative to SARS-CoV S. Moreover, SARS-CoV-2 and SARS-CoV displayed differences in cell tropism. Cellular ACE2 and TMPRSS2 levels did not indicate susceptibility to SARS-CoV-2. In conclusion, we identified genomic variation between SARS-CoV-2 and SARS-CoV that may reflect the differences in their clinical and biological behaviour. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Denisa Bojkova
- Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
| | - Jake E McGreig
- School of Biosciences, University of Kent, Canterbury, UK
| | | | | | | | - Marek Widera
- Institute of Virology, Biomedical Research Center (BMFZ), Philipps University Marburg, Germany
| | - Verena Krähling
- Institute of Virology, Biomedical Research Center (BMFZ), Philipps University Marburg, Germany
| | - Sandra Ciesek
- Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany.,German Center for Infection Research, DZIF, Braunschweig, Germany
| | - Mark N Wass
- School of Biosciences, University of Kent, Canterbury, UK
| | | | - Jindrich Cinatl
- Institute for Medical Virology, University Hospital, Goethe University Frankfurt am Main, Germany
| |
Collapse
|
41
|
Assessing Protein Function Through Structural Similarities with CATH. Methods Mol Biol 2021. [PMID: 32006277 DOI: 10.1007/978-1-0716-0270-6_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2023]
Abstract
The functional diversity of proteins is closely related to their differences in sequence and structure. Despite variations in functional sites, global structural similarity is a valuable source of information when assessing potential functional similarities between proteins. The CATH database contains a well-established hierarchical classification of more than 430,000 protein domain structures and nearly 95 million protein domain sequences, with integrated functional annotations for each represented family. The present chapter provides an overview of the main features of CATH with emphasis on exploiting structural similarities to obtain functional information for proteins.
Collapse
|
42
|
Sillitoe I, Bordin N, Dawson N, Waman VP, Ashford P, Scholes HM, Pang CSM, Woodridge L, Rauer C, Sen N, Abbasian M, Le Cornu S, Lam SD, Berka K, Varekova I, Svobodova R, Lees J, Orengo CA. CATH: increased structural coverage of functional space. Nucleic Acids Res 2021; 49:D266-D273. [PMID: 33237325 PMCID: PMC7778904 DOI: 10.1093/nar/gkaa1079] [Citation(s) in RCA: 263] [Impact Index Per Article: 65.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/22/2020] [Revised: 10/20/2020] [Accepted: 11/02/2020] [Indexed: 12/11/2022] Open
Abstract
CATH (https://www.cathdb.info) identifies domains in protein structures from wwPDB and classifies these into evolutionary superfamilies, thereby providing structural and functional annotations. There are two levels: CATH-B, a daily snapshot of the latest domain structures and superfamily assignments, and CATH+, with additional derived data, such as predicted sequence domains, and functionally coherent sequence subsets (Functional Families or FunFams). The latest CATH+ release, version 4.3, significantly increases coverage of structural and sequence data, with an addition of 65,351 fully-classified domains structures (+15%), providing 500 238 structural domains, and 151 million predicted sequence domains (+59%) assigned to 5481 superfamilies. The FunFam generation pipeline has been re-engineered to cope with the increased influx of data. Three times more sequences are captured in FunFams, with a concomitant increase in functional purity, information content and structural coverage. FunFam expansion increases the structural annotations provided for experimental GO terms (+59%). We also present CATH-FunVar web-pages displaying variations in protein sequences and their proximity to known or predicted functional sites. We present two case studies (1) putative cancer drivers and (2) SARS-CoV-2 proteins. Finally, we have improved links to and from CATH including SCOP, InterPro, Aquaria and 2DProt.
Collapse
Affiliation(s)
- Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Natalie Dawson
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Paul Ashford
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Harry M Scholes
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Camilla S M Pang
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Laurel Woodridge
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Clemens Rauer
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Neeladri Sen
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Mahnaz Abbasian
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Sean Le Cornu
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Su Datt Lam
- Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor 43600, Malaysia
| | - Karel Berka
- Regional Centre of Advanced Technologies and Materials, Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc 771 46, Czech Republic
| | - Ivana Hutařová Varekova
- National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
| | - Radka Svobodova
- Central European Institute of Technology, Masaryk University, Brno 625 00, Czech Republic| National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno 602 00, Czech Republic
| | - Jon Lees
- Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford OX3 0BP, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| |
Collapse
|
43
|
Liu YW, Hsu TW, Chang CY, Liao WH, Chang JM. GODoc: high-throughput protein function prediction using novel k-nearest-neighbor and voting algorithms. BMC Bioinformatics 2020; 21:276. [PMID: 33203348 PMCID: PMC7672824 DOI: 10.1186/s12859-020-03556-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2020] [Accepted: 05/25/2020] [Indexed: 12/01/2022] Open
Abstract
Background Biological data has grown explosively with the advance of next-generation sequencing. However, annotating protein function with wet lab experiments is time-consuming. Fortunately, computational function prediction can help wet labs formulate biological hypotheses and prioritize experiments. Gene Ontology (GO) is a framework for unifying the representation of protein function in a hierarchical tree composed of GO terms. Results We propose GODoc, a general protein GO prediction framework based on sequence information which combines feature engineering, feature reduction, and a novel k-nearest-neighbor algorithm to resolve the multiple GO prediction problem. Comprehensive evaluation on CAFA2 shows that GODoc performs better than two baseline models. In the CAFA3 competition (68 teams), GODoc ranks 10th in Cellular Component Ontology. Regarding the species-specific task, the proposed method ranks 10th and 8th in the eukaryotic Cellular Component Ontology and the prokaryotic Molecular Function Ontology, respectively. In the term-centric task, GODoc performs third and is tied for first for the biofilm formation of Pseudomonas aeruginosa and the long-term memory of Drosophila melanogaster, respectively. Conclusions We have developed a novel and effective strategy to incorporate a training procedure into the k-nearest neighbor algorithm (instance-based learning) which is capable of solving the Gene Ontology multiple-label prediction problem, which is especially notable given the thousands of Gene Ontology terms.
Collapse
Affiliation(s)
- Yi-Wei Liu
- Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan
| | - Tz-Wei Hsu
- Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan
| | - Che-Yu Chang
- Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan
| | - Wen-Hung Liao
- Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan.
| | - Jia-Ming Chang
- Department of Computer Science, National Chengchi University, 11605, Taipei, Taiwan.
| |
Collapse
|
44
|
Makrodimitris S, van Ham RCHJ, Reinders MJT. Automatic Gene Function Prediction in the 2020's. Genes (Basel) 2020; 11:E1264. [PMID: 33120976 PMCID: PMC7692357 DOI: 10.3390/genes11111264] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2020] [Revised: 10/19/2020] [Accepted: 10/21/2020] [Indexed: 02/06/2023] Open
Abstract
The current rate at which new DNA and protein sequences are being generated is too fast to experimentally discover the functions of those sequences, emphasizing the need for accurate Automatic Function Prediction (AFP) methods. AFP has been an active and growing research field for decades and has made considerable progress in that time. However, it is certainly not solved. In this paper, we describe challenges that the AFP field still has to overcome in the future to increase its applicability. The challenges we consider are how to: (1) include condition-specific functional annotation, (2) predict functions for non-model species, (3) include new informative data sources, (4) deal with the biases of Gene Ontology (GO) annotations, and (5) maximally exploit the GO to obtain performance gains. We also provide recommendations for addressing those challenges, by adapting (1) the way we represent proteins and genes, (2) the way we represent gene functions, and (3) the algorithms that perform the prediction from gene to function. Together, we show that AFP is still a vibrant research area that can benefit from continuing advances in machine learning with which AFP in the 2020s can again take a large step forward reinforcing the power of computational biology.
Collapse
Affiliation(s)
- Stavros Makrodimitris
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; (R.C.H.J.v.H.); (M.J.T.R.)
- Keygene N.V., 6708PW Wageningen, The Netherlands
| | - Roeland C. H. J. van Ham
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; (R.C.H.J.v.H.); (M.J.T.R.)
- Keygene N.V., 6708PW Wageningen, The Netherlands
| | - Marcel J. T. Reinders
- Delft Bioinformatics Lab, Delft University of Technology, 2628XE Delft, The Netherlands; (R.C.H.J.v.H.); (M.J.T.R.)
- Leiden Computational Biology Center, Leiden University Medical Center, 2333ZC Leiden, The Netherlands
| |
Collapse
|
45
|
Vishnoi S, Matre H, Garg P, Pandey SK. Artificial intelligence and machine learning for protein toxicity prediction using proteomics data. Chem Biol Drug Des 2020; 96:902-920. [DOI: 10.1111/cbdd.13701] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2019] [Revised: 04/23/2020] [Accepted: 04/26/2020] [Indexed: 12/13/2022]
Affiliation(s)
- Shubham Vishnoi
- Department of Physics, Bernal Institute University of Limerick Limerick Ireland
| | - Himani Matre
- Department of Biotechnology National Institute of Pharmaceutical Education and Research S.A.S. Nagar India
| | - Prabha Garg
- Department of Pharmacoinformatics National Institute of Pharmaceutical Education and Research Mohali India
| | - Shubham Kumar Pandey
- Department of Pharmacoinformatics National Institute of Pharmaceutical Education and Research Mohali India
| |
Collapse
|
46
|
Lam SD, Bordin N, Waman VP, Scholes HM, Ashford P, Sen N, van Dorp L, Rauer C, Dawson NL, Pang CSM, Abbasian M, Sillitoe I, Edwards SJL, Fraternali F, Lees JG, Santini JM, Orengo CA. SARS-CoV-2 spike protein predicted to form complexes with host receptor protein orthologues from a broad range of mammals. Sci Rep 2020; 10:16471. [PMID: 33020502 PMCID: PMC7536205 DOI: 10.1038/s41598-020-71936-5] [Citation(s) in RCA: 81] [Impact Index Per Article: 16.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 08/17/2020] [Indexed: 01/04/2023] Open
Abstract
SARS-CoV-2 has a zoonotic origin and was transmitted to humans via an undetermined intermediate host, leading to infections in humans and other mammals. To enter host cells, the viral spike protein (S-protein) binds to its receptor, ACE2, and is then processed by TMPRSS2. Whilst receptor binding contributes to the viral host range, S-protein:ACE2 complexes from other animals have not been investigated widely. To predict infection risks, we modelled S-protein:ACE2 complexes from 215 vertebrate species, calculated changes in the energy of the complex caused by mutations in each species, relative to human ACE2, and correlated these changes with COVID-19 infection data. We also analysed structural interactions to better understand the key residues contributing to affinity. We predict that mutations are more detrimental in ACE2 than TMPRSS2. Finally, we demonstrate phylogenetically that human SARS-CoV-2 strains have been isolated in animals. Our results suggest that SARS-CoV-2 can infect a broad range of mammals, but few fish, birds or reptiles. Susceptible animals could serve as reservoirs of the virus, necessitating careful ongoing animal management and surveillance.
Collapse
Affiliation(s)
- S D Lam
- Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - N Bordin
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - V P Waman
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - H M Scholes
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - P Ashford
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - N Sen
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Indian Institute of Science Education and Research, Pune, 411008, India
| | - L van Dorp
- UCL Genetics Institute, University College London, London, WC1E 6BT, UK
| | - C Rauer
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - N L Dawson
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - C S M Pang
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - M Abbasian
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - I Sillitoe
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - S J L Edwards
- Department of Science and Technology Studies, University College London, London, WC1E 6BT, UK
| | - F Fraternali
- Randall Division of Cell and Molecular Biophysics, Guy's Campus, New Hunt's House, King's College London, London, SE1 1UL, UK
| | - J G Lees
- Department of Biological and Medical Sciences, Faculty of Health and Life Sciences, Oxford Brookes University, Oxford, OX3 OBP, UK
| | - J M Santini
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
| | - C A Orengo
- Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
| |
Collapse
|
47
|
Koo DCE, Bonneau R. Towards region-specific propagation of protein functions. Bioinformatics 2020; 35:1737-1744. [PMID: 30304483 PMCID: PMC6513163 DOI: 10.1093/bioinformatics/bty834] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 08/23/2018] [Accepted: 10/08/2018] [Indexed: 01/06/2023] Open
Abstract
MOTIVATION Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features. RESULTS We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as a starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms. AVAILABILITY AND IMPLEMENTATION The code and features are freely available at: https://github.com/ek1203/rsfp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Da Chen Emily Koo
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA
| | - Richard Bonneau
- Department of Biology, Center for Genomics and Systems Biology, New York University, New York, NY, USA.,Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA.,Center for Data Science, New York University, New York, NY, USA
| |
Collapse
|
48
|
Ribeiro AJM, Das S, Dawson N, Zaru R, Orchard S, Thornton JM, Orengo C, Zeqiraj E, Murphy JM, Eyers PA. Emerging concepts in pseudoenzyme classification, evolution, and signaling. Sci Signal 2019; 12:eaat9797. [PMID: 31409758 DOI: 10.1126/scisignal.aat9797] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
The 21st century is witnessing an explosive surge in our understanding of pseudoenzyme-driven regulatory mechanisms in biology. Pseudoenzymes are proteins that have sequence homology with enzyme families but that are proven or predicted to lack enzyme activity due to mutations in otherwise conserved catalytic amino acids. The best-studied pseudoenzymes are pseudokinases, although examples from other families are emerging at a rapid rate as experimental approaches catch up with an avalanche of freely available informatics data. Kingdom-wide analysis in prokaryotes, archaea and eukaryotes reveals that between 5 and 10% of proteins that make up enzyme families are pseudoenzymes, with notable expansions and contractions seemingly associated with specific signaling niches. Pseudoenzymes can allosterically activate canonical enzymes, act as scaffolds to control assembly of signaling complexes and their localization, serve as molecular switches, or regulate signaling networks through substrate or enzyme sequestration. Molecular analysis of pseudoenzymes is rapidly advancing knowledge of how they perform noncatalytic functions and is enabling the discovery of unexpected, and previously unappreciated, functions of their intensively studied enzyme counterparts. Notably, upon further examination, some pseudoenzymes have previously unknown enzymatic activities that could not have been predicted a priori. Pseudoenzymes can be targeted and manipulated by small molecules and therefore represent new therapeutic targets (or anti-targets, where intervention should be avoided) in various diseases. In this review, which brings together broad bioinformatics and cell signaling approaches in the field, we highlight a selection of findings relevant to a contemporary understanding of pseudoenzyme-based biology.
Collapse
Affiliation(s)
- António J M Ribeiro
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sayoni Das
- Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
| | - Natalie Dawson
- Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
| | - Rossana Zaru
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Sandra Orchard
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Janet M Thornton
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Christine Orengo
- Structural and Molecular Biology, University College London, Gower Street, London WC1E 6BT, UK
| | - Elton Zeqiraj
- Astbury Centre for Structural Molecular Biology, Molecular and Cellular Biology, Faculty of Biological Sciences, Astbury Building, Room 8.109, University of Leeds, Leeds LS2 9JT, UK
| | - James M Murphy
- Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
- Department of Medical Biology, University of Melbourne, Parkville, Victoria 3052, Australia
| | - Patrick A Eyers
- Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, UK.
| |
Collapse
|
49
|
Mallick H, Franzosa EA, Mclver LJ, Banerjee S, Sirota-Madi A, Kostic AD, Clish CB, Vlamakis H, Xavier RJ, Huttenhower C. Predictive metabolomic profiling of microbial communities using amplicon or metagenomic sequences. Nat Commun 2019; 10:3136. [PMID: 31316056 PMCID: PMC6637180 DOI: 10.1038/s41467-019-10927-1] [Citation(s) in RCA: 161] [Impact Index Per Article: 26.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2018] [Accepted: 06/06/2019] [Indexed: 02/07/2023] Open
Abstract
Microbial community metabolomics, particularly in the human gut, are beginning to provide a new route to identify functions and ecology disrupted in disease. However, these data can be costly and difficult to obtain at scale, while amplicon or shotgun metagenomic sequencing data are readily available for populations of many thousands. Here, we describe a computational approach to predict potentially unobserved metabolites in new microbial communities, given a model trained on paired metabolomes and metagenomes from the environment of interest. Focusing on two independent human gut microbiome datasets, we demonstrate that our framework successfully recovers community metabolic trends for more than 50% of associated metabolites. Similar accuracy is maintained using amplicon profiles of coral-associated, murine gut, and human vaginal microbiomes. We also provide an expected performance score to guide application of the model in new samples. Our results thus demonstrate that this 'predictive metabolomic' approach can aid in experimental design and provide useful insights into the thousands of community profiles for which only metagenomes are currently available.
Collapse
Affiliation(s)
- Himel Mallick
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Eric A Franzosa
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Lauren J Mclver
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Soumya Banerjee
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Alexandra Sirota-Madi
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Aleksandar D Kostic
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA
| | - Clary B Clish
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Hera Vlamakis
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Ramnik J Xavier
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
- Center for Computational and Integrative Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA.
- Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital and Harvard Medical School, Boston, MA, 02114, USA.
- Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
| | - Curtis Huttenhower
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02115, USA.
| |
Collapse
|
50
|
Environmental conditions shape the nature of a minimal bacterial genome. Nat Commun 2019; 10:3100. [PMID: 31308405 PMCID: PMC6629657 DOI: 10.1038/s41467-019-10837-2] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2018] [Accepted: 06/04/2019] [Indexed: 12/16/2022] Open
Abstract
Of the 473 genes in the genome of the bacterium with the smallest genome generated to date, 149 genes have unknown function, emphasising a universal problem; less than 1% of proteins have experimentally determined annotations. Here, we combine the results from state-of-the-art in silico methods for functional annotation and assign functions to 66 of the 149 proteins. Proteins that are still not annotated lack orthologues, lack protein domains, and/ or are membrane proteins. Twenty-four likely transporter proteins are identified indicating the importance of nutrient uptake into and waste disposal out of the minimal bacterial cell in a nutrient-rich environment after removal of metabolic enzymes. Hence, the environment shapes the nature of a minimal genome. Our findings also show that the combination of multiple different state-of-the-art in silico methods for annotating proteins is able to predict functions, even for difficult to characterise proteins and identify crucial gaps for further development.
Collapse
|