1
Cui XC, Zheng Y, Liu Y, Yuchi Z, Yuan YJ. AI-driven de novo enzyme design: Strategies, applications, and future prospects. Biotechnol Adv 2025; 82:108603. [PMID: 40368118 DOI: 10.1016/j.biotechadv.2025.108603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2025] [Revised: 04/22/2025] [Accepted: 05/10/2025] [Indexed: 05/16/2025]
Abstract
Enzymes are indispensable for biological processes and diverse applications across industries. While top-down modification strategies, such as directed evolution, have achieved remarkable success in optimizing existing enzymes, bottom-up de novo enzyme design has emerged as a transformative approach for engineering novel enzymes with customized catalytic functions, independent of natural templates. Recent advancements in artificial intelligence (AI) and computational power have significantly accelerated this field, enabling breakthroughs in enzyme engineering. These technologies facilitate the rapid generation of enzyme structures and amino acid sequences optimized for specific functions, thereby enhancing design efficiency. They also support functional validation and activity optimization, improving the catalytic performance, stability, and robustness of de novo designed enzymes. This review highlights recent advancements in AI-driven de novo enzyme design, discusses strategies for validation and optimization, and examines the challenges and future prospects of integrating these technologies into enzyme development.
Affiliation(s)
- Xi-Chen Cui
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
- Yan Zheng
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
- Ye Liu
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China; School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, PR China
- Zhiguang Yuchi
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China; School of Pharmaceutical Science and Technology, Tianjin University, Tianjin 300072, PR China
- Ying-Jin Yuan
- State Key Laboratory of Synthetic Biology, Tianjin University, Tianjin 300072, PR China; Frontiers Science Center for Synthetic Biology (Ministry of Education), School of Synthetic Biology and Biomanufacturing, Tianjin University, Tianjin 300072, PR China
2
Schottlender G, Prieto JM, Clemente C, Schuster CD, Dumas V, Fernández Do Porto D, Martí MA. Bacterial cytochrome P450s: a bioinformatics odyssey of substrate discovery. Front Microbiol 2024; 15:1343029. [PMID: 38384262 PMCID: PMC10879549 DOI: 10.3389/fmicb.2024.1343029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2023] [Accepted: 01/23/2024] [Indexed: 02/23/2024] Open
Abstract
Bacterial P450 cytochromes (BacCYPs) are versatile heme-containing proteins responsible for oxidation reactions on a wide range of substrates, contributing to the production of valuable natural products with limitless biotechnological potential. While the sequencing of microbial genomes has provided a wealth of BacCYP sequences, functional characterization lags behind, hindering our understanding of their roles. This study employs a comprehensive approach to predict BacCYP substrate specificity, bridging the gap between sequence and function. We employed an integrated approach combining sequence and functional data analysis, genomic context exploration, 3D structural modeling with molecular docking, and phylogenetic clustering. The research begins with an in-depth analysis of BacCYP sequence diversity and structural characteristics, revealing conserved motifs and recurrent residues in the active site. Phylogenetic analysis identifies distinct groups within the BacCYP family based on sequence similarity. However, our study reveals that sequence alone does not consistently predict substrate specificity, necessitating additional perspectives. The study delves into the genetic context of BacCYPs, utilizing neighboring gene information to infer potential substrates, a method proven very effective in many cases. Molecular docking is employed to assess BacCYP-substrate interactions, confirming potential substrates and providing insights into selectivity. Finally, a comprehensive strategy is proposed for predicting BacCYP substrates, involving all the evaluated approaches. The effectiveness of this strategy is demonstrated with two case studies, highlighting its potential for substrate discovery.
Affiliation(s)
- Gustavo Schottlender
- Facultad de Ciencias Exactas y Naturales, Instituto de Cálculo, Universidad de Buenos Aires, CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- Juan Manuel Prieto
- Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN) CONICET, Buenos Aires, Argentina
- Camila Clemente
- Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN) CONICET, Buenos Aires, Argentina
- Claudio David Schuster
- Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN) CONICET, Buenos Aires, Argentina
- Victoria Dumas
- Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires (FCEyN-UBA), Buenos Aires, Argentina
- Darío Fernández Do Porto
- Facultad de Ciencias Exactas y Naturales, Instituto de Cálculo, Universidad de Buenos Aires, CONICET, Universidad de Buenos Aires, Buenos Aires, Argentina
- Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires (FCEyN-UBA), Buenos Aires, Argentina
- Marcelo Adrian Martí
- Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales (IQUIBICEN) CONICET, Buenos Aires, Argentina
- Departamento de Química Biológica, Facultad de Ciencias Exactas y Naturales, Universidad de Buenos Aires (FCEyN-UBA), Buenos Aires, Argentina
3
Russo ET, Barone F, Bateman A, Cozzini S, Punta M, Laio A. DPCfam: Unsupervised protein family classification by Density Peak Clustering of large sequence datasets. PLoS Comput Biol 2022; 18:e1010610. [PMID: 36260616 PMCID: PMC9621593 DOI: 10.1371/journal.pcbi.1010610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2022] [Revised: 10/31/2022] [Accepted: 09/26/2022] [Indexed: 11/07/2022] Open
Abstract
Proteins that are known only at a sequence level outnumber those with an experimental characterization by orders of magnitude. Classifying protein regions (domains) into homologous families can generate testable functional hypotheses for yet unannotated sequences. Existing domain family resources typically use at least some degree of manual curation: they grow slowly over time and leave a large fraction of the protein sequence space unclassified. We here describe automatic clustering by Density Peak Clustering of UniRef50 v. 2017_07, a protein sequence database including approximately 23M sequences. We performed a radical re-implementation of a pipeline we previously developed in order to allow handling millions of sequences and data volumes of the order of 3 TeraBytes. The modified pipeline, which we call DPCfam, finds ∼ 45,000 protein clusters in UniRef50. Our automatic classification is in close correspondence to the ones of the Pfam and ECOD resources: in particular, about 81% of medium-large Pfam families and 72% of ECOD families can be mapped to clusters generated by DPCfam. In addition, our protocol finds more than 14,000 clusters constituted of protein regions with no Pfam annotation, which are therefore candidates for representing novel protein families. These results are made available to the scientific community through a dedicated repository.
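The Density Peak Clustering idea named in this abstract can be illustrated with a generic toy sketch. The code below is not the DPCfam pipeline, which works on alignment-derived protein regions at UniRef50 scale; it only shows the two quantities the general algorithm ranks points by (local density, and distance to the nearest denser point) on a small distance matrix. The function name, the cutoff `d_c`, and the fixed number of centers `n_centers` are assumptions made for illustration.

```python
def density_peak_clusters(dist, d_c, n_centers):
    """Toy Density Peak Clustering on a symmetric distance matrix (list of lists)."""
    n = len(dist)
    # Local density: how many other points lie closer than the cutoff d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    # Rank points from densest to sparsest; ties are broken by index.
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n           # distance to the nearest higher-ranked point
    nearest_denser = [-1] * n   # index of that point
    for rank, i in enumerate(order):
        if rank == 0:
            # Convention: the top-ranked point gets its largest distance.
            delta[i] = max(dist[i])
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j
    # Centers combine high density with a large distance to any denser point.
    score = [rho[i] * delta[i] for i in range(n)]
    score[order[0]] = float("inf")  # the top-ranked point is always a center
    centers = sorted(range(n), key=lambda i: (-score[i], i))[:n_centers]
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    # Every other point inherits the label of its nearest higher-ranked point.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels


# Hypothetical one-dimensional example: two well-separated groups of points.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
print(density_peak_clusters(dist, d_c=1.5, n_centers=2))
```

In this sketch the number of centers is fixed by hand; in practice the centers are usually chosen by inspecting which points stand out in both density and distance.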
Affiliation(s)
- Federico Barone
- SISSA, Trieste, Italy
- AREA SCIENCE PARK, Trieste, Italy
- Department of Mathematics and Geosciences, University of Trieste, Trieste, Italy
- Alex Bateman
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, United Kingdom
- Marco Punta
- Center for Omics Sciences, IRCCS San Raffaele Institute, Milan, Italy
- Unit of Immunogenetics, Leukemia Genomics and Immunobiology, Division of Immunology, Transplantation and Infectious Disease, IRCCS San Raffaele Scientific Institute, Milan, Italy
4
Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform 2022; 23:6514404. [PMID: 35076693 PMCID: PMC8921630 DOI: 10.1093/bib/bbab563] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 09/09/2021] [Revised: 12/03/2021] [Accepted: 12/09/2021] [Indexed: 12/13/2022] Open
Abstract
A transcriptome constructed from short-read RNA sequencing (RNA-seq) is an easily attainable proxy catalog of protein-coding genes when genome assembly is unnecessary, expensive or difficult. In the absence of a sequenced genome to guide the reconstruction process, the transcriptome must be assembled de novo using only the information available in the RNA-seq reads. Subsequently, the sequences must be annotated in order to identify sequence-intrinsic and evolutionary features in them (for example, protein-coding regions). Although straightforward at first glance, de novo transcriptome assembly and annotation can quickly prove to be challenging undertakings. In addition to familiarizing themselves with the conceptual and technical intricacies of the tasks at hand and the numerous pre- and post-processing steps involved, those interested must also grapple with an overwhelmingly large choice of tools. The lack of standardized workflows, fast pace of development of new tools and techniques and paucity of authoritative literature have served to exacerbate the difficulty of the task even further. Here, we present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing steps, and present a compendium of corresponding tools.
Affiliation(s)
- Venket Raghavan
- Corresponding authors: Venket Raghavan, Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany; Louis Kraft, Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
- Louis Kraft
- Corresponding authors: Venket Raghavan, Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany; Louis Kraft, Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, 37077 Göttingen, Germany
5
Russo ET, Laio A, Punta M. Density Peak clustering of protein sequences associated to a Pfam clan reveals clear similarities and interesting differences with respect to manual family annotation. BMC Bioinformatics 2021; 22:121. [PMID: 33711918 PMCID: PMC7955657 DOI: 10.1186/s12859-021-04013-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/25/2020] [Accepted: 02/09/2021] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND The identification of protein families is of outstanding practical importance for in silico protein annotation and is at the basis of several bioinformatic resources. Pfam is possibly the most well known protein family database, built in many years of work by domain experts with extensive use of manual curation. This approach is generally very accurate, but it is quite time consuming and it may suffer from a bias generated from the hand-curation itself, which is often guided by the available experimental evidence. RESULTS We introduce a procedure that aims to identify automatically putative protein families. The procedure is based on Density Peak Clustering and uses as input only local pairwise alignments between protein sequences. In the experiment we present here, we ran the algorithm on about 4000 full-length proteins with at least one domain classified by Pfam as belonging to the Pseudouridine synthase and Archaeosine transglycosylase (PUA) clan. We obtained 71 automatically-generated sequence clusters with at least 100 members. While our clusters were largely consistent with the Pfam classification, showing good overlap with either single or multi-domain Pfam family architectures, we also observed some inconsistencies. The latter were inspected using structural and sequence based evidence, which suggested that the automatic classification captured evolutionary signals reflecting non-trivial features of protein family architectures. Based on this analysis we identified a putative novel pre-PUA domain as well as alternative boundaries for a few PUA or PUA-associated families. As a first indication that our approach was unlikely to be clan-specific, we performed the same analysis on the P53 clan, obtaining comparable results. 
CONCLUSIONS The clustering procedure described in this work takes advantage of the information contained in a large set of pairwise alignments and successfully identifies a set of putative families and family architectures in an unsupervised manner. Comparison with the Pfam classification highlights significant overlap and points to interesting differences, suggesting that our new algorithm could have potential in applications related to automatic protein classification. Testing this hypothesis, however, will require further experiments on large and diverse sequence datasets.
Affiliation(s)
- Marco Punta
- Centre for Evolution and Cancer, The Institute of Cancer Research, London, SM2 5NG UK
- Present Address: Center for Omics Sciences, IRCCS San Raffaele Hospital, 20132 Milan, Italy
6
Nethathe B, Abera A, Naidoo V. Expression and phylogeny of multidrug resistance protein 2 and 4 in African white backed vulture (Gyps africanus). PeerJ 2020; 8:e10422. [PMID: 33344079 PMCID: PMC7718797 DOI: 10.7717/peerj.10422] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/22/2020] [Accepted: 11/02/2020] [Indexed: 11/20/2022] Open
Abstract
Diclofenac toxicity in old world vultures is well described in the literature by both the severity of the toxicity induced and the speed of death. While the mechanism of toxicity remains unknown at present, the necropsy signs of gout suggest primary renal involvement at the level of the uric acid excretory pathways. From information in the chicken and man, uric acid excretion is known to be a complex process that involves a combination of glomerular filtration and active tubular excretion. In the proximal convoluted tubules, excretion occurs as a two-step process, with the basolateral cell membrane using the organic anion transporters and the apical membrane using the multidrug resistance proteins (MRPs) to transport uric acid from the blood into the tubular fluid. With uric acid excretion seemingly inhibited by diclofenac, it becomes important to characterize these transporter mechanisms at the species level. With no information available on the molecular characterization or expression of the MRPs of Gyps africanus, for this study we used next generation sequencing and Sanger sequencing on the renal tissue of the African white backed vulture (AWB) as the first step to establish whether the MRP genes are expressed in AWB. In silico analysis was conducted using different software to ascertain the function of these genes. The sequencing results revealed that MRP2 and MRP4 are expressed in AWB vultures. Phylogeny of avian MRP genes confirms that vultures and eagles are closely related, which could be attributed to having the same ancestral genes and foraging behavior. In silico analysis confirmed that the transcribed proteins would transport anionic compounds and glucose.
Affiliation(s)
- Bono Nethathe
- Department of Paraclinical Science, Faculty of Veterinary Science, University of Pretoria, Onderstepoort, Pretoria, South Africa; Department of Food Science and Technology, University of Venda, Thohoyandou, Limpopo, South Africa
- Aron Abera
- Inqaba Biotechnology, Sunnyside, Pretoria, South Africa
- Vinny Naidoo
- Department of Paraclinical Science, Faculty of Veterinary Science, University of Pretoria, Onderstepoort, Pretoria, South Africa
7
Gulyaeva AA, Sigorskih AI, Ocheredko ES, Samborskiy DV, Gorbalenya AE. LAMPA, LArge Multidomain Protein Annotator, and its application to RNA virus polyproteins. Bioinformatics 2020; 36:2731-2739. [PMID: 32003788 PMCID: PMC7203729 DOI: 10.1093/bioinformatics/btaa065] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/11/2019] [Revised: 01/02/2020] [Accepted: 01/23/2020] [Indexed: 12/28/2022] Open
Abstract
Motivation To facilitate accurate estimation of statistical significance of sequence similarity in profile–profile searches, queries should ideally correspond to protein domains. For multidomain proteins, using domains as queries depends on delineation of domain borders, which may be unknown. Thus, proteins are commonly used as queries that complicate establishing homology for similarities close to cutoff levels of statistical significance. Results In this article, we describe an iterative approach, called LAMPA, LArge Multidomain Protein Annotator, that resolves the above conundrum by gradual expansion of hit coverage of multidomain proteins through re-evaluating statistical significance of hit similarity using ever smaller queries defined at each iteration. LAMPA employs TMHMM and HHsearch for recognition of transmembrane regions and homology, respectively. We used Pfam database for annotating 2985 multidomain proteins (polyproteins) composed of >1000 amino acid residues, which dominate proteomes of RNA viruses. Under strict cutoffs, LAMPA outperformed HHsearch-mediated runs using intact polyproteins as queries by three measures: number of and coverage by identified homologous regions, and number of hit Pfam profiles. Compared to HHsearch, LAMPA identified 507 extra homologous regions in 14.4% of polyproteins. This Pfam-based annotation of RNA virus polyproteins by LAMPA was also superior to RefSeq expert annotation by two measures, region number and annotated length, for 69.3% of RNA virus polyprotein entries. We rationalized the obtained results based on dependencies of HHsearch hit statistical significance for local alignment similarity score from lengths and diversities of query-target pairs in computational experiments. Availability and implementation LAMPA 1.0.0 R package is placed at github (https://github.com/Gorbalenya-Lab/LAMPA). Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Anastasia A Gulyaeva
- Department of Medical Microbiology, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
- Dmitry V Samborskiy
- Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119899, Russia
- Alexander E Gorbalenya
- Department of Medical Microbiology, Leiden University Medical Center, Leiden 2300 RC, The Netherlands; Faculty of Bioengineering and Bioinformatics; Belozersky Institute of Physico-Chemical Biology, Lomonosov Moscow State University, Moscow 119899, Russia; Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden 2300 RC, The Netherlands
8
Mishra S, Rastogi YP, Jabin S, Kaur P, Amir M, Khatoon S. A bacterial phyla dataset for protein function prediction. Data Brief 2019; 28:105002. [PMID: 31921945 PMCID: PMC6950771 DOI: 10.1016/j.dib.2019.105002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 10/19/2019] [Accepted: 12/06/2019] [Indexed: 11/29/2022] Open
Abstract
Protein function prediction has been the most worked upon and the most challenging problem for computational biologists. The vast majority of known proteins have not yet been characterised experimentally, and there is a significant gap between their structures and functions. New un-annotated sequences are being added to the public protein databases (e.g. UniProtKB) at an enormous pace [1]. Such proteins with unknown functions might play key roles in metabolism and in the regulation of growth and development. Thus, if the functions of unknown proteins are left undiscovered, researchers may miss important information. Based on their sequence, structure, evolutionary history, and their association with other proteins, tools of computational biology can provide insights into the function of proteins [2]. For proteins with well characterised close relatives, it is trivial to infer function. Orphan proteins without discernible sequence relatives present a greater challenge [3]. Here the task of experimental characterisation is blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be completely experimentally characterised [4]. Thus, there is an emergent need to develop fast and accurate computational approaches to fulfil this requirement. Towards this end, we prepared a dataset for protein function prediction by extracting protein sequences and annotations of reviewed prokaryotic proteins (total count 323,719 as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those corresponding to the 1739 most frequent Gene Ontology (Molecular Function) terms, and 171,212 proteins were retrieved for feature generation. The dataset was generated by calculating the sequence, sub-sequence, physicochemical, and annotation-based features for each of the 171,212 reviewed proteins using the method in [10]. These features constitute a total of 9890 attributes for each protein sequence, along with 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target label. The dataset contains the Entry and Entry name of each sequence corresponding to the UniProtKB database. This dataset, being huge in size (171,212 samples × 9890 features, 1739 classes with multiple values) and equipped with a sufficient number of positive and negative samples of each of the 1739 classes, is good for testing the efficiency of any upcoming deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in the ratio 3:1 to form Train/Test dataset 1, with a train dataset of 128,409 samples and a test dataset of 42,803 samples, to facilitate training of a deep learning model. The train and test datasets are stratified to contain a good proportion of each of the 1739 classes. We then prepared a dataset 2 of pathogenic unreviewed proteins of the 9 bacterial phyla, each with the same 9890 features as the train/test dataset of reviewed proteins but without target labels, in order to predict their functions using the deep learning model proposed in [5].
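The split sizes quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only reproduces the 3:1 counts from the stated total; it does not implement the stratification the authors describe, and the function name `split_counts` is a hypothetical helper introduced here for illustration.

```python
def split_counts(total, train_parts=3, test_parts=1):
    """Return (train, test) sizes for a train_parts:test_parts split of total."""
    train = total * train_parts // (train_parts + test_parts)
    return train, total - train


train, test = split_counts(171_212)
print(train, test)  # 128409 42803, matching the counts given in the abstract
```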
Affiliation(s)
- Sarthak Mishra
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Yash Pratap Rastogi
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Suraiya Jabin
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Punit Kaur
- Department of Biophysics, All India Institute of Medical Sciences (AIIMS), New Delhi, 110029, Delhi, India
- Mohammad Amir
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
- Shabanam Khatoon
- Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi, 110025, Delhi, India
9
Sim EUH, Talwar SP. In silico evidence of de novo interactions between ribosomal and Epstein - Barr virus proteins. BMC Mol Cell Biol 2019; 20:34. [PMID: 31416416 PMCID: PMC6694676 DOI: 10.1186/s12860-019-0219-y] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 05/28/2019] [Accepted: 08/08/2019] [Indexed: 12/29/2022] Open
Abstract
Background Association of Epstein-Barr virus (EBV) encoded latent gene products with host ribosomal proteins (RPs) has not been fully explored, despite their involvement in the aetiology of several human cancers. To gain an insight into their plausible interactions, we employed a computational approach that encompasses structural alignment, gene ontology analysis, pathway analysis, and molecular docking. Results In this study, the alignment analysis based on structural similarity allows the prediction of 48 potential interactions between 27 human RPs and the EBV proteins EBNA1, LMP1, LMP2A, and LMP2B. Gene ontology analysis of the putative protein-protein interactions (PPIs) reveals their probable involvement in RNA binding, ribosome biogenesis, metabolic and biosynthetic processes, and gene regulation. Pathway analysis shows their possible participation in viral infection strategies (viral translation), as well as oncogenesis (Wnt and EGFR signalling pathways). Finally, our molecular docking assay predicts the functional interactions of EBNA1 with four RPs individually: EBNA1-eS10, EBNA1-eS25, EBNA1-uL10 and EBNA1-uL11. Conclusion These interactions have never been revealed previously via either experimental or in silico approach. We envisage that the calculated interactions between the ribosomal and EBV proteins herein would provide a hypothetical model for future experimental studies on the functional relationship between ribosomal proteins and EBV infection. Electronic supplementary material The online version of this article (10.1186/s12860-019-0219-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Edmund Ui-Hang Sim
- Faculty of Resource Science and Technology, Universiti Malaysia Sarawak, 94300, Kota Samarahan, Sarawak, Malaysia.
- Shruti Prashant Talwar
- Faculty of Resource Science and Technology, Universiti Malaysia Sarawak, 94300, Kota Samarahan, Sarawak, Malaysia
10
Hitch TCA, Clavel T. A proposed update for the classification and description of bacterial lipolytic enzymes. PeerJ 2019; 7:e7249. [PMID: 31328034 PMCID: PMC6622161 DOI: 10.7717/peerj.7249] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Received: 05/07/2019] [Accepted: 06/03/2019] [Indexed: 11/23/2022] Open
Abstract
Bacterial lipolytic enzymes represent an important class of proteins: they provide their host species with access to additional resources and have multiple applications within the biotechnology sector. Since the formalisation of lipolytic enzymes into families and subfamilies, advances in molecular biology have led to the discovery of lipolytic enzymes unable to be classified via the existing system. Utilising sequence-based comparison methods, we have integrated these novel families within the classification system so that it now consists of 35 families and 11 true lipase subfamilies. Representative sequences for each family and subfamily have been defined as well as methodology for accurate comparison of novel sequences against the reference proteins, facilitating the future assignment of novel proteins. Both the code and protein sequences required for integration of additional families are available at: https://github.com/thh32/Lipase_reclassification.
Affiliation(s)
- Thomas C A Hitch
- Functional Microbiome Research Group, Institute of Medical Microbiology, University Hospital of RWTH Aachen, Aachen, Germany
- Thomas Clavel
- Functional Microbiome Research Group, Institute of Medical Microbiology, University Hospital of RWTH Aachen, Aachen, Germany
11
Sánchez-Reyez A, Batista-García RA, Valdés-García G, Ortiz E, Perezgasga L, Zárate-Romero A, Pastor N, Folch-Mallol JL. A family 13 thioesterase isolated from an activated sludge metagenome: Insights into aromatic compounds metabolism. Proteins 2017; 85:1222-1237. [DOI: 10.1002/prot.25282] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 10/26/2016] [Revised: 02/21/2017] [Accepted: 02/27/2017] [Indexed: 12/23/2022]
Affiliation(s)
- Ayixon Sánchez-Reyez
- Centro de Investigación en Dinámica Celular, IICBA, Universidad Autónoma del Estado de Morelos (UAEM), Colonia Chamilpa; CP 62209 Cuernavaca, Morelos Mexico
- Centro de Investigación en Biotecnología UAEM; CP 62209 Cuernavaca Morelos Mexico
- Ramón Alberto Batista-García
- Centro de Investigación en Dinámica Celular, IICBA, Universidad Autónoma del Estado de Morelos (UAEM), Colonia Chamilpa; CP 62209 Cuernavaca, Morelos Mexico
- Gilberto Valdés-García
- Centro de Investigación en Dinámica Celular, IICBA, Universidad Autónoma del Estado de Morelos (UAEM), Colonia Chamilpa; CP 62209 Cuernavaca, Morelos Mexico
- Ernesto Ortiz
- Instituto de Biotecnología. Universidad Nacional Autónoma de México; CP 62210 Cuernavaca Morelos Mexico
- Lucía Perezgasga
- Instituto de Biotecnología. Universidad Nacional Autónoma de México; CP 62210 Cuernavaca Morelos Mexico
- Andrés Zárate-Romero
- Centro de Investigación en Biotecnología UAEM; CP 62209 Cuernavaca Morelos Mexico
- Nina Pastor
- Centro de Investigación en Dinámica Celular, IICBA, Universidad Autónoma del Estado de Morelos (UAEM), Colonia Chamilpa; CP 62209 Cuernavaca, Morelos Mexico
12
De-novo protein function prediction using DNA binding and RNA binding proteins as a test case. Nat Commun 2016; 7:13424. [PMID: 27869118 PMCID: PMC5121330 DOI: 10.1038/ncomms13424] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 03/14/2016] [Accepted: 10/03/2016] [Indexed: 12/14/2022] Open
Abstract
Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de-novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA and RNA binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de-novo function prediction based on identifying function-related biophysical features. Identification of the function of proteins is difficult when there are no structurally or biochemically characterized homologs. Here, the authors present an approach that allows the prediction of nucleic-acid binding proteins based on sequence alone, and they are able to experimentally validate their method.
|
13
|
Kaushik R, Jayaram B. Structural difficulty index: a reliable measure for modelability of protein tertiary structures. Protein Eng Des Sel 2016; 29:391-7. [PMID: 27334454 DOI: 10.1093/protein/gzw025] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 03/13/2016] [Accepted: 05/27/2016] [Indexed: 11/13/2022]
Abstract
The success in protein tertiary-structure prediction is considered to be a function of coverage and similarity/identity of their sequences with suitable templates in the structural databases. However, this measure of modelability of a protein sequence into its structure may be misleading. Addressing this limitation, we propose here a 'structural difficulty (SD)' index, which is derived from secondary structures, homology and physicochemical features of protein sequences. The SD index reflects the capability of predicting accurate structures and helps to assess the potential for developing proteome level structural databases for various organisms with some of the best methodologies available currently. For instance, the plausibility of populating the structural database of human proteome with reliable quality structures under 3 Å root mean square deviation from the corresponding natives is found to be ∼37% of a total of 11 084 manually curated soluble proteins and ∼64% for all annotated and reviewed unique soluble protein (344 661 sequences) of UniProtKB. Also for 77 human pathogenic viruses comprising 2365 globular viral proteins out of which only 162 structures are solved experimentally, SD index scores 1336 proteins in the modelable zone. Availability of reliable protein structures may prove a crucial aid in developing species-wise structural proteomic databases for accelerating function annotation and for drug development endeavors.
Affiliation(s)
- Rahul Kaushik
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
- B Jayaram
  - Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India; Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
|
14
|
Rivera-Borroto OM, García-de la Vega JM, Marrero-Ponce Y, Grau R. Relational Agreement Measures for Similarity Searching of Cheminformatic Data Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2016; 13:158-67. [PMID: 26886740 DOI: 10.1109/tcbb.2015.2424435] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 05/27/2023]
Abstract
Research on similarity searching of cheminformatic data sets has been focused on similarity measures using fingerprints. However, nominal scales are the least informative of all metric scales, increasing the tied similarity scores, and decreasing the effectivity of the retrieval engines. Tanimoto's coefficient has been claimed to be the most prominent measure for this task. Nevertheless, this field is far from being exhausted since the computer science no free lunch theorem predicts that "no similarity measure has overall superiority over the population of data sets". We introduce 12 relational agreement (RA) coefficients for seven metric scales, which are integrated within a group fusion-based similarity searching algorithm. These similarity measures are compared to a reference panel of 21 proximity quantifiers over 17 benchmark data sets (MUV), by using informative descriptors, a feature selection stage, a suitable performance metric, and powerful comparison tests. In this stage, RA coefficients perform favourably with respect to the state-of-the-art proximity measures. Afterward, the RA-based method outperforms another four nearest neighbor searching algorithms over the same data domains. In a third validation stage, RA measures are successfully applied to the virtual screening of the NCI data set. Finally, we discuss a possible molecular interpretation for these similarity variants.
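For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of Tanimoto-based similarity searching over binary fingerprints. It is not the authors' relational agreement method; the function names, the toy fingerprints, and the choice to score two empty fingerprints as 0.0 are assumptions made for illustration.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not a and not b:
        return 0.0  # assumed convention for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(query: set, library: dict) -> list:
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scores = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints, not taken from any real data set.
query = {1, 4, 7, 9}
library = {"mol_a": {1, 4, 7, 9}, "mol_b": {1, 4, 8}, "mol_c": {2, 3}}
ranking = rank_by_similarity(query, library)
```

Because the score depends only on counts of shared and unshared bits, many library compounds can receive identical scores, which is the tied-score problem the abstract mentions.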
|
15
|
GoFDR: A sequence alignment based method for predicting protein functions. Methods 2016; 93:3-14. [DOI: 10.1016/j.ymeth.2015.08.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.7] [Received: 06/01/2015] [Revised: 07/27/2015] [Accepted: 08/11/2015] [Indexed: 01/01/2023]
|
16
|
Mudgal R, Sandhya S, Chandra N, Srinivasan N. De-DUFing the DUFs: Deciphering distant evolutionary relationships of Domains of Unknown Function using sensitive homology detection methods. Biol Direct 2015; 10:38. [PMID: 26228684 PMCID: PMC4520260 DOI: 10.1186/s13062-015-0069-2] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 03/10/2015] [Accepted: 07/20/2015] [Indexed: 12/23/2022]
Abstract
Background In the post-genomic era where sequences are being determined at a rapid rate, we are highly reliant on computational methods for their tentative biochemical characterization. The Pfam database currently contains 3,786 families corresponding to “Domains of Unknown Function” (DUF) or “Uncharacterized Protein Family” (UPF), of which 3,087 families have no reported three-dimensional structure, constituting almost one-fourth of the known protein families in search for both structure and function. Results We applied a ‘computational structural genomics’ approach using five state-of-the-art remote similarity detection methods to detect the relationship between uncharacterized DUFs and domain families of known structures. The association with a structural domain family could serve as a start point in elucidating the function of a DUF. Amongst these five methods, searches in SCOP-NrichD database have been applied for the first time. Predictions were classified into high, medium and low- confidence based on the consensus of results from various approaches and also annotated with enzyme and Gene ontology terms. 614 uncharacterized DUFs could be associated with a known structural domain, of which high confidence predictions, involving at least four methods, were made for 54 families. These structure-function relationships for the 614 DUF families can be accessed on-line at http://proline.biochem.iisc.ernet.in/RHD_DUFS/. For potential enzymes in this set, we assessed their compatibility with the associated fold and performed detailed structural and functional annotation by examining alignments and extent of conservation of functional residues. Detailed discussion is provided for interesting assignments for DUF3050, DUF1636, DUF1572, DUF2092 and DUF659. Conclusions This study provides insights into the structure and potential function for nearly 20 % of the DUFs. 
Use of different computational approaches enables us to reliably recognize distant relationships, especially when they converge to a common assignment because the methods are often complementary. We observe that while pointers to the structural domain can offer the right clues to the function of a protein, recognition of its precise functional role is still ‘non-trivial’ with many DUF domains conserving only some of the critical residues. It is not clear whether these are functional vestiges or instances involving alternate substrates and interacting partners. Reviewers This article was reviewed by Drs Eugene Koonin, Frank Eisenhaber and Srikrishna Subramanian. Electronic supplementary material The online version of this article (doi:10.1186/s13062-015-0069-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Richa Mudgal
  - IISc Mathematics Initiative, Indian Institute of Science, Bangalore, 560 012, India.
- Sankaran Sandhya
  - Molecular Biophysics Unit, Indian Institute of Science, Bangalore, 560 012, India.
- Nagasuma Chandra
  - Department of Biochemistry, Indian Institute of Science, Bangalore, 560 012, India.
|
17
|
Guna A, Butcher NJ, Bassett AS. Comparative mapping of the 22q11.2 deletion region and the potential of simple model organisms. J Neurodev Disord 2015; 7:18. [PMID: 26137170 PMCID: PMC4487986 DOI: 10.1186/s11689-015-9113-x] [Citation(s) in RCA: 81] [Impact Index Per Article: 8.1] [Received: 02/23/2015] [Accepted: 05/26/2015] [Indexed: 01/18/2023]
Abstract
Background 22q11.2 deletion syndrome (22q11.2DS) is the most common micro-deletion syndrome. The associated 22q11.2 deletion conveys the strongest known molecular risk for schizophrenia. Neurodevelopmental phenotypes, including intellectual disability, are also prominent though variable in severity. Other developmental features include congenital cardiac and craniofacial anomalies. Whereas existing mouse models have been helpful in determining the role of some genes overlapped by the hemizygous 22q11.2 deletion in phenotypic expression, much remains unknown. Simple model organisms remain largely unexploited in exploring these genotype-phenotype relationships. Methods We first developed a comprehensive map of the human 22q11.2 deletion region, delineating gene content, and brain expression. To identify putative orthologs, standard methods were used to interrogate the proteomes of the zebrafish (D. rerio), fruit fly (D. melanogaster), and worm (C. elegans), in addition to the mouse. Spatial locations of conserved homologues were mapped to examine syntenic relationships. We systematically cataloged available knockout and knockdown models of all conserved genes across these organisms, including a comprehensive review of associated phenotypes. Results There are 90 genes overlapped by the typical 2.5 Mb deletion 22q11.2 region. Of the 46 protein-coding genes, 41 (89.1 %) have documented expression in the human brain. Identified homologues in the zebrafish (n = 37, 80.4 %) were comparable to those in the mouse (n = 40, 86.9 %) and included some conserved gene cluster structures. There were 22 (47.8 %) putative homologues in the fruit fly and 17 (37.0 %) in the worm involving multiple chromosomes. Individual gene knockdown mutants were available for the simple model organisms, but not for mouse. 
Although phenotypic data were relatively limited for knockout and knockdown models of the 17 genes conserved across all species, there was some evidence for roles in neurodevelopmental phenotypes, including four of the six mitochondrial genes in the 22q11.2 deletion region. Conclusions Simple model organisms represent a powerful but underutilized means of investigating the molecular mechanisms underlying the elevated risk for neurodevelopmental disorders in 22q11.2DS. This comparative multi-species study provides novel resources and support for the potential utility of non-mouse models in expression studies and high-throughput drug screening. The approach has implications for other recurrent copy number variations associated with neurodevelopmental phenotypes. Electronic supplementary material The online version of this article (doi:10.1186/s11689-015-9113-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Alina Guna
  - Clinical Genetics Research Program and Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON Canada
- Nancy J Butcher
  - Clinical Genetics Research Program and Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON Canada; Institute of Medical Science, University of Toronto, Toronto, ON Canada
- Anne S Bassett
  - Clinical Genetics Research Program and Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON Canada; Institute of Medical Science, University of Toronto, Toronto, ON Canada; Dalglish Family Hearts and Minds Clinic for Adults with 22q11.2 Deletion Syndrome, Division of Cardiology, Department of Medicine, Department of Psychiatry, and Toronto General Research Institute, University Health Network, Toronto, ON Canada; Department of Psychiatry, University of Toronto, Toronto, ON Canada; Centre for Addiction and Mental Health, 33 Russell Street, Room 1100, M5S 2S1 Toronto, ON Canada
|
18
|
Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, Orengo CA. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 2015; 43:W148-53. [PMID: 25964299 PMCID: PMC4489299 DOI: 10.1093/nar/gkv488] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 02/14/2015] [Accepted: 05/02/2015] [Indexed: 12/20/2022]
Abstract
The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced presents new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.
Affiliation(s)
- Sayoni Das
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- John Ward
  - Department of Biochemical Engineering, UCL, Gower Street, WC1E 6BT, UK
- Christine A Orengo
  - Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
|
19
|
In-depth characterisation of the lamb meat proteome from longissimus lumborum. EuPA Open Proteomics 2015. [DOI: 10.1016/j.euprot.2015.01.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 02/06/2023]
|
20
|
Chiang Z, Vastermark A, Punta M, Coggill PC, Mistry J, Finn RD, Saier MH. The complexity, challenges and benefits of comparing two transporter classification systems in TCDB and Pfam. Brief Bioinform 2015; 16:865-72. [PMID: 25614388 PMCID: PMC4570203 DOI: 10.1093/bib/bbu053] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 11/17/2014] [Indexed: 01/04/2023]
Abstract
Transport systems comprise roughly 10% of all proteins in a cell, playing critical roles in many processes. Improving and expanding their classification is an important goal that can affect studies ranging from comparative genomics to potential drug target searches. It is not surprising that different classification systems for transport proteins have arisen, be it within a specialized database, focused on this functional class of proteins, or as part of a broader classification system for all proteins. Two such databases are the Transporter Classification Database (TCDB) and the Protein family (Pfam) database. As part of a long-term endeavor to improve consistency between the two classification systems, we have compared transporter annotations in the two databases to understand the rationale for differences and to improve both systems. Differences sometimes reflect the fact that one database has a particular transporter family while the other does not. Differing family definitions and hierarchical organizations were reconciled, resulting in recognition of 69 Pfam ‘Domains of Unknown Function’, which proved to be transport protein families to be renamed using TCDB annotations. Of over 400 potential new Pfam families identified from TCDB, 10% have already been added to Pfam, and TCDB has created 60 new entries based on Pfam data. This work, for the first time, reveals the benefits of comprehensive database comparisons and explains the differences between Pfam and TCDB.
|
21
|
Jiang Y, Clark WT, Friedberg I, Radivojac P. The impact of incomplete knowledge on the evaluation of protein function prediction: a structured-output learning perspective. Bioinformatics 2015; 30:i609-16. [PMID: 25161254 PMCID: PMC4147924 DOI: 10.1093/bioinformatics/btu472] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.7] [Indexed: 12/31/2022]
Abstract
Motivation: The automated functional annotation of biological macromolecules is a problem of computational assignment of biological concepts or ontological terms to genes and gene products. A number of methods have been developed to computationally annotate genes using standardized nomenclature such as Gene Ontology (GO). However, questions remain about the possibility for development of accurate methods that can integrate disparate molecular data as well as about an unbiased evaluation of these methods. One important concern is that experimental annotations of proteins are incomplete. This raises questions as to whether and to what degree currently available data can be reliably used to train computational models and estimate their performance accuracy. Results: We study the effect of incomplete experimental annotations on the reliability of performance evaluation in protein function prediction. Using the structured-output learning framework, we provide theoretical analyses and carry out simulations to characterize the effect of growing experimental annotations on the correctness and stability of performance estimates corresponding to different types of methods. We then analyze real biological data by simulating the prediction, evaluation and subsequent re-evaluation (after additional experimental annotations become available) of GO term predictions. Our results agree with previous observations that incomplete and accumulating experimental annotations have the potential to significantly impact accuracy assessments. We find that their influence reflects a complex interplay between the prediction algorithm, performance metric and underlying ontology. However, using the available experimental data and under realistic assumptions, our results also suggest that current large-scale evaluations are meaningful and almost surprisingly reliable. Contact: predrag@indiana.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuxiang Jiang
  - Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Wyatt T Clark
  - Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Iddo Friedberg
  - Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
- Predrag Radivojac
  - Department of Computer Science and Informatics, Indiana University, Bloomington, IN, USA, Department of Microbiology and Department of Computer Science and Software Engineering, Miami University, Oxford, OH, USA
|
22
|
Koskinen P, Törönen P, Nokso-Koivisto J, Holm L. PANNZER: high-throughput functional annotation of uncharacterized proteins in an error-prone environment. Bioinformatics 2015; 31:1544-52. [PMID: 25653249 DOI: 10.1093/bioinformatics/btu851] [Citation(s) in RCA: 90] [Impact Index Per Article: 9.0] [Received: 06/17/2014] [Accepted: 12/24/2014] [Indexed: 01/06/2023]
Abstract
MOTIVATION The last decade has seen a remarkable growth in protein databases. This growth comes at a price: a growing number of submitted protein sequences lack functional annotation. Approximately 32% of sequences submitted to the most comprehensive protein database UniProtKB are labelled as 'Unknown protein' or alike. Also the functionally annotated parts are reported to contain 30-40% of errors. Here, we introduce a high-throughput tool for more reliable functional annotation called Protein ANNotation with Z-score (PANNZER). PANNZER predicts Gene Ontology (GO) classes and free text descriptions about protein functionality. PANNZER uses weighted k-nearest neighbour methods with statistical testing to maximize the reliability of a functional annotation. RESULTS Our results in free text description line prediction show that we outperformed all competing methods with a clear margin. In GO prediction we show clear improvement to our older method that performed well in CAFA 2011 challenge.
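The weighted k-nearest-neighbour idea named in this abstract can be sketched generically as follows. This is not PANNZER's scoring scheme, which also applies statistical testing; the function name, the similarity values, and the normalisation by the summed similarity of the top k hits are assumptions chosen for illustration.

```python
from collections import defaultdict


def weighted_knn_labels(neighbours, k=3):
    """Score labels by the summed similarity of the k most similar neighbours.

    neighbours: list of (similarity, set_of_labels) pairs for one query protein.
    Returns a dict mapping each label to a score between 0 and 1.
    """
    top = sorted(neighbours, key=lambda n: n[0], reverse=True)[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return {}
    scores = defaultdict(float)
    for sim, labels in top:
        for label in labels:
            scores[label] += sim  # each neighbour votes with its similarity
    return {label: s / total for label, s in scores.items()}


# Hypothetical search hits: (similarity to query, GO terms of the hit).
hits = [
    (0.9, {"GO:0003677"}),
    (0.6, {"GO:0003677", "GO:0005634"}),
    (0.5, {"GO:0005634"}),
    (0.1, {"GO:0016020"}),
]
scores = weighted_knn_labels(hits, k=3)
```

With k = 3 the weakest hit is ignored, so its label receives no score at all; a real annotation tool would additionally need a significance threshold before reporting any label.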
Affiliation(s)
- Patrik Koskinen
  - Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Petri Törönen
  - Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Jussi Nokso-Koivisto
  - Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
- Liisa Holm
  - Department of Biosciences, University of Helsinki, 00014 Helsinki, Finland and Institute of Biotechnology, University of Helsinki, 00014 Helsinki, Finland
|
23
|
Naqvi AAT, Ahmad F, Hassan MI. Identification of functional candidates amongst hypothetical proteins of Mycobacterium leprae Br4923, a causative agent of leprosy. Genome 2015; 58:25-42. [DOI: 10.1139/gen-2014-0178] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Indexed: 11/22/2022]
Abstract
Mycobacterium leprae is an intracellular obligate parasite that causes leprosy in humans, and it leads to the destruction of peripheral nerves and skin deformation. Here, we report an extensive analysis of the hypothetical proteins (HPs) from M. leprae strain Br4923, assigning their functions to better understand the mechanism of pathogenesis and to search for potential therapeutic interventions. The genome of M. leprae encodes 1604 proteins, of which the functions of 632 are not known (HPs). In this paper, we predicted the probable functions of 312 HPs. First, we classified all HPs into families and subfamilies on the basis of sequence similarity, followed by domain assignment, which provides many clues for their possible function. However, the functions of 320 proteins were not predicted because of low sequence similarity with proteins of known function. Annotated HPs were categorized into enzymes, binding proteins, transporters, and proteins involved in cellular processes. We found several novel proteins whose functions were unknown for M. leprae. These proteins have a requisite association with bacterial virulence and pathogenicity. Finally, our sequence-based analysis will be helpful for further validation and the search for potential drug targets while developing effective drugs to cure leprosy.
Affiliation(s)
- Ahmad Abu Turab Naqvi
  - Department of Computer Science, Jamia Millia Islamia, Jamia Nagar, New Delhi – 110025, India
- Faizan Ahmad
  - Center for Interdisciplinary Research in Basic Sciences, Jamia Millia Islamia, Jamia Nagar, New Delhi – 110025, India
- Md. Imtaiyaz Hassan
  - Center for Interdisciplinary Research in Basic Sciences, Jamia Millia Islamia, Jamia Nagar, New Delhi – 110025, India
|
24
|
Abstract
Background Phenotypic data are routinely used to elucidate gene function in organisms amenable to genetic manipulation. However, previous to this work, there was no generalizable system in place for the structured storage and retrieval of phenotypic information for bacteria. Results The Ontology of Microbial Phenotypes (OMP) has been created to standardize the capture of such phenotypic information from microbes. OMP has been built on the foundations of the Basic Formal Ontology and the Phenotype and Trait Ontology. Terms have logical definitions that can facilitate computational searching of phenotypes and their associated genes. OMP can be accessed via a wiki page as well as downloaded from SourceForge. Initial annotations with OMP are being made for Escherichia coli using a wiki-based annotation capture system. New OMP terms are being concurrently developed as annotation proceeds. Conclusions We anticipate that diverse groups studying microbial genetics and associated phenotypes will employ OMP for standardizing microbial phenotype annotation, much as the Gene Ontology has standardized gene product annotation. The resulting OMP resource and associated annotations will facilitate prediction of phenotypes for unknown genes and result in new experimental characterization of phenotypes and functions.
|
25
|
Exploring function prediction in protein interaction networks via clustering methods. PLoS One 2014; 9:e99755. [PMID: 24972109 PMCID: PMC4074043 DOI: 10.1371/journal.pone.0099755] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 01/21/2014] [Accepted: 05/17/2014] [Indexed: 01/06/2023]
Abstract
Complex networks have recently become the focus of research in many fields. Their structure reveals crucial information for the nodes, how they connect and share information. In our work we analyze protein interaction networks as complex networks for their functional modular structure and later use that information in the functional annotation of proteins within the network. We propose several graph representations for the protein interaction network, each having different level of complexity and inclusion of the annotation information within the graph. We aim to explore what the benefits and the drawbacks of these proposed graphs are, when they are used in the function prediction process via clustering methods. For making this cluster based prediction, we adopt well established approaches for cluster detection in complex networks using most recent representative algorithms that have been proven as efficient in the task at hand. The experiments are performed using a purified and reliable Saccharomyces cerevisiae protein interaction network, which is then used to generate the different graph representations. Each of the graph representations is later analysed in combination with each of the clustering algorithms, which have been possibly modified and implemented to fit the specific graph. We evaluate results in regards of biological validity and function prediction performance. Our results indicate that the novel ways of presenting the complex graph improve the prediction process, although the computational complexity should be taken into account when deciding on a particular approach.
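The general idea of cluster-based function prediction described in this abstract can be illustrated with a minimal sketch: once a network has been partitioned into clusters, each unannotated protein inherits the most common annotation among its cluster mates. The clustering step itself is omitted, and the function name, protein identifiers, single-label annotations, and majority-vote rule are assumptions for illustration rather than the authors' procedure.

```python
from collections import Counter


def predict_by_cluster(clusters, annotations):
    """Assign each unannotated protein the majority annotation of its cluster.

    clusters: iterable of sets of protein identifiers.
    annotations: dict mapping already-annotated proteins to one function label.
    Clusters with no annotated member yield no predictions.
    """
    predictions = {}
    for members in clusters:
        counts = Counter(annotations[p] for p in members if p in annotations)
        if not counts:
            continue  # nothing to transfer within this cluster
        top_label, _ = counts.most_common(1)[0]
        for protein in members:
            if protein not in annotations:
                predictions[protein] = top_label
    return predictions


# Hypothetical clusters and annotations.
clusters = [{"P1", "P2", "P3", "P4"}, {"P5", "P6"}]
annotations = {"P1": "transport", "P2": "transport", "P3": "kinase"}
predictions = predict_by_cluster(clusters, annotations)
```

How the graph is represented changes which proteins end up in the same cluster, and therefore which annotations are transferred, which is the dependence the abstract sets out to evaluate.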
|
26
|
Pitkänen E, Jouhten P, Hou J, Syed MF, Blomberg P, Kludas J, Oja M, Holm L, Penttilä M, Rousu J, Arvas M. Comparative genome-scale reconstruction of gapless metabolic networks for present and ancestral species. PLoS Comput Biol 2014; 10:e1003465. [PMID: 24516375 PMCID: PMC3916221 DOI: 10.1371/journal.pcbi.1003465] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 07/07/2013] [Accepted: 12/18/2013] [Indexed: 12/12/2022]
Abstract
We introduce a novel computational approach, CoReCo, for comparative metabolic reconstruction and provide genome-scale metabolic network models for 49 important fungal species. Leveraging on the exponential growth in sequenced genome availability, our method reconstructs genome-scale gapless metabolic networks simultaneously for a large number of species by integrating sequence data in a probabilistic framework. High reconstruction accuracy is demonstrated by comparisons to the well-curated Saccharomyces cerevisiae consensus model and large-scale knock-out experiments. Our comparative approach is particularly useful in scenarios where the quality of available sequence data is lacking, and when reconstructing evolutionary distant species. Moreover, the reconstructed networks are fully carbon mapped, allowing their use in 13C flux analysis. We demonstrate the functionality and usability of the reconstructed fungal models with computational steady-state biomass production experiment, as these fungi include some of the most important production organisms in industrial biotechnology. In contrast to many existing reconstruction techniques, only minimal manual effort is required before the reconstructed models are usable in flux balance experiments. CoReCo is available at http://esaskar.github.io/CoReCo/. Advances in next-generation sequencing technologies are revolutionizing molecular biology. Sequencing-enabled cost-effective characterization of microbial genomes is a particularly exciting development in metabolic engineering. There, considerable effort has been put to reconstructing genome-scale metabolic networks that describe the collection of hundreds to thousands of biochemical reactions available for a microbial cell. These network models are instrumental in understanding microbial metabolism and guiding metabolic engineering efforts to improve biochemical yields. 
We have developed a novel computational method, CoReCo, which bridges the growing gap between the availability of sequenced genomes and respective reconstructed metabolic networks. The method reconstructs genome-scale metabolic networks simultaneously for related microbial species. It utilizes the available sequencing data from these species to correct for incomplete and missing data. We used the method to reconstruct metabolic networks for a set of 49 fungal species providing the method protein sequence data and a phylogenetic tree describing the evolutionary relationships between the species. We demonstrate the applicability of the method by comparing a metabolic reconstruction of Saccharomyces cerevisiae to the manually curated, high-quality consensus network. We also provide an easy-to-use implementation of the method, usable both in single computer and distributed computing environments.
Affiliation(s)
- Esa Pitkänen
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Department of Medical Genetics, Genome-Scale Biology Research Program, University of Helsinki, Helsinki, Finland
- Paula Jouhten
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Jian Hou
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Peter Blomberg
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Jana Kludas
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Merja Oja
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Liisa Holm
  - Institute of Biotechnology & Department of Biosciences, University of Helsinki, Helsinki, Finland
- Merja Penttilä
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Juho Rousu
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Mikko Arvas
  - VTT Technical Research Centre of Finland, Espoo, Finland
27
Feiglin A, Ashkenazi S, Schlessinger A, Rost B, Ofran Y. Co-expression and co-localization of hub proteins and their partners are encoded in protein sequence. Mol Biosyst 2014; 10:787-94. [PMID: 24457447 DOI: 10.1039/c3mb70411d] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/23/2023]
Abstract
Spatiotemporal coordination is a critical factor in biological processes. Some hubs in protein-protein interaction networks tend to be co-expressed and co-localized with their partners more strongly than others, a difference which is arguably related to functional differences between the hubs. Based on numerous analyses of yeast hubs, it has been suggested that differences in co-expression and co-localization are reflected in the structural and molecular characteristics of the hubs. We hypothesized that if indeed differences in co-expression and co-localization are encoded in the molecular characteristics of the protein, it may be possible to predict the tendency for co-expression and co-localization of human hubs based on features learned from systematically characterized yeast hubs. Thus, we trained a prediction algorithm on hubs from yeast that were classified as either strongly or weakly co-expressed and co-localized with their partners, and applied the trained model to 800 human hub proteins. We found that the algorithm significantly distinguishes between human hubs that are co-expressed and co-localized with their partners and hubs that are not. The prediction is based on sequence derived features such as "stickiness", i.e. the existence of multiple putative binding sites that enable multiple simultaneous interactions, "plasticity", i.e. the existence of predicted structural disorder which conjecturally allows for multiple consecutive interactions with the same binding site and predicted subcellular localization. These results suggest that spatiotemporal dynamics is encoded, at least in part, in the amino acid sequence of the protein and that this encoding is similar in yeast and in human.
Affiliation(s)
- Ariel Feiglin
  - The Goodman Faculty of Life Sciences, Bar Ilan University, Ramat Gan 52900, Israel.
28
Puggioni V, Dondi A, Folli C, Shin I, Rhee S, Percudani R. Gene Context Analysis Reveals Functional Divergence between Hypothetically Equivalent Enzymes of the Purine–Ureide Pathway. Biochemistry 2014; 53:735-45. [DOI: 10.1021/bi4010107] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/29/2022]
Affiliation(s)
- Vincenzo Puggioni
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Ambra Dondi
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
- Claudia Folli
  - Department of Food Science, University of Parma, Italy
- Inchul Shin
  - Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Sangkee Rhee
  - Department of Agricultural Biotechnology, Seoul National University, Seoul, Korea
- Riccardo Percudani
  - Laboratory of Biochemistry, Molecular Biology, and Bioinformatics, Department of Life Sciences, University of Parma, Italy
29
Rahimi A, Madadkar-Sobhani A, Touserkani R, Goliaei B. Efficacy of function specific 3D-motifs in enzyme classification according to their EC-numbers. J Theor Biol 2013; 336:36-43. [PMID: 23871713 DOI: 10.1016/j.jtbi.2013.07.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/08/2013] [Revised: 06/05/2013] [Accepted: 07/02/2013] [Indexed: 11/28/2022]
Abstract
Due to the increasing number of protein structures with unknown function originating from structural genomics projects, protein function prediction has become an important subject in bioinformatics. Among diverse function prediction methods, searching for known 3D-motifs associated with functional elements in uncharacterized protein structures is one of the most biologically meaningful. Homologous enzymes inherit such motifs in their active sites from common ancestors. However, slight differences in the properties of these motifs result in variation in the reactions and substrates of the enzymes. In this study, we examined the possibility of discriminating highly related active site patterns according to their EC-numbers by 3D-motifs. For each EC-number, the spatial arrangement of an active site that has the minimum average distance to other active sites with the same function was selected as a representative 3D-motif. In order to characterize the motifs, various points in active site elements were tested. The results demonstrated the possibility of predicting the full EC-number of enzymes by 3D-motifs. However, the discriminating power of 3D-motifs varies among different enzyme families and depends on selecting the appropriate points and features.
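The representative-motif selection described in this abstract (the active site with the minimum average distance to the other sites of the same EC-number) amounts to picking a medoid. A minimal Python sketch follows. It assumes that each active site is given as an equally sized, consistently ordered list of already superposed (x, y, z) points and that RMSD serves as the distance; the abstract states neither, and the function names `rmsd` and `representative_motif` are hypothetical.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z) points.

    Assumption: points are in corresponding order and already superposed.
    """
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def representative_motif(sites):
    """Return the index of the site with the minimum average distance to all others."""
    if len(sites) == 1:
        return 0  # a single site is trivially its own representative

    def mean_distance(i):
        # Average RMSD from site i to every other site of the same function
        others = [rmsd(sites[i], s) for j, s in enumerate(sites) if j != i]
        return sum(others) / len(others)

    return min(range(len(sites)), key=mean_distance)


# Toy data (hypothetical): three "active sites", each described by two points
sites = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
]
print(representative_motif(sites))
```

Choosing a medoid rather than an averaged structure means the representative is always a real, observed active site, which matches the wording of the abstract.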
Affiliation(s)
- Amir Rahimi
  - Department of Bioinformatics, Institute of Biochemistry and Biophysics (IBB), University of Tehran, Tehran, Iran
30
Diniz MC, Pacheco ACL, Farias KM, de Oliveira DM. The eukaryotic flagellum makes the day: novel and unforeseen roles uncovered after post-genomics and proteomics data. Curr Protein Pept Sci 2013; 13:524-46. [PMID: 22708495 PMCID: PMC3499766 DOI: 10.2174/138920312803582951] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 10/24/2011] [Revised: 05/22/2012] [Accepted: 05/23/2012] [Indexed: 12/21/2022]
Abstract
This review will summarize and discuss the current biological understanding of the motile eukaryotic flagellum, as posed out by recent advances enabled by post-genomics and proteomics approaches. The organelle, which is crucial for motility, survival, differentiation, reproduction, division and feeding, among other activities, of many eukaryotes, is a great example of a natural nanomachine assembled mostly by proteins (around 350-650 of them) that have been conserved throughout eukaryotic evolution. Flagellar proteins are discussed in terms of their arrangement on to the axoneme, the canonical “9+2” microtubule pattern, and also motor and sensorial elements that have been detected by recent proteomic analyses in organisms such as Chlamydomonas reinhardtii, sea urchin, and trypanosomatids. Such findings can be remarkably matched up to important discoveries in vertebrate and mammalian types as diverse as sperm cells, ciliated kidney epithelia, respiratory and oviductal cilia, and neuro-epithelia, among others. Here we will focus on some exciting work regarding eukaryotic flagellar proteins, particularly using the flagellar proteome of C. reinhardtii as a reference map for exploring motility in function, dysfunction and pathogenic flagellates. The reference map for the eukaryotic flagellar proteome consists of 652 proteins that include known structural and intraflagellar transport (IFT) proteins, less well-characterized signal transduction proteins and flagellar associated proteins (FAPs), besides almost two hundred unannotated conserved proteins, which lately have been the subject of intense investigation and of our present examination.
Affiliation(s)
- Michely C Diniz
  - Programa de Pós-Graduação em Biotecnologia-RENORBIO-Rede Nordeste de Biotecnologia, Universidade Estadual do Ceará-UECE, Av. Paranjana, 1700, Campus do Itaperi, Fortaleza, CE 60740-000 Brasil
31
Cock PJA, Grüning BA, Paszkiewicz K, Pritchard L. Galaxy tools and workflows for sequence analysis with applications in molecular plant pathology. PeerJ 2013; 1:e167. [PMID: 24109552 PMCID: PMC3792188 DOI: 10.7717/peerj.167] [Citation(s) in RCA: 112] [Impact Index Per Article: 9.3] [Received: 07/17/2013] [Accepted: 08/30/2013] [Indexed: 12/28/2022]
Abstract
The Galaxy Project offers the popular web browser-based platform Galaxy for running bioinformatics tools and constructing simple workflows. Here, we present a broad collection of additional Galaxy tools for large scale analysis of gene and protein sequences. The motivating research theme is the identification of specific genes of interest in a range of non-model organisms, and our central example is the identification and prediction of "effector" proteins produced by plant pathogens in order to manipulate their host plant. This functional annotation of a pathogen's predicted capacity for virulence is a key step in translating sequence data into potential applications in plant pathology. This collection includes novel tools, and widely-used third-party tools such as NCBI BLAST+ wrapped for use within Galaxy. Individual bioinformatics software tools are typically available separately as standalone packages, or in online browser-based form. The Galaxy framework enables the user to combine these and other tools to automate organism scale analyses as workflows, without demanding familiarity with command line tools and scripting. Workflows created using Galaxy can be saved and are reusable, so may be distributed within and between research groups, facilitating the construction of a set of standardised, reusable bioinformatic protocols. The Galaxy tools and workflows described in this manuscript are open source and freely available from the Galaxy Tool Shed (http://usegalaxy.org/toolshed or http://toolshed.g2.bx.psu.edu).
Affiliation(s)
- Peter J A Cock
  - Information and Computational Sciences, James Hutton Institute, UK
32
Malhotra A, Creer S, Harris JB, Stöcklin R, Favreau P, Thorpe RS. Predicting function from sequence in a large multifunctional toxin family. Toxicon 2013; 72:113-25. [PMID: 23831284 DOI: 10.1016/j.toxicon.2013.06.019] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 04/04/2013] [Revised: 06/21/2013] [Accepted: 06/26/2013] [Indexed: 11/30/2022]
Abstract
Venoms contain active substances with highly specific physiological effects and are increasingly being used as sources of novel diagnostic, research and treatment tools for human disease. Experimental characterisation of individual toxin activities is a severe rate-limiting step in the discovery process, and in-silico tools which allow function to be predicted from sequence information are essential. Toxins are typically members of large multifunctional families of structurally similar proteins that can have different biological activities, and minor sequence divergence can have significant consequences. Thus, existing predictive tools tend to have low accuracy. We investigated a classification model based on physico-chemical attributes that can easily be calculated from amino-acid sequences, using over 250 (mostly novel) viperid phospholipase A₂ toxins. We also clustered proteins by sequence profiles, and carried out in-vitro tests for four major activities on a selection of isolated novel toxins, or crude venoms known to contain them. The majority of detected activities were consistent with predictions, in contrast to poor performance of a number of tested existing predictive methods. Our results provide a framework for comparison of active sites among different functional sub-groups of toxins that will allow a more targeted approach for identification of potential drug leads in the future.
Affiliation(s)
- Anita Malhotra
  - School of Biological Sciences, College of Natural Sciences, Bangor University, Bangor LL57 2UW, UK.
33
Abstract
Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cure for many diseases.
Affiliation(s)
- Yana Bromberg
  - Department of Biochemistry and Microbiology, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, New Jersey, USA.
34
Mistry J, Finn RD, Eddy SR, Bateman A, Punta M. Challenges in homology search: HMMER3 and convergent evolution of coiled-coil regions. Nucleic Acids Res 2013; 41:e121. [PMID: 23598997 PMCID: PMC3695513 DOI: 10.1093/nar/gkt263] [Citation(s) in RCA: 1032] [Impact Index Per Article: 86.0] [Indexed: 12/31/2022]
Abstract
Detection of protein homology via sequence similarity has important applications in biology, from protein structure and function prediction to reconstruction of phylogenies. Although current methods for aligning protein sequences are powerful, challenges remain, including problems with homologous overextension of alignments and with regions under convergent evolution. Here, we test the ability of the profile hidden Markov model method HMMER3 to correctly assign homologous sequences to >13,000 manually curated families from the Pfam database. We identify problem families using protein regions that match two or more Pfam families not currently annotated as related in Pfam. We find that HMMER3 E-value estimates seem to be less accurate for families that feature periodic patterns of compositional bias, such as the ones typically observed in coiled-coils. These results support the continued use of manually curated inclusion thresholds in the Pfam database, especially on the subset of families that have been identified as problematic in experiments such as these. They also highlight the need for developing new methods that can correct for this particular type of compositional bias.
Affiliation(s)
- Jaina Mistry
  - EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
35
Nam HJ, Han SK, Bowie JU, Kim S. Rampant exchange of the structure and function of extramembrane domains between membrane and water soluble proteins. PLoS Comput Biol 2013; 9:e1002997. [PMID: 23555228 PMCID: PMC3605051 DOI: 10.1371/journal.pcbi.1002997] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 07/18/2012] [Accepted: 02/04/2013] [Indexed: 11/19/2022]
Abstract
Of the membrane proteins of known structure, we found that a remarkable 67% of the water soluble domains are structurally similar to water soluble proteins of known structure. Moreover, 41% of known water soluble protein structures share a domain with an already known membrane protein structure. We also found that functional residues are frequently conserved between extramembrane domains of membrane and soluble proteins that share structural similarity. These results suggest membrane and soluble proteins readily exchange domains and their attendant functionalities. The exchanges between membrane and soluble proteins are particularly frequent in eukaryotes, indicating that this is an important mechanism for increasing functional complexity. The high level of structural overlap between the two classes of proteins provides an opportunity to employ the extensive information on soluble proteins to illuminate membrane protein structure and function, for which much less is known. To this end, we employed structure guided sequence alignment to elucidate the functions of membrane proteins in the human genome. Our results bridge the gap of fold space between membrane and water soluble proteins and provide a resource for the prediction of membrane protein function. A database of predicted structural and functional relationships for proteins in the human genome is provided at sbi.postech.ac.kr/emdmp.
Affiliation(s)
- Hyun-Jun Nam
  - School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
- Seong Kyu Han
  - School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
- James U. Bowie
  - Department of Chemistry and Biochemistry, UCLA-DOE Institute of Genomics and Proteomics, Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, United States of America
  - * E-mail: (JB); (SK)
- Sanguk Kim
  - School of Interdisciplinary Bioscience and Bioengineering, Department of Life Science, Division of IT Convergence Engineering, Pohang University of Science and Technology, Pohang, Korea
  - Department of Chemistry and Biochemistry, UCLA-DOE Institute of Genomics and Proteomics, Molecular Biology Institute, University of California Los Angeles, Los Angeles, California, United States of America
  - * E-mail: (JB); (SK)
36
Kolodny R, Kosloff M. From Protein Structure to Function via Computational Tools and Approaches. Isr J Chem 2013. [DOI: 10.1002/ijch.201200078] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023]
37
Abstract
Background: Computational/manual annotations of protein functions are one of the first routes to making sense of a newly sequenced genome. Protein domain predictions form an essential part of this annotation process. This is due to the natural modularity of proteins with domains as structural, evolutionary and functional units. Sometimes two, three, or more adjacent domains (called supra-domains) are the operational unit responsible for a function, e.g. via a binding site at the interface. These supra-domains have contributed to functional diversification in higher organisms. Traditionally functional ontologies have been applied to individual proteins, rather than families of related domains and supra-domains. We expect, however, to some extent functional signals can be carried by protein domains and supra-domains, and consequently used in function prediction and functional genomics.
Results: Here we present a domain-centric Gene Ontology (dcGO) perspective. We generalize a framework for automatically inferring ontological terms associated with domains and supra-domains from full-length sequence annotations. This general framework has been applied specifically to primary protein-level annotations from UniProtKB-GOA, generating GO term associations with SCOP domains and supra-domains. The resulting 'dcGO Predictor', can be used to provide functional annotation to protein sequences. The functional annotation of sequences in the Critical Assessment of Function Annotation (CAFA) has been used as a valuable opportunity to validate our method and to be assessed by the community. The functional annotation of all completely sequenced genomes has demonstrated the potential for domain-centric GO enrichment analysis to yield functional insights into newly sequenced or yet-to-be-annotated genomes. This generalized framework we have presented has also been applied to other domain classifications such as InterPro and Pfam, and other ontologies such as mammalian phenotype and disease ontology. The dcGO and its predictor are available at http://supfam.org/SUPERFAMILY/dcGO including an enrichment analysis tool.
Conclusions: As functional units, domains offer a unique perspective on function prediction regardless of whether proteins are multi-domain or single-domain. The 'dcGO Predictor' holds great promise for contributing to a domain-centric functional understanding of genomes in the next generation sequencing era.
Affiliation(s)
- Hai Fang
  - Department of Computer Science, University of Bristol, The Merchant Venturers Building, Bristol BS8 1UB, UK.
38
Maurer-Stroh S, Gao H, Han H, Baeten L, Schymkowitz J, Rousseau F, Zhang L, Eisenhaber F. Motif discovery with data mining in 3D protein structure databases: discovery, validation and prediction of the U-shape zinc binding ("Huf-Zinc") motif. J Bioinform Comput Biol 2013; 11:1340008. [DOI: 10.1142/s0219720013400088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Data mining in protein databases, derivatives from more fundamental protein 3D structure and sequence databases, has considerable unearthed potential for the discovery of sequence motif—structural motif—function relationships as the finding of the U-shape (Huf-Zinc) motif, originally a small student's project, exemplifies. The metal ion zinc is critically involved in universal biological processes, ranging from protein-DNA complexes and transcription regulation to enzymatic catalysis and metabolic pathways. Proteins have evolved a series of motifs to specifically recognize and bind zinc ions. Many of these, so called zinc fingers, are structurally independent globular domains with discontinuous binding motifs made up of residues mostly far apart in sequence. Through a systematic approach starting from the BRIX structure fragment database, we discovered that there exists another predictable subset of zinc-binding motifs that not only have a conserved continuous sequence pattern but also share a characteristic local conformation, despite being included in totally different overall folds. While this does not allow general prediction of all Zn binding motifs, a HMM-based web server, Huf-Zinc, is available for prediction of these novel, as well as conventional, zinc finger motifs in protein sequences. The Huf-Zinc webserver can be freely accessed through this URL ( http://mendel.bii.a-star.edu.sg/METHODS/hufzinc/ ).
Affiliation(s)
- Sebastian Maurer-Stroh
  - Bioinformatics Institute (BII), Agency for Science and Technology (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
  - School of Biological Sciences (SBS), Nanyang Technological University (NTU), 60 Nanyang Drive, 637551, Singapore
- He Gao
  - Bioinformatics Institute (BII), Agency for Science and Technology (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
  - NUS Graduate School for Integrative Sciences and Engineering, National University of Singapore, Centre for Life Sciences, #05-01, 28 Medical Drive, Singapore 117456, Singapore
- Hao Han
  - Bioinformatics Institute (BII), Agency for Science and Technology (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
- Lies Baeten
  - VIB Switch Laboratory, Katholieke Universiteit Leuven, Herestraat 49, Box 802, 3000 Leuven, Belgium
- Joost Schymkowitz
  - VIB Switch Laboratory, Katholieke Universiteit Leuven, Herestraat 49, Box 802, 3000 Leuven, Belgium
- Frederic Rousseau
  - VIB Switch Laboratory, Katholieke Universiteit Leuven, Herestraat 49, Box 802, 3000 Leuven, Belgium
- Louxin Zhang
  - Department of Mathematics, National University of Singapore, 10 Lower Kent Ridge Road, Singapore 119076, Singapore
- Frank Eisenhaber
  - Bioinformatics Institute (BII), Agency for Science and Technology (A*STAR), 30 Biopolis Street, #07-01, Matrix, 138671, Singapore
  - Department of Biological Sciences (DBS), National University of Singapore (NUS), 8 Medical Drive 4, 117597, Singapore
  - School of Computer Engineering (SCE), Nanyang Technological University (NTU), 50 Nanyang Drive, 637553, Singapore
39
Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, Pandey G, Yunes JM, Talwalkar AS, Repo S, Souza ML, Piovesan D, Casadio R, Wang Z, Cheng J, Fang H, Gough J, Koskinen P, Törönen P, Nokso-Koivisto J, Holm L, Cozzetto D, Buchan DWA, Bryson K, Jones DT, Limaye B, Inamdar H, Datta A, Manjari SK, Joshi R, Chitale M, Kihara D, Lisewski AM, Erdin S, Venner E, Lichtarge O, Rentzsch R, Yang H, Romero AE, Bhat P, Paccanaro A, Hamp T, Kaßner R, Seemayer S, Vicedo E, Schaefer C, Achten D, Auer F, Boehm A, Braun T, Hecht M, Heron M, Hönigschmid P, Hopf TA, Kaufmann S, Kiening M, Krompass D, Landerer C, Mahlich Y, Roos M, Björne J, Salakoski T, Wong A, Shatkay H, Gatzmann F, Sommer I, Wass MN, Sternberg MJE, Škunca N, Supek F, Bošnjak M, Panov P, Džeroski S, Šmuc T, Kourmpetis YAI, van Dijk ADJ, ter Braak CJF, Zhou Y, Gong Q, Dong X, Tian W, Falda M, Fontana P, Lavezzo E, Di Camillo B, Toppo S, Lan L, Djuric N, Guo Y, Vucetic S, Bairoch A, Linial M, Babbitt PC, Brenner SE, Orengo C, Rost B, Mooney SD, Friedberg I. A large-scale evaluation of computational protein function prediction. Nat Methods 2013; 10:221-7. [PMID: 23353650 PMCID: PMC3584181 DOI: 10.1038/nmeth.2340] [Citation(s) in RCA: 625] [Impact Index Per Article: 52.1] [Received: 04/02/2012] [Accepted: 12/10/2012] [Indexed: 01/03/2023]
Abstract
A report on the results of the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Affiliation(s)
- Predrag Radivojac
  - School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
40
Konopka BM, Nebel JC, Kotulska M. Quality assessment of protein model-structures based on structural and functional similarities. BMC Bioinformatics 2012; 13:242. [PMID: 22998498 PMCID: PMC3526563 DOI: 10.1186/1471-2105-13-242] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 01/19/2012] [Accepted: 09/14/2012] [Indexed: 11/10/2022]
Abstract
Background Experimental determination of protein 3D structures is expensive, time-consuming and sometimes impossible. The gap between the number of protein structures deposited in the Worldwide Protein Data Bank and the number of sequenced proteins is constantly widening. Computational modeling is regarded as one way to address the problem. Although protein 3D structure prediction is a difficult task, many tools are available. These tools can model a structure from a sequence or from partial structural information, e.g. contact maps. Consequently, biologists can automatically generate a putative 3D structure model of any protein. The main issue then becomes evaluating model quality, which is one of the most important challenges in structural biology. Results GOBA (Gene Ontology-Based Assessment) is a novel Protein Model Quality Assessment Program. It estimates the compatibility between a model-structure and its expected function. GOBA is based on the assumption that a high-quality model should be structurally similar to proteins that are functionally similar to the prediction target. DALI is used to measure structural similarity, whereas functional similarity is quantified using the standardized, hierarchical description of proteins provided by Gene Ontology, combined with Wang's algorithm for calculating semantic similarity. Two approaches are proposed to express the quality of protein model-structures. One is a single-model quality assessment method; the other is a modification of it that provides a relative measure of model quality. Exhaustive evaluation is performed on data sets of model-structures submitted to the CASP8 and CASP9 contests. Conclusions The validation shows that the method is able to discriminate between good and bad model-structures. The best of the tested GOBA scores achieved mean Pearson correlations of 0.74 and 0.8 with the observed quality of models in our CASP8- and CASP9-based validation sets. GOBA also obtained the best result for two targets of CASP8 and one of CASP9, compared to the contest participants. Consequently, GOBA offers a novel single-model quality assessment program that addresses the practical needs of biologists. In conjunction with other Model Quality Assessment Programs (MQAPs), it would prove useful for the evaluation of single protein models.
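The abstract above names Wang's algorithm as the measure of functional similarity between Gene Ontology terms. The following is a minimal sketch of that kind of graph-based semantic similarity, not the GOBA implementation: the toy ontology, the term names, and the single edge weight of 0.8 applied to every relation are assumptions made for illustration.

```python
def s_values(term, parents, weight=0.8):
    """Contribution (S-value) of `term` and each of its ancestors to `term`.

    The term itself scores 1.0; each step up the graph multiplies the score
    by `weight`, keeping the best score when several paths reach an ancestor.
    """
    values = {term: 1.0}
    frontier = [term]
    while frontier:
        child = frontier.pop()
        for parent in parents.get(child, ()):
            candidate = weight * values[child]
            if candidate > values.get(parent, 0.0):
                values[parent] = candidate
                frontier.append(parent)
    return values


def wang_similarity(a, b, parents, weight=0.8):
    """Similarity of two terms from the S-values of their shared ancestors."""
    sa = s_values(a, parents, weight)
    sb = s_values(b, parents, weight)
    shared = sa.keys() & sb.keys()
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))


# Hypothetical three-level ontology (child -> list of parents).
toy_parents = {
    "kinase": ["catalytic"],
    "hydrolase": ["catalytic"],
    "catalytic": ["molecular_function"],
}
sim_same = wang_similarity("kinase", "kinase", toy_parents)
sim_siblings = wang_similarity("kinase", "hydrolase", toy_parents)
```

In this toy graph, identical terms score 1 and sibling terms score lower because only their shared ancestors contribute to the numerator.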
Affiliation(s)
- Bogumil M Konopka
- Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, 50-370, Wroclaw, Poland
|
41
|
MUC16/CA125 in the context of modular proteins with an annotated role in adhesion-related processes: in silico analysis. Int J Mol Sci 2012; 13:10387-10400. [PMID: 22949868 PMCID: PMC3431866 DOI: 10.3390/ijms130810387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Revised: 07/23/2012] [Accepted: 08/09/2012] [Indexed: 11/25/2022]
Abstract
Mucin 16 (MUC16) is a type I transmembrane protein, the extracellular portion of which is shed after proteolytic degradation and is denoted as CA125 antigen, a well-known tumor marker for ovarian cancer. As yet there is no detailed insight into the heterogeneity and ligand properties of its polypeptide and glycan structures, which may greatly influence its function and biomarker potential. This study aimed to obtain further insight into the biological capacity of MUC16/CA125 using in silico analysis of the corresponding mucin sequences, including similarity searches as well as GO (gene ontology)-based function prediction. The results pointed to similarities between the extracellular serine/threonine-rich regions of MUC16 and sequences of proteins expressed in evolutionarily distant taxa, all having in common an annotated role in adhesion-related processes. Specifically, homology to conserved domains from the family of herpesvirus major outer envelope protein (BLLF1) was found. In addition, the possible involvement of MUC16/CA125 in carbohydrate-binding interactions or cellular transport of proteins/ions was suggested.
|
42
|
Capriotti E, Nehrt NL, Kann MG, Bromberg Y. Bioinformatics for personal genome interpretation. Brief Bioinform 2012; 13:495-512. [PMID: 22247263 PMCID: PMC3404395 DOI: 10.1093/bib/bbr070] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.3] [Received: 09/19/2011] [Revised: 11/08/2011] [Indexed: 01/02/2023]
Abstract
An international consortium released the first draft sequence of the human genome 10 years ago. Although the analysis of this data has suggested the genetic underpinnings of many diseases, we have not yet been able to fully quantify the relationship between genotype and phenotype. Thus, a major current effort of the scientific community focuses on evaluating individual predispositions to specific phenotypic traits given their genetic backgrounds. Many resources aim to identify and annotate the specific genes responsible for the observed phenotypes. Some of these use intra-species genetic variability as a means for better understanding this relationship. In addition, several online resources are now dedicated to collecting single nucleotide variants and other types of variants, and annotating their functional effects and associations with phenotypic traits. This information has enabled researchers to develop bioinformatics tools to analyze the rapidly increasing amount of newly extracted variation data and to predict the effect of uncharacterized variants. In this work, we review the most important developments in the field: the databases and bioinformatics tools that will be of utmost importance in our concerted effort to interpret the human variome.
|
43
|
Klie S, Nikoloski Z. The Choice between MapMan and Gene Ontology for Automated Gene Function Prediction in Plant Science. Front Genet 2012; 3:115. [PMID: 22754563 PMCID: PMC3384976 DOI: 10.3389/fgene.2012.00115] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.2] [Received: 04/02/2012] [Accepted: 06/05/2012] [Indexed: 12/23/2022]
Abstract
Since the introduction of the Gene Ontology (GO), the analysis of high-throughput data has become tightly coupled with the use of ontologies to establish associations between knowledge and data in an automated fashion. Ontologies provide a systematic description of knowledge by a controlled vocabulary of defined structure in which ontological concepts are connected by pre-defined relationships. In plant science, MapMan and GO offer two alternatives for ontology-driven analyses. Unlike GO, initially developed to characterize microbial systems, MapMan was specifically designed to cover plant-specific pathways and processes. While the dependencies between concepts in MapMan are modeled as a tree, in GO these are captured in a directed acyclic graph. Therefore, the difference between the ontologies may cause discrepancies in data reduction, visualization, and hypothesis generation. Here we provide the first systematic comparative analysis of GO and MapMan for the model plant species Arabidopsis thaliana (Arabidopsis) with respect to their structural properties and differences in the distribution of information content. In addition, we investigate the effect of the two ontologies on the specificity and sensitivity of automated gene function prediction via the coupling of co-expression networks and the guilt-by-association principle. Automated gene function prediction is particularly needed for the model plant Arabidopsis, in which only half of the genes have been functionally annotated based on sequence similarity to known genes. The results highlight the need for structured representation of species-specific biological knowledge and warrant caution in the design principles employed in future ontologies.
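The guilt-by-association principle mentioned in the abstract above can be illustrated with a minimal sketch: an unannotated gene inherits the terms that are common among its annotated neighbours in a co-expression network. The gene names, term labels, network, and the `min_support` threshold below are hypothetical, and the voting rule is one simple variant rather than the procedure used in the cited study.

```python
from collections import Counter


def predict_terms(gene, network, annotations, min_support=0.5):
    """Score each term by the fraction of annotated neighbours that carry it."""
    neighbours = network.get(gene, set())
    annotated = [n for n in neighbours if n in annotations]
    if not annotated:
        return {}  # no annotated neighbours, so nothing can be transferred
    counts = Counter(term for n in annotated for term in annotations[n])
    scores = {term: count / len(annotated) for term, count in counts.items()}
    # Keep only terms supported by at least `min_support` of the neighbours.
    return {term: s for term, s in scores.items() if s >= min_support}


# Hypothetical co-expression neighbourhood and annotations.
toy_network = {"g4": {"g1", "g2", "g3"}}
toy_annotations = {
    "g1": {"photosynthesis"},
    "g2": {"photosynthesis", "transport"},
    "g3": {"photosynthesis"},
}
predicted = predict_terms("g4", toy_network, toy_annotations)
```

Here "photosynthesis" is carried by all three neighbours and is transferred, while "transport" is carried by only one of three and falls below the threshold. The same function works with either ontology, since it only sees term labels.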
Affiliation(s)
- Sebastian Klie
- Genes and Small Molecules Group, Max-Planck Institute of Molecular Plant Physiology Potsdam-Golm, Germany
|
44
|
Pires DEV, de Melo-Minardi RC, dos Santos MA, da Silveira CH, Santoro MM, Meira W. Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics 2011; 12 Suppl 4:S12. [PMID: 22369665 PMCID: PMC3287581 DOI: 10.1186/1471-2164-12-s4-s12] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 12/27/2022]
Abstract
Background The unforgiving pace of growth of available biological data has increased the demand for efficient and scalable paradigms, models and methodologies for automatic annotation. In this paper, we present a novel structure-based protein function prediction and structural classification method: Cutoff Scanning Matrix (CSM). CSM generates feature vectors that represent distance patterns between protein residues. These feature vectors are then used as evidence for classification. Singular value decomposition (SVD) is used as a preprocessing step to reduce dimensionality and noise. The aspect of protein function considered in the present work is enzyme activity. A series of experiments was performed on datasets based on Enzyme Commission (EC) numbers and mechanistically different enzyme superfamilies, as well as other datasets derived from SCOP release 1.75. Results CSM achieved a precision of up to 99% after SVD preprocessing for a database derived from manually curated protein superfamilies and up to 95% for a dataset of the 950 most-populated EC numbers. Moreover, we conducted experiments to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. An experiment using the whole set of domains found in the latest SCOP version yielded high levels of precision and recall (up to 95%). Finally, we compared our structural classification results with those in the literature to place this work into context. Our method significantly improved the recall of a previous study while preserving a compatible precision level. Conclusions We showed that the patterns derived from CSMs can be used effectively to predict protein function and thus help with automatic function annotation. We also demonstrated that our method is effective in structural classification tasks. These facts reinforce the idea that the pattern of inter-residue distances is an important component of family structural signatures. Furthermore, singular value decomposition provided a consistent increase in precision and recall, which makes it an important preprocessing step when dealing with noisy data.
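The abstract above says only that CSM feature vectors represent distance patterns between residues. One plausible reading, sketched below, is to count how many residue pairs fall within each of a series of increasing distance cutoffs. The function name, the choice of one coordinate per residue, the cutoff range, and the step size are assumptions made for illustration; the SVD preprocessing step is not shown.

```python
from itertools import combinations
from math import dist


def cutoff_scanning_vector(coords, d_min=0.0, d_max=30.0, step=1.0):
    """Return, for each cutoff, the number of residue pairs at most that far apart.

    `coords` holds one (x, y, z) point per residue (e.g. an alpha carbon),
    in angstroms.
    """
    pair_distances = [dist(a, b) for a, b in combinations(coords, 2)]
    n_cutoffs = int(round((d_max - d_min) / step)) + 1
    cutoffs = [d_min + i * step for i in range(n_cutoffs)]
    return [sum(1 for d in pair_distances if d <= c) for c in cutoffs]


# Hypothetical four-residue chain laid out on a line, 3.8 angstroms apart.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
vector = cutoff_scanning_vector(toy_coords, d_min=0.0, d_max=12.0, step=4.0)
```

Each protein yields one such vector, so a set of proteins forms a matrix with one row per protein and one column per cutoff, which is the kind of input a classifier or a dimensionality-reduction step would consume.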
Affiliation(s)
- Douglas E V Pires
- Department of Biochemistry and Immunology, Universidade Federal de Minas Gerais, Belo Horizonte, 31270-901, Brazil.
|
45
|
The roles and evolutionary patterns of intronless genes in deuterostomes. Comp Funct Genomics 2011; 2011:680673. [PMID: 21860604 PMCID: PMC3155783 DOI: 10.1155/2011/680673] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 07/23/2010] [Revised: 04/13/2011] [Accepted: 06/22/2011] [Indexed: 12/26/2022]
Abstract
Genes without introns are a characteristic feature of prokaryotes, but there are still a number of intronless genes in eukaryotes. Studying these eukaryotic genes with prokaryotic architecture could help to understand the evolutionary patterns of related genes and genomes. Our analyses revealed a number of intronless genes in 6 deuterostomes (sea urchin, sea squirt, zebrafish, chicken, platypus, and human). We also determined the conservation of each intronless gene in archaea, bacteria, fungi, plants, metazoans, and other eukaryotes. The proportions of intronless genes inherited from the common ancestor of archaea, bacteria, and eukaryotes in these species were consistent with their phylogenetic positions, with higher proportions of ancient intronless genes in more primitive species. In these species, intronless genes belong to different cellular roles and gene ontology (GO) categories, and some of these functions are very basic. Some intronless genes are derived from other intronless genes or from multiexon genes in each species. In conclusion, we showed that a varying number and proportion of intronless genes reside in these 6 deuterostomes, and that some of them have important functions. These genes are good candidates for subsequent functional and evolutionary analyses.
|
46
|
Clark WT, Radivojac P. Analysis of protein function and its prediction from amino acid sequence. Proteins 2011; 79:2086-96. [PMID: 21671271 DOI: 10.1002/prot.23029] [Citation(s) in RCA: 93] [Impact Index Per Article: 6.6] [Received: 10/19/2010] [Revised: 02/15/2011] [Accepted: 03/03/2011] [Indexed: 01/02/2023]
Abstract
Understanding protein function is one of the keys to understanding life at the molecular level. It is also important in the context of human disease because many conditions arise as a consequence of alterations of protein function. The recent availability of relatively inexpensive sequencing technology has resulted in thousands of completely or partially sequenced genomes with millions of functionally uncharacterized proteins. Such a large volume of data, combined with the lack of high-throughput experimental assays to functionally annotate proteins, contributes to the growing importance of automated function prediction. Here, we study proteins annotated by Gene Ontology (GO) terms and estimate the accuracy of functional transfer from protein sequence only. We find that the transfer of GO terms by pairwise sequence alignments is only moderately accurate, showing a surprisingly small influence of sequence identity (SID) in a broad range (30-100%). We developed and evaluated a new predictor of protein function, functional annotator (FANN), from amino acid sequence. The predictor exploits a multioutput neural network framework which is well suited to simultaneously modeling dependencies between functional terms. Experiments provide evidence that FANN-GO (predictor of GO terms; available from http://www.informatics.indiana.edu/predrag) outperforms standard methods such as transfer by global or local SID as well as GOtcha, a method that incorporates the structure of GO.
Affiliation(s)
- Wyatt T Clark
- School of Informatics and Computing, Indiana University, Bloomington, Indiana 47405, USA
|
47
|
Flores CL, Gancedo C. Unraveling moonlighting functions with yeasts. IUBMB Life 2011; 63:457-62. [PMID: 21491559 DOI: 10.1002/iub.454] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Received: 02/11/2011] [Accepted: 02/22/2011] [Indexed: 01/21/2023]
Abstract
This review considers the use of yeasts to study protein moonlighting functions. The cases discussed highlight the possibilities offered by well-developed yeast genetics for the study of moonlighting mechanisms. The ability to generate sets of mutants encoding different protein variants has, in some cases, made it possible to map the regions that participate in the moonlighting function. We discuss cases of enzymes that moonlight in activities as different as control of transcription, assembly of multimeric proteins, stabilization of mitochondrial DNA, or biosynthesis of CoA. The moonlighting role of an enzyme and its metabolic function seem to have evolved independently, as indicated by the finding that a protein may moonlight in one yeast species but not in others. Yeasts may open ways to study possible evolutionary relationships among moonlighting proteins.
Affiliation(s)
- Carmen-Lisset Flores
- Department of Metabolism and Cell Signaling, Instituto de Investigaciones Biomédicas Alberto Sols, CSIC-UAM, Madrid, Spain
|
48
|
Pritchard L, Birch P. A systems biology perspective on plant-microbe interactions: biochemical and structural targets of pathogen effectors. Plant Sci 2011; 180:584-603. [PMID: 21421407 DOI: 10.1016/j.plantsci.2010.12.008] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.3] [Received: 11/01/2010] [Revised: 12/13/2010] [Accepted: 12/15/2010] [Indexed: 05/22/2023]
Abstract
Plants have biochemical defences against stresses from predators, parasites and pathogens. In this review we discuss the interaction of plant defences with microbial pathogens such as bacteria, fungi and oomycetes, and viruses. We examine principles of complex dynamic networks that allow identification of network components that are differentially and predictably sensitive to perturbation, thus making them likely effector targets. We relate these principles to recent developments in our understanding of known effector targets in plant-pathogen systems, and propose a systems-level framework for the interpretation and modelling of host-microbe interactions mediated by effectors. We describe this framework briefly, and conclude by discussing useful experimental approaches for populating this framework.
Affiliation(s)
- Leighton Pritchard
- Plant Pathology Programme, SCRI, Errol Road, Invergowrie, Dundee, Scotland DD2 5DA, UK.
|
49
|
Sendiña-Nadal I, Ofran Y, Almendral JA, Buldú JM, Leyva I, Li D, Havlin S, Boccaletti S. Unveiling protein functions through the dynamics of the interaction network. PLoS One 2011; 6:e17679. [PMID: 21408013 PMCID: PMC3052369 DOI: 10.1371/journal.pone.0017679] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 12/27/2010] [Accepted: 02/05/2011] [Indexed: 01/02/2023]
Abstract
Protein interaction networks have become a tool to study biological processes, either for predicting molecular functions or for designing new drugs to regulate the main biological interactions. Furthermore, such networks are known to be organized in sub-networks of proteins contributing to the same cellular function. However, protein function prediction is not accurate, and each protein has traditionally been assigned to only one function by the network formalism. By considering the network of physical interactions between yeast proteins together with a manual, single functional classification scheme, we introduce a method able to reveal important information on protein function at both the micro- and macro-scale. In particular, inspecting the properties of oscillatory dynamics on top of the protein interaction network leads to the identification of misclassification problems in protein function assignments, as well as to the correct identification of protein functions. We also demonstrate that our approach can give a network representation of the meta-organization of biological processes by unraveling the interactions between different functional classes.
|
50
|
Jaeger S, Sers CT, Leser U. Combining modularity, conservation, and interactions of proteins significantly increases precision and coverage of protein function prediction. BMC Genomics 2010; 11:717. [PMID: 21171995 PMCID: PMC3017542 DOI: 10.1186/1471-2164-11-717] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 04/06/2010] [Accepted: 12/20/2010] [Indexed: 11/10/2022]
Abstract
BACKGROUND While the number of newly sequenced genomes and genes is constantly increasing, elucidation of their function is still a laborious and time-consuming task. This has led to the development of a wide range of methods for predicting protein functions in silico. We report on a new method that predicts function based on a combination of information about protein interactions, orthology, and the conservation of protein networks in different species. RESULTS We show that aggregation of these independent sources of evidence leads to a drastic increase in the number and quality of predictions when compared to baselines and other methods reported in the literature. For instance, our method generates more than 12,000 novel protein functions for human with an estimated precision of ~76%, among which are 7,500 new functional annotations for 1,973 human proteins that previously had zero or only one function annotated. We also verified our predictions on a set of genes that play an important role in colorectal cancer (MLH1, PMS2, EPHB4) and could confirm more than 73% of them based on evidence in the literature. CONCLUSIONS The combination of different methods into a single, comprehensive prediction method infers thousands of protein functions for every species included in the analysis at varying, yet always high, levels of precision and very good coverage.
Affiliation(s)
- Samira Jaeger
- Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin, Germany.
|