51
Di Lena P, Domeniconi G, Margara L, Moro G. GOTA: GO term annotation of biomedical literature. BMC Bioinformatics 2015; 16:346. [PMID: 26511083 PMCID: PMC4625458 DOI: 10.1186/s12859-015-0777-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 03/16/2015] [Accepted: 10/13/2015] [Indexed: 12/12/2022]
Abstract
Background: Functional annotation of genes and gene products is a major challenge in the post-genomic era. Nowadays, gene function curation is largely based on manual assignment of Gene Ontology (GO) annotations to genes by using published literature. The annotation task is extremely time-consuming, so there is increasing interest in automated tools that can assist human experts. Results: Here we introduce GOTA, a GO term annotator for biomedical literature. The proposed approach uses only information that is readily available from public repositories, and it is easily expandable to handle novel sources of information. We assess the classification capabilities of GOTA on a large benchmark set of publications. The overall performance is encouraging in comparison to the state of the art in multi-label classification over large taxonomies. Furthermore, the experimental tests provide some interesting insights into the potential improvement of automated annotation tools. Conclusions: GOTA implements a flexible and expandable model for GO annotation of biomedical literature. The current version of the GOTA tool is freely available at http://gota.apice.unibo.it. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0777-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Pietro Di Lena
- Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via Sacchi 3, Cesena, 47521, Italy.
- Giacomo Domeniconi
- Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via Sacchi 3, Cesena, 47521, Italy.
- Luciano Margara
- Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via Sacchi 3, Cesena, 47521, Italy.
- Gianluca Moro
- Department of Computer Science and Engineering, University of Bologna, Cesena Campus, Via Sacchi 3, Cesena, 47521, Italy.
52
Richards AJ, Herrel A, Bonneaud C. htsint: a Python library for sequencing pipelines that combines data through gene set generation. BMC Bioinformatics 2015; 16:307. [PMID: 26399714 PMCID: PMC4581156 DOI: 10.1186/s12859-015-0729-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/25/2015] [Accepted: 09/08/2015] [Indexed: 11/10/2022]
Abstract
Background: Sequencing technologies provide a wealth of details in terms of genes, expression, splice variants, polymorphisms, and other features. A standard for sequencing analysis pipelines is to put genomic or transcriptomic features into a context of known functional information, but the relationships between ontology terms are often ignored. For RNA-Seq, considering genes and their genetic variants at the group level provides a convenient way both to integrate annotation data and to detect small coordinated changes between experimental conditions, something gene-level analyses are known to miss. Results: We introduce the high throughput data integration tool, htsint, as an extension to the commonly used gene set enrichment frameworks. The central aim of htsint is to compile annotation information from one or more taxa in order to calculate functional distances among all genes in a specified gene space. Spectral clustering is then used to partition the genes, thereby generating functional modules. The gene space can range from a targeted list of genes, like a specific pathway, all the way to an ensemble of genomes. Given a collection of gene sets and a count matrix of transcriptomic features (e.g. expression, polymorphisms), the gene sets produced by htsint can be tested for ‘enrichment’ or conditional differences using one of a number of commonly available packages. Conclusion: The database and bundled tools to generate functional modules were designed with sequencing pipelines in mind, but the toolkit nature of htsint allows it to also be used in other areas of genomics. The software is freely available as a Python library through GitHub at https://github.com/ajrichards/htsint.
Affiliation(s)
- Adam J Richards
- Station d'Ecologie Expérimentale du CNRS, USR 2936, Route du CNRS, Moulis, 09200, France.
- Anthony Herrel
- UMR 7179 CNRS/MNHN, Département d'Ecologie et de Gestion de la Biodiversité, 57 rue Cuvier, Case postale 55, Paris, 75231, France; Ghent University, Evolutionary Morphology of Vertebrates, K.L. Ledeganckstraat 35, Ghent, B-9000, Belgium.
- Camille Bonneaud
- Station d'Ecologie Expérimentale du CNRS, USR 2936, Route du CNRS, Moulis, 09200, France; Centre for Ecology & Conservation, College of Life and Environmental Sciences, University of Exeter, Penryn TR10 9FE, Cornwall, UK.
54
Lavezzo E, Falda M, Fontana P, Bianco L, Toppo S. Enhancing protein function prediction with taxonomic constraints--The Argot2.5 web server. Methods 2015; 93:15-23. [PMID: 26318087 DOI: 10.1016/j.ymeth.2015.08.021] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Received: 04/27/2015] [Revised: 08/14/2015] [Accepted: 08/25/2015] [Indexed: 10/23/2022]
Abstract
Argot2.5 (Annotation Retrieval of Gene Ontology Terms) is a web server designed to predict protein function. It is an updated version of the previous Argot2 enriched with new features in order to enhance its usability and its overall performance. The algorithmic strategy exploits the grouping of Gene Ontology terms by means of semantic similarity to infer protein function. The tool has been challenged over two independent benchmarks and compared to Argot2, PANNZER, and a baseline method relying on BLAST, proving to obtain a better performance thanks to the contribution of some key interventions in critical steps of the working pipeline. The most effective changes regard: (a) the selection of the input data from sequence similarity searches performed against a clustered version of UniProt databank and a remodeling of the weights given to Pfam hits, (b) the application of taxonomic constraints to filter out annotations that cannot be applied to proteins belonging to the species under investigation. The taxonomic rules are derived from our in-house developed tool, FunTaxIS, that extends those provided by the Gene Ontology consortium. The web server is free for academic users and is available online at http://www.medcomp.medicina.unipd.it/Argot2-5/.
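The taxonomic-constraint step described in this abstract can be pictured with a minimal Python sketch. It is not the Argot2.5 or FunTaxIS implementation: the lineage table, the term labels, the `only_in`/`never_in` rule layout and the function names are all hypothetical, and the rules are toy examples rather than actual Gene Ontology constraints.

```python
from dataclasses import dataclass, field

# Hypothetical lineage table: each species maps to the set of taxa it belongs to.
LINEAGE = {
    "Homo sapiens": {"Homo sapiens", "Metazoa", "Eukaryota"},
    "Arabidopsis thaliana": {"Arabidopsis thaliana", "Viridiplantae", "Eukaryota"},
}


@dataclass
class TaxonRule:
    """Toy constraint: a term is valid only in some taxa and/or never in others."""
    only_in: set = field(default_factory=set)
    never_in: set = field(default_factory=set)


# Toy rules keyed by term label (illustrative, not real GO taxon constraints).
RULES = {
    "photosynthesis": TaxonRule(only_in={"Viridiplantae"}),
    "lactation": TaxonRule(only_in={"Metazoa"}),
    "chloroplast organization": TaxonRule(never_in={"Metazoa"}),
}


def is_allowed(term, species):
    """Return True if no rule forbids annotating `species` with `term`."""
    lineage = LINEAGE[species]
    rule = RULES.get(term)
    if rule is None:
        return True  # unconstrained terms are always kept
    if rule.only_in and not (rule.only_in & lineage):
        return False  # species lies outside every permitted taxon
    if rule.never_in & lineage:
        return False  # species lies inside a forbidden taxon
    return True


def filter_predictions(predictions, species):
    """Drop (term, score) predictions that violate a taxonomic rule."""
    return [(term, score) for term, score in predictions if is_allowed(term, species)]
```

Applied to a list of scored predictions, `filter_predictions` keeps the original order and scores and only removes terms that the toy rules exclude for the given species.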
Affiliation(s)
- Enrico Lavezzo
- Department of Molecular Medicine, University of Padova, Padova, Italy
- Marco Falda
- Department of Molecular Medicine, University of Padova, Padova, Italy
- Paolo Fontana
- Istituto Agrario San Michele all'Adige Research and Innovation Centre, Foundation Edmund Mach, Trento, Italy
- Luca Bianco
- Istituto Agrario San Michele all'Adige Research and Innovation Centre, Foundation Edmund Mach, Trento, Italy
- Stefano Toppo
- Department of Molecular Medicine, University of Padova, Padova, Italy.
55
Yu G, Zhu H, Domeniconi C, Liu J. Predicting protein function via downward random walks on a gene ontology. BMC Bioinformatics 2015; 16:271. [PMID: 26310806 PMCID: PMC4551531 DOI: 10.1186/s12859-015-0713-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 02/14/2015] [Accepted: 08/20/2015] [Indexed: 12/24/2022]
Abstract
Background: High-throughput bio-techniques accumulate an ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite the advances in the functional annotation of genes (or their protein products). Due to experimental techniques and to the research bias in biology, the regularly updated functional annotation databases, such as the Gene Ontology (GO), are far from being complete. Given the importance of protein functions for biological studies and drug design, proteins should be more comprehensively and precisely annotated. Results: We propose downward Random Walks (dRW) to predict missing (or new) functions of partially annotated proteins. In particular, we apply downward random walks with restart on the GO directed acyclic graph, along with the available functions of a protein, to estimate the probability of missing functions. To further boost the prediction accuracy, we extend dRW to dRW-kNN. dRW-kNN computes the semantic similarity between proteins based on the functional annotations of proteins; it then predicts functions based on the functions estimated by dRW, together with the functions associated with the k nearest proteins. Our proposed models can predict two kinds of missing functions: (i) the ones that are missing for a protein but associated with other proteins of interest; (ii) the ones that are not available for any protein of interest, but exist in the GO hierarchy. Experimental results on the proteins of Yeast and Human show that dRW and dRW-kNN can replenish functions more accurately than other related approaches, especially for sparse functions associated with no more than 10 proteins. Conclusion: The empirical study shows that the semantic similarity between GO terms and the ontology hierarchy play important roles in predicting protein function. The proposed dRW and dRW-kNN can serve as tools for replenishing functions of partially annotated proteins.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0713-y) contains supplementary material, which is available to authorized users.
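As a rough illustration of a downward random walk with restart on a directed acyclic graph, the sketch below propagates probability mass from a protein's known terms to their descendants. It is not the authors' dRW formulation: the uniform split among children, the rule that mass reaching a leaf returns to the seed terms, the restart value and the function name `downward_rwr` are all assumptions made for this example.

```python
def downward_rwr(children, seeds, restart=0.5, iters=50):
    """Random walk with restart that only follows parent -> child edges.

    children: dict mapping a term to the list of its child terms.
    seeds:    terms already annotated to the protein (restart distribution).
    Returns a dict term -> score; higher scores suggest likelier missing terms.
    """
    nodes = set(children) | {c for kids in children.values() for c in kids} | set(seeds)
    # Restart distribution: uniform over the seed terms.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        nxt = {n: restart * p0[n] for n in nodes}
        for parent in nodes:
            kids = children.get(parent, [])
            if not kids:
                # Assumption: a walker stuck on a leaf jumps back to the seeds.
                for n in nodes:
                    nxt[n] += (1.0 - restart) * p[parent] * p0[n]
                continue
            # Assumption: children are chosen uniformly (no semantic weighting).
            share = (1.0 - restart) * p[parent] / len(kids)
            for kid in kids:
                nxt[kid] += share
        p = nxt
    return p
```

Because the walk never moves upward, terms that are not descendants of a seed keep a score of zero, so candidate new functions are always more specific than the existing annotations.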
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Sciences, Southwest University, Beibei, Chongqing, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
- Hailong Zhu
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong.
- Jiming Liu
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong.
56
Da Silveira M, Dos Reis JC, Pruski C. Management of Dynamic Biomedical Terminologies: Current Status and Future Challenges. Yearb Med Inform 2015; 10:125-33. [PMID: 26293859 DOI: 10.15265/iy-2015-002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/24/2022]
Abstract
OBJECTIVES Controlled terminologies and their dependent artefacts provide a consensual understanding of a domain while reducing ambiguities and enabling reasoning. However, the evolution of a domain's knowledge directly impacts these terminologies and generates inconsistencies in the underlying biomedical information systems. In this article, we review existing work addressing the dynamic aspect of terminologies as well as their effects on mappings and semantic annotations. METHODS We investigate approaches related to the identification, characterization and propagation of changes in terminologies, mappings and semantic annotations including techniques to update their content. RESULTS AND CONCLUSION Based on the explored issues and existing methods, we outline open research challenges requiring investigation in the near future.
Affiliation(s)
- M Da Silveira
- Dr. Marcos Da Silveira, Luxembourg Institute of Science and Technology (LIST), 5, avenue des Hauts-Fourneaux, 4362 Esch/Alzette, Luxembourg
57
Das S, Lee D, Sillitoe I, Dawson NL, Lees JG, Orengo CA. Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 2015; 31:3460-7. [PMID: 26139634 PMCID: PMC4612221 DOI: 10.1093/bioinformatics/btv398] [Citation(s) in RCA: 75] [Impact Index Per Article: 7.5] [Received: 05/02/2015] [Accepted: 06/24/2015] [Indexed: 11/18/2022]
Abstract
Motivation: Computational approaches that can predict protein functions are essential to bridge the widening function annotation gap, especially since <1.0% of all proteins in UniProtKB have been experimentally characterized. We present a domain-based method for protein function classification and prediction of functional sites that exploits functional sub-classification of CATH superfamilies. The superfamilies are sub-classified into functional families (FunFams) using a hierarchical clustering algorithm supervised by a new classification method, FunFHMMer. Results: FunFHMMer generates more functionally coherent groupings of protein sequences than other domain-based protein classifications. This has been validated using known functional information. The conserved positions predicted by the FunFams are also found to be enriched in known functional residues. Moreover, the functional annotations provided by the FunFams are found to be more precise than other domain-based resources. FunFHMMer currently identifies 110 439 FunFams in 2735 superfamilies which can be used to functionally annotate >16 million domain sequences. Availability and implementation: All FunFam annotation data are made available through the CATH webpages (http://www.cathdb.info). The FunFHMMer webserver (http://www.cathdb.info/search/by_funfhmmer) allows users to submit query sequences for assignment to a CATH FunFam. Contact: sayoni.das.12@ucl.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sayoni Das
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Christine A Orengo
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
58
Chapple CE, Robisson B, Spinelli L, Guien C, Becker E, Brun C. Extreme multifunctional proteins identified from a human protein interaction network. Nat Commun 2015; 6:7412. [PMID: 26054620 PMCID: PMC4468855 DOI: 10.1038/ncomms8412] [Citation(s) in RCA: 77] [Impact Index Per Article: 7.7] [Received: 07/21/2014] [Accepted: 05/06/2015] [Indexed: 12/30/2022]
Abstract
Moonlighting proteins are a subclass of multifunctional proteins whose functions are unrelated. Although they may play important roles in cells, there has been no large-scale method to identify them, nor any effort to characterize them as a group. Here, we propose the first method for the identification of ‘extreme multifunctional’ proteins from an interactome as a first step to characterize moonlighting proteins. By combining network topological information with protein annotations, we identify 430 extreme multifunctional proteins (3% of the human interactome). We show that the candidates form a distinct sub-group of proteins, characterized by specific features, which form a signature of extreme multifunctionality. Overall, extreme multifunctional proteins are enriched in linear motifs and less intrinsically disordered than network hubs. We also provide MoonDB, a database containing information on all the candidates identified in the analysis and a set of manually curated human moonlighting proteins. Proteins are sometimes implicated in separate and seemingly unrelated processes, so-called moonlighting functions. Here the authors use bioinformatics tools to identify extreme multifunctional proteins and define a signature of extreme multifunctionality.
Affiliation(s)
- Charles E Chapple
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France
- Benoit Robisson
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France
- Lionel Spinelli
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France [3] Aix-Marseille University, CIML, Marseille F-13009, France [4] CNRS, UMR 7280, Marseille F-13009, France [5] INSERM, U631, Marseille F-13009, France
- Céline Guien
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France
- Emmanuelle Becker
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France
- Christine Brun
- [1] Aix-Marseille University, TAGC, Marseille F-13009, France [2] INSERM UMR_S1090, Marseille F-13009, France [3] CNRS, Marseille F-13009, France
59
Chapple CE, Herrmann C, Brun C. PrOnto database: GO term functional dissimilarity inferred from biological data. Front Genet 2015; 6:200. [PMID: 26089836 PMCID: PMC4452890 DOI: 10.3389/fgene.2015.00200] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 02/20/2015] [Accepted: 05/21/2015] [Indexed: 12/22/2022]
Abstract
Moonlighting proteins are defined by their involvement in multiple, unrelated functions. The computational prediction of such proteins requires a formal method of assessing the similarity of cellular processes, for example, by identifying dissimilar Gene Ontology terms. While many measures of Gene Ontology term similarity exist, most depend on abstract mathematical analyses of the structure of the GO tree and do not necessarily represent the underlying biology. Here, we propose two metrics of GO term functional dissimilarity derived from biological information, one based on the protein annotations and the other on the interactions between proteins. They have been collected in the PrOnto database, a novel tool which can be of particular use for the identification of moonlighting proteins. The database can be queried via a web-based interface which is freely available at http://tagc.univ-mrs.fr/pronto.
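The abstract does not give the formulas behind the two PrOnto metrics. Purely to illustrate what an annotation-based dissimilarity between two terms can look like, the sketch below uses a Jaccard distance over the sets of proteins annotated to each term. The Jaccard choice, the function name and the input layout are assumptions, not the published method.

```python
def annotation_dissimilarity(annotations, term_a, term_b):
    """Jaccard distance between the protein sets annotated to two terms.

    annotations: dict mapping a term label to the set of proteins annotated to it.
    Returns 0.0 when both terms annotate exactly the same proteins and 1.0 when
    they share none.
    """
    proteins_a = annotations.get(term_a, set())
    proteins_b = annotations.get(term_b, set())
    union = proteins_a | proteins_b
    if not union:
        raise ValueError("neither term has annotated proteins")
    return 1.0 - len(proteins_a & proteins_b) / len(union)
```

Under this stand-in measure, two terms that are never assigned to the same protein are maximally dissimilar, so a protein that carries both would be a candidate for closer inspection.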
Affiliation(s)
- Charles E Chapple
- Inserm, UMR_S1090 TAGC Marseille, France; Aix-Marseille Université, UMR_S1090 TAGC Marseille, France
- Carl Herrmann
- Inserm, UMR_S1090 TAGC Marseille, France; Aix-Marseille Université, UMR_S1090 TAGC Marseille, France
- Christine Brun
- Inserm, UMR_S1090 TAGC Marseille, France; Aix-Marseille Université, UMR_S1090 TAGC Marseille, France; Centre National de la Recherche Scientifique Marseille, France
60
Masseroli M, Canakoglu A, Quigliatti M. Detection of gene annotations and protein-protein interaction associated disorders through transitive relationships between integrated annotations. BMC Genomics 2015; 16:S5. [PMID: 26046679 PMCID: PMC4460591 DOI: 10.1186/1471-2164-16-s6-s5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022]
Abstract
Background: Increasingly large amounts of heterogeneous and valuable controlled biomolecular annotations are available, but they are far from exhaustive and are scattered across many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results: By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplements consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56% and 96.61%, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with newly detected and known annotation to the same pathway; considering also the newly detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway.
Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions: Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries.
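The idea of detecting annotations through transitive relationships, and of scoring them by precision and recall, can be sketched as follows. This is not the GPKB procedure: the gene → protein → pathway chain, the identifiers and both function names are hypothetical, and the semantic-type and cardinality checks mentioned in the abstract are omitted.

```python
def infer_transitive(gene_to_protein, protein_to_pathway, known):
    """Compose gene->protein and protein->pathway pairs into gene->pathway pairs.

    Only pairs absent from `known` are returned, i.e. candidate new annotations.
    """
    inferred = set()
    for gene, protein in gene_to_protein:
        for linked_protein, pathway in protein_to_pathway:
            if linked_protein == protein and (gene, pathway) not in known:
                inferred.add((gene, pathway))
    return inferred


def precision_recall(predicted, reference):
    """Precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall
```

In practice the reference set would come from held-out or independently curated annotations; here it is simply whatever set of pairs the caller supplies.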
61
Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, Orengo CA. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res 2015; 43:W148-53. [PMID: 25964299 PMCID: PMC4489299 DOI: 10.1093/nar/gkv488] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 02/14/2015] [Accepted: 05/02/2015] [Indexed: 12/20/2022]
Abstract
The widening function annotation gap in protein databases and the increasing number and diversity of the proteins being sequenced present new challenges to protein function prediction methods. Multidomain proteins complicate the protein sequence–structure–function relationship further as new combinations of domains can expand the functional repertoire, creating new proteins and functions. Here, we present the FunFHMMer web server, which provides Gene Ontology (GO) annotations for query protein sequences based on the functional classification of the domain-based CATH-Gene3D resource. Our server also provides valuable information for the prediction of functional sites. The predictive power of FunFHMMer has been validated on a set of 95 proteins where FunFHMMer performs better than BLAST, Pfam and CDD. Recent validation by an independent international competition ranks FunFHMMer as one of the top function prediction methods in predicting GO annotations for both the Biological Process and Molecular Function Ontology. The FunFHMMer web server is available at http://www.cathdb.info/search/by_funfhmmer.
Affiliation(s)
- Sayoni Das
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Ian Sillitoe
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- David Lee
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Jonathan G Lees
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- Natalie L Dawson
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
- John Ward
- Department of Biochemical Engineering, UCL, Gower Street, WC1E 6BT, UK
- Christine A Orengo
- Institute of Structural and Molecular Biology, UCL, Darwin Building, Gower Street, WC1E 6BT, UK
62
Bastian FB, Chibucos MC, Gaudet P, Giglio M, Holliday GL, Huang H, Lewis SE, Niknejad A, Orchard S, Poux S, Skunca N, Robinson-Rechavi M. The Confidence Information Ontology: a step towards a standard for asserting confidence in annotations. Database: The Journal of Biological Databases and Curation 2015; 2015:bav043. [PMID: 25957950 PMCID: PMC4425939 DOI: 10.1093/database/bav043] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 11/19/2014] [Accepted: 04/15/2015] [Indexed: 02/01/2023]
Abstract
Biocuration has become a cornerstone for analyses in biology, and to meet needs, the number of annotations has grown considerably in recent years. However, the reliability of these annotations varies; it has thus become necessary to be able to assess the confidence in annotations. Although several resources already provide confidence information about the annotations that they produce, a standard way of providing such information has yet to be defined. This lack of standardization undermines the propagation of knowledge across resources, as well as the credibility of results from high-throughput analyses. Seeded at a workshop during the Biocuration 2012 conference, a working group has been created to address this problem. We present here the elements that were identified as essential for assessing confidence in annotations, as well as a draft ontology, the Confidence Information Ontology, to illustrate how the problems identified could be addressed. We hope that this effort will provide a home for discussing this major issue among the biocuration community. Tracker URL: https://github.com/BgeeDB/confidence-information-ontology Ontology URL: https://raw.githubusercontent.com/BgeeDB/confidence-information-ontology/master/src/ontology/cio-simple.obo
Affiliation(s)
- Frederic B Bastian
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
- Marcus C Chibucos
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
- Pascale Gaudet
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
- Michelle Giglio
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Gemma L Holliday
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Hong Huang
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Suzanna E Lewis
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Anne Niknejad
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Sandra Orchard
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Sylvain Poux
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Nives Skunca
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
| | - Marc Robinson-Rechavi
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland, Department of Microbiology and Immunology and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, SIB Swiss Institute of Bioinformatics, 1 Rue Michel Servet, 1211 Geneva, Switzerland, Department of Medicine and Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore MD, USA, Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158, USA, School of Information, University of South Florida, Tampa, FL, 33647, USA, Genomics Division, Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, 94720 CA USA, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland, ETH Zurich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland, SIB Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland and University College London, Gower St, London WC1E 6BT, UK
|
63
|
Škunca N, Dessimoz C. Phylogenetic profiling: how much input data is enough? PLoS One 2015; 10:e0114701. [PMID: 25679783 PMCID: PMC4332489 DOI: 10.1371/journal.pone.0114701] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.8] [Received: 07/01/2014] [Accepted: 11/10/2014] [Indexed: 12/04/2022] Open
Abstract
Phylogenetic profiling is a well-established approach for predicting gene function based on patterns of gene presence and absence across species. Most recent developments have focused on methodological improvements, but relatively little is known about the effect of input data size on the quality of predictions. In this work, we ask: how many genomes and functional annotations need to be considered for phylogenetic profiling to be effective? Phylogenetic profiling generally benefits from an increased amount of input data. However, by decomposing this improvement in predictive accuracy into the contributions of additional genomes and of additional annotations, we observed diminishing returns in adding more than ∼100 genomes, whereas increasing the number of annotations remained strongly beneficial throughout. We also observed that maximising phylogenetic diversity within a clade of interest improves predictive accuracy, but the effect is small compared to changes in the number of genomes under comparison. Finally, we show that these findings are supported in light of the Open World Assumption, which posits that functional annotation databases are inherently incomplete. All the tools and data used in this work are available for reuse from http://lab.dessimoz.org/14_phylprof. Scripts used to analyse the data are available on request from the authors.
Affiliation(s)
- Nives Škunca
- ETH Zürich, Department of Computer Science, Universitätstr. 19, 8092 Zürich, Switzerland
- Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland
- University College London, Gower St, London WC1E 6BT, UK
| | - Christophe Dessimoz
- Swiss Institute of Bioinformatics, Universitätstr. 6, 8092 Zürich, Switzerland
- University College London, Gower St, London WC1E 6BT, UK
|
64
|
Carnielli CM, Winck FV, Paes Leme AF. Functional annotation and biological interpretation of proteomics data. Biochimica et Biophysica Acta - Proteins and Proteomics 2015; 1854:46-54. [DOI: 10.1016/j.bbapap.2014.10.019] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 09/11/2014] [Revised: 10/07/2014] [Accepted: 10/21/2014] [Indexed: 12/22/2022]
|
65
|
Trachana K, Forslund K, Larsson T, Powell S, Doerks T, von Mering C, Bork P. A phylogeny-based benchmarking test for orthology inference reveals the limitations of function-based validation. PLoS One 2014; 9:e111122. [PMID: 25369365 PMCID: PMC4219706 DOI: 10.1371/journal.pone.0111122] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.2] [Received: 04/15/2014] [Accepted: 09/23/2014] [Indexed: 11/19/2022] Open
Abstract
Accurate orthology prediction is crucial for many applications in the post-genomic era. The lack of broadly accepted benchmark tests precludes a comprehensive analysis of orthology inference. So far, functional annotation between orthologs serves as a performance proxy. However, this violates the fundamental principle of orthology as an evolutionary definition, and it is often not applicable due to limited experimental evidence for most species. Therefore, we constructed high-quality "gold standard" orthologous groups that can serve as a benchmark set for orthology inference in bacterial species. Herein, we used this dataset to demonstrate 1) why a manually curated, phylogeny-based dataset is more appropriate for benchmarking orthology than other popular practices and 2) how it guides database design and parameterization through careful error quantification. More specifically, we illustrate how function-based tests often fail to identify false assignments, misjudging the true performance of orthology inference methods. We also examined how our dataset can instruct the selection of a "core" species repertoire to improve detection accuracy. We conclude that including more genomes at the proper evolutionary distances can influence the overall quality of orthology detection. The curated gene families, called Reference Orthologous Groups, are publicly available at http://eggnog.embl.de/orthobench2.
Affiliation(s)
- Kalliopi Trachana
- Institute for Systems Biology, Seattle, WA, United States of America
| | - Kristoffer Forslund
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Tomas Larsson
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Sean Powell
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Tobias Doerks
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
| | - Christian von Mering
- Institute of Molecular Life Sciences, University of Zurich and Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Max-Delbruck-Centre for Molecular Medicine, Berlin, Germany
|
66
|
Best A, James K, Dalgliesh C, Hong E, Kheirolahi-Kouhestani M, Curk T, Xu Y, Danilenko M, Hussain R, Keavney B, Wipat A, Klinck R, Cowell IG, Cheong Lee K, Austin CA, Venables JP, Chabot B, Santibanez Koref M, Tyson-Capper A, Elliott DJ. Human Tra2 proteins jointly control a CHEK1 splicing switch among alternative and constitutive target exons. Nat Commun 2014; 5:4760. [PMID: 25208576 PMCID: PMC4175592 DOI: 10.1038/ncomms5760] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.4] [Received: 04/17/2014] [Accepted: 07/22/2014] [Indexed: 01/11/2023] Open
Abstract
Alternative splicing, the production of multiple messenger RNA isoforms from a single gene, is regulated in part by RNA binding proteins (RBPs). While the RBPs transformer2 alpha (Tra2α) and Tra2β have both been implicated in the regulation of alternative splicing, their relative contributions to this process are not well understood. Here we find that simultaneous, but not individual, depletion of Tra2α and Tra2β induces substantial shifts in splicing of endogenous Tra2β target exons, and that both constitutive and alternative target exons are under dual Tra2α-Tra2β control. Target exons are enriched in genes associated with chromosome biology including CHEK1, which encodes a key DNA damage response protein. Dual Tra2 protein depletion reduces expression of full-length CHK1 protein and results in the accumulation of the DNA damage marker γH2AX and decreased cell viability. We conclude that Tra2 proteins jointly control constitutive and alternative splicing patterns via paralog compensation to control pathways essential to the maintenance of cell viability.
Affiliation(s)
- Andrew Best
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Katherine James
- School of Computing Science, Claremont Tower, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
| | - Caroline Dalgliesh
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Elaine Hong
- Institute for Cellular Medicine, Newcastle University, Framlington Place, Newcastle NE2 4HH, UK
| | | | - Tomaz Curk
- Faculty of Computer and Information Science, University of Ljubljana, Trzaska cesta 25, SI-1000, Ljubljana, Slovenia
| | - Yaobo Xu
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Marina Danilenko
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Rafiq Hussain
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Bernard Keavney
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
- Institute of Cardiovascular Sciences, The University of Manchester, Manchester M13 9NT, UK
| | - Anil Wipat
- School of Computing Science, Claremont Tower, Newcastle University, Newcastle upon Tyne NE1 7RU, UK
| | - Roscoe Klinck
- Department of Microbiology and Infectious Diseases, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, Canada J1E 4K8
| | - Ian G. Cowell
- Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle NE2 4HH, UK
| | - Ka Cheong Lee
- Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle NE2 4HH, UK
| | - Caroline A. Austin
- Institute for Cell and Molecular Biosciences, Newcastle University, Newcastle NE2 4HH, UK
| | - Julian P. Venables
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Benoit Chabot
- Department of Microbiology and Infectious Diseases, Faculty of Medicine and Health Sciences, Université de Sherbrooke, Sherbrooke, Québec, Canada J1E 4K8
| | - Mauro Santibanez Koref
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
| | - Alison Tyson-Capper
- Institute for Cellular Medicine, Newcastle University, Framlington Place, Newcastle NE2 4HH, UK
| | - David J. Elliott
- Institute of Genetic Medicine, Newcastle University, Central Parkway, Newcastle NE1 3BZ, UK
|
67
|
Dikicioglu D, Wood V, Rutherford KM, McDowall MD, Oliver SG. Improving functional annotation for industrial microbes: a case study with Pichia pastoris. Trends Biotechnol 2014; 32:396-9. [PMID: 24929579 PMCID: PMC4111905 DOI: 10.1016/j.tibtech.2014.05.003] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Received: 03/06/2014] [Revised: 05/10/2014] [Accepted: 05/13/2014] [Indexed: 11/29/2022]
Abstract
The current status of the Pichia pastoris genome is shown to lack extensive functional annotation. GO annotation transfer and literature curation pipelines improve the functional annotation of genomes. Pipelines and tools that can improve the annotation status of the genomes of Pichia pastoris and many industrial microbes are considered. Well-annotated genome sequences will facilitate the utilization of these microbes in a broader range of synthetic biology applications.
The research communities studying microbial model organisms, such as Escherichia coli or Saccharomyces cerevisiae, are well served by model organism databases that have extensive functional annotation. However, this is not true of many industrial microbes that are used widely in biotechnology. In this Opinion piece, we use Pichia (Komagataella) pastoris to illustrate the limitations of the available annotation. We consider the resources that can be implemented in the short term both to improve Gene Ontology (GO) annotation coverage based on annotation transfer, and to establish curation pipelines for the literature corpus of this organism.
Affiliation(s)
- Duygu Dikicioglu
- Cambridge Systems Biology Centre & Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road, Cambridge CB2 1GA, UK
| | - Valerie Wood
- Cambridge Systems Biology Centre & Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road, Cambridge CB2 1GA, UK
| | - Kim M Rutherford
- Cambridge Systems Biology Centre & Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road, Cambridge CB2 1GA, UK
| | - Mark D McDowall
- European Molecular Biology Laboratory European Bioinformatics Institute (EMBL-EBI) Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
| | - Stephen G Oliver
- Cambridge Systems Biology Centre & Department of Biochemistry, University of Cambridge, Sanger Building, 80 Tennis Court Road, Cambridge CB2 1GA, UK.
|
68
|
Taboada M, Rodríguez H, Martínez D, Pardo M, Sobrido MJ. Automated semantic annotation of rare disease cases: a case study. Database: The Journal of Biological Databases and Curation 2014; 2014:bau045. [PMID: 24903515 PMCID: PMC4207225 DOI: 10.1093/database/bau045] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Indexed: 11/12/2022]
Abstract
MOTIVATION As the number of clinical reports in the peer-reviewed medical literature keeps growing, there is an increasing need for online search tools to find and analyze publications on patients with similar clinical characteristics. This problem is especially critical and challenging for rare diseases, where publications of large series are scarce. Through an applied example, we illustrate how to automatically identify new relevant cases and semantically annotate the relevant literature about patient case reports to capture the phenotype of a rare disease named cerebrotendinous xanthomatosis. RESULTS Our results confirm that it is possible to automatically identify new relevant case reports with high precision and to annotate them with satisfactory quality (74% F-measure). Automated annotation with an emphasis on describing all phenotypic abnormalities found in a disease may facilitate curation efforts by supplying phenotype retrieval and assessment of their frequency. Availability and Supplementary information: http://www.usc.es/keam/PhenotypeAnnotation/. Database URL: http://www.usc.es/keam/PhenotypeAnnotation/
Affiliation(s)
- Maria Taboada
- Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
| | - Hadriana Rodríguez
- Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
| | - Diego Martínez
- Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
| | - María Pardo
- Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
| | - María Jesús Sobrido
- Department of Electronics & Computer Science, Department of Applied Physics, Campus Vida, University of Santiago de Compostela, Department of Neurology, University Hospital Clinico of Santiago de Compostela and Fundación Pública Galega de Medicina Xenómica-Instituto de Investigación Sanitaria de Santiago (IDIS) and Centro de Investigación Biomédica en red de Enfermedades Raras (CIBERER), Santiago de Compostela, Spain
|
69
|
Valentini G. Hierarchical ensemble methods for protein function prediction. ISRN Bioinformatics 2014; 2014:901419. [PMID: 25937954 PMCID: PMC4393075 DOI: 10.1155/2014/901419] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 02/02/2014] [Accepted: 02/25/2014] [Indexed: 12/11/2022]
Abstract
Protein function prediction is a complex multiclass multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term and then the resulting predictions are assembled in a "consensus" ensemble decision, taking into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Open problems of this exciting research area of computational biology are finally considered, outlining novel perspectives for future research.
Affiliation(s)
- Giorgio Valentini
- AnacletoLab-Dipartimento di Informatica, Università degli Studi di Milano, Via Comelico 39, 20135 Milano, Italy
|
70
|
Grötzinger SW, Alam I, Ba Alawi W, Bajic VB, Stingl U, Eppinger J. Mining a database of single amplified genomes from Red Sea brine pool extremophiles-improving reliability of gene function prediction using a profile and pattern matching algorithm (PPMA). Front Microbiol 2014; 5:134. [PMID: 24778629 PMCID: PMC3985023 DOI: 10.3389/fmicb.2014.00134] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.1] [Received: 01/10/2014] [Accepted: 03/16/2014] [Indexed: 11/13/2022] Open
Abstract
Reliable functional annotation of genomic data is the key-step in the discovery of novel enzymes. Intrinsic sequencing data quality problems of single amplified genomes (SAGs) and poor homology of novel extremophile's genomes pose significant challenges for the attribution of functions to the coding sequences identified. The anoxic deep-sea brine pools of the Red Sea are a promising source of novel enzymes with unique evolutionary adaptation. Sequencing data from Red Sea brine pool cultures and SAGs are annotated and stored in the Integrated Data Warehouse of Microbial Genomes (INDIGO) data warehouse. Low sequence homology of annotated genes (no similarity for 35% of these genes) may translate into false positives when searching for specific functions. The Profile and Pattern Matching (PPM) strategy described here was developed to eliminate false positive annotations of enzyme function before progressing to labor-intensive hyper-saline gene expression and characterization. It utilizes InterPro-derived Gene Ontology (GO)-terms (which represent enzyme function profiles) and annotated relevant PROSITE IDs (which are linked to an amino acid consensus pattern). The PPM algorithm was tested on 15 protein families, which were selected based on scientific and commercial potential. An initial list of 2577 enzyme commission (E.C.) numbers was translated into 171 GO-terms and 49 consensus patterns. A subset of INDIGO-sequences consisting of 58 SAGs from six different taxons of bacteria and archaea were selected from six different brine pool environments. Those SAGs code for 74,516 genes, which were independently scanned for the GO-terms (profile filter) and PROSITE IDs (pattern filter). Following stringent reliability filtering, the non-redundant hits (106 profile hits and 147 pattern hits) are classified as reliable, if at least two relevant descriptors (GO-terms and/or consensus patterns) are present. 
Scripts for annotation, as well as for the PPM algorithm, are available through the INDIGO website.
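The reliability rule stated in this abstract (a hit counts as reliable when at least two relevant descriptors, GO-terms and/or consensus patterns, are present) can be illustrated with a minimal Python sketch. This is not the authors' script from the INDIGO website; the `GeneHits` class, the function name, and the placeholder identifiers are assumptions made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GeneHits:
    """Descriptors found for one gene (hypothetical container, not from the paper)."""
    gene_id: str
    go_terms: set = field(default_factory=set)      # hits from the profile filter
    prosite_ids: set = field(default_factory=set)   # hits from the pattern filter


def is_reliable(hits, relevant_go, relevant_prosite, min_descriptors=2):
    """Return True if the gene carries at least `min_descriptors` relevant descriptors.

    Relevant GO-terms and relevant PROSITE IDs are counted together, mirroring
    the "GO-terms and/or consensus patterns" wording of the abstract.
    """
    n_profile = len(hits.go_terms & relevant_go)
    n_pattern = len(hits.prosite_ids & relevant_prosite)
    return n_profile + n_pattern >= min_descriptors


# Placeholder identifiers, chosen only for the example.
relevant_go = {"GO:A", "GO:B"}
relevant_prosite = {"PS:X"}

gene1 = GeneHits("gene1", go_terms={"GO:A"}, prosite_ids={"PS:X"})
gene2 = GeneHits("gene2", go_terms={"GO:A", "GO:C"})

print(is_reliable(gene1, relevant_go, relevant_prosite))
print(is_reliable(gene2, relevant_go, relevant_prosite))
```

In this toy input, `gene1` has one relevant GO-term plus one relevant pattern, while `gene2` has only a single relevant descriptor.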
Affiliation(s)
- Stefan W Grötzinger
- Division of Physical Sciences and Engineering, KAUST Catalysis Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
- Intikhab Alam
- Division of Biological Sciences and Engineering, Computational Bioscience Research Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
- Wail Ba Alawi
- Division of Biological Sciences and Engineering, Computational Bioscience Research Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
- Vladimir B Bajic
- Division of Biological Sciences and Engineering, Computational Bioscience Research Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
- Ulrich Stingl
- Division of Biological Sciences and Engineering, Red Sea Research Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
- Jörg Eppinger
- Division of Physical Sciences and Engineering, KAUST Catalysis Center, King Abdullah University of Science and Technology Thuwal, Kingdom of Saudi Arabia
|
71
|
Huntley RP, Sawford T, Martin MJ, O'Donovan C. Understanding how and why the Gene Ontology and its annotations evolve: the GO within UniProt. Gigascience 2014; 3:4. [PMID: 24641996 PMCID: PMC3995153 DOI: 10.1186/2047-217x-3-4] [Citation(s) in RCA: 53] [Impact Index Per Article: 4.8] [Received: 11/29/2013] [Accepted: 03/10/2014] [Indexed: 11/01/2022] Open
Abstract
The Gene Ontology Consortium (GOC) is a major bioinformatics project that provides structured controlled vocabularies to classify gene product function and location. GOC members create annotations to gene products using the Gene Ontology (GO) vocabularies, thus providing an extensive, publicly available resource. The GO and its annotations to gene products are now an integral part of functional analysis, and statistical tests using GO data are becoming routine for researchers to include when publishing functional information. While many helpful articles about the GOC are available, there are certain updates to the ontology and annotation sets that sometimes go unobserved. Here we describe some of the ways in which GO can change that should be carefully considered by all users of GO as they may have a significant impact on the resulting gene product annotations, and therefore the functional description of the gene product, or the interpretation of analyses performed on GO datasets. GO annotations for gene products change for many reasons, and while these changes generally improve the accuracy of the representation of the underlying biology, they do not necessarily imply that previous annotations were incorrect. We additionally describe the quality assurance mechanisms we employ to improve the accuracy of annotations, which necessarily changes the composition of the annotation sets we provide. We use the Universal Protein Resource (UniProt) for illustrative purposes of how the GO Consortium, as a whole, manages these changes.
Affiliation(s)
- Rachael P Huntley
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.
|
72
|
Poux S, Magrane M, Arighi CN, Bridge A, O'Donovan C, Laiho K. Expert curation in UniProtKB: a case study on dealing with conflicting and erroneous data. Database: The Journal of Biological Databases and Curation 2014; 2014:bau016. [PMID: 24622611 PMCID: PMC3950660 DOI: 10.1093/database/bau016] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Indexed: 11/13/2022]
Abstract
UniProtKB/Swiss-Prot provides expert curation with information extracted from literature and curator-evaluated computational analysis. As knowledgebases continue to play an increasingly important role in scientific research, a number of studies have evaluated their accuracy and revealed various errors. While some are curation errors, others are the result of incorrect information published in the scientific literature. By taking the example of sirtuin-5, a complex annotation case, we will describe the curation procedure of UniProtKB/Swiss-Prot and detail how we report conflicting information in the database. We will demonstrate the importance of collaboration between resources to ensure curation consistency and the value of contributions from the user community in helping maintain error-free resources. Database URL: www.uniprot.org
Affiliation(s)
- Sylvain Poux
- SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, European Molecular Biology Laboratory (EMBL), European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE 19711, USA and Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street North West, Suite 1200, Washington, DC 20007, USA
|
73
|
Frost HR, Moore JH. Optimization of gene set annotations via entropy minimization over variable clusters (EMVC). Bioinformatics 2014; 30:1698-706. [PMID: 24574114 PMCID: PMC4058919 DOI: 10.1093/bioinformatics/btu110] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 12/17/2022] Open
Abstract
Motivation: Gene set enrichment has become a critical tool for interpreting the results of high-throughput genomic experiments. Inconsistent annotation quality and lack of annotation specificity, however, limit the statistical power of enrichment methods and make it difficult to replicate enrichment results across biologically similar datasets. Results: We propose a novel algorithm for optimizing gene set annotations to best match the structure of specific empirical data sources. Our proposed method, entropy minimization over variable clusters (EMVC), filters the annotations for each gene set to minimize a measure of entropy across disjoint gene clusters computed for a range of cluster sizes over multiple bootstrap resampled datasets. As shown using simulated gene sets with simulated data and Molecular Signatures Database collections with microarray gene expression data, the EMVC algorithm accurately filters annotations unrelated to the experimental outcome, resulting in increased gene set enrichment power and better replication of enrichment results. Availability and implementation: http://cran.r-project.org/web/packages/EMVC/index.html. Contact: jason.h.moore@dartmouth.edu Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- H Robert Frost
- Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
- Jason H Moore
- Departments of Genetics and Community and Family Medicine, Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH 03755, USA
|
74
|
Binder JX, Pletscher-Frankild S, Tsafou K, Stolte C, O'Donoghue SI, Schneider R, Jensen LJ. COMPARTMENTS: unification and visualization of protein subcellular localization evidence. Database: The Journal of Biological Databases and Curation 2014; 2014:bau012. [PMID: 24573882 PMCID: PMC3935310 DOI: 10.1093/database/bau012] [Citation(s) in RCA: 420] [Impact Index Per Article: 38.2] [Indexed: 12/21/2022]
Abstract
Information on protein subcellular localization is important to understand the cellular functions of proteins. Currently, such information is manually curated from the literature, obtained from high-throughput microscopy-based screens and predicted from primary sequence. To get a comprehensive view of the localization of a protein, it is thus necessary to consult multiple databases and prediction tools. To address this, we present the COMPARTMENTS resource, which integrates all sources listed above as well as the results of automatic text mining. The resource is automatically kept up to date with source databases, and all localization evidence is mapped onto common protein identifiers and Gene Ontology terms. To further improve comparability, we assign confidence scores based on the type and source of the localization evidence. Finally, we visualize the unified localization evidence for a protein on a schematic cell to provide a simple overview. Database URL: http://compartments.jensenlab.org
Affiliation(s)
- Janos X Binder
- Structural and Computational Biology Unit, European Molecular Biology Laboratory (EMBL), 69117 Heidelberg, Germany, Bioinformatics Core Facility, Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 4362 Esch-sur-Alzette, Luxembourg, Department of Disease Systems Biology, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, 2200 Copenhagen, Denmark, CSIRO Computational Informatics, Sydney, NSW 2113 Australia and Garvan Institute of Medical Research, Sydney, NSW 2100, Australia
|
75
|
Couto FM, Pinto HS. The next generation of similarity measures that fully explore the semantics in biomedical ontologies. J Bioinform Comput Biol 2013; 11:1371001. [DOI: 10.1142/s0219720013710017] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Indexed: 11/18/2022]
Abstract
There is a prominent trend to augment and improve the formality of biomedical ontologies. For example, this is shown by the current effort on adding description logic axioms, such as disjointness. One of the key ontology applications that can take advantage of this effort is the conceptual (functional) similarity measurement. The presence of description logic axioms in biomedical ontologies makes the current structural or extensional approaches weaker and further away from providing sound semantics-based similarity measures. Although beneficial in small ontologies, the exploration of description logic axioms by semantics-based similarity measures is computationally expensive. This limitation is critical for biomedical ontologies that normally contain thousands of concepts. Thus, in the process of gaining their rightful place, biomedical functional similarity measures have to take the journey of finding how this rich and powerful knowledge can be fully explored while keeping computational costs feasible. This manuscript aims at promoting and guiding the development of compelling tools that deliver what the biomedical community will require in the near future: a next generation of biomedical similarity measures that efficiently and fully explore the semantics present in biomedical ontologies.
Affiliation(s)
- Francisco M. Couto
- Departamento de Informática, Faculdade de Ciências, Universidade de Lisboa, Lisboa 1749-016, Portugal
- H. Sofia Pinto
- INESC-ID, Departamento de Engenharia Informática, Instituto Superior Técnico, Lisboa 1000-029, Portugal
|
76
|
CAFA and the open world of protein function predictions. Trends Genet 2013; 29:609-10. [PMID: 24138813 DOI: 10.1016/j.tig.2013.09.005] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.3] [Received: 09/03/2013] [Accepted: 09/17/2013] [Indexed: 11/22/2022]
|
77
|
Julienne H, Zoufir A, Audit B, Arneodo A. Human genome replication proceeds through four chromatin states. PLoS Comput Biol 2013; 9:e1003233. [PMID: 24130466 PMCID: PMC3794905 DOI: 10.1371/journal.pcbi.1003233] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.9] [Received: 04/26/2013] [Accepted: 08/06/2013] [Indexed: 12/26/2022] Open
Abstract
Advances in genomic studies have led to significant progress in understanding the epigenetically controlled interplay between chromatin structure and nuclear functions. Epigenetic modifications were shown to play a key role in transcription regulation and genome activity during development and differentiation or in response to the environment. Paradoxically, the molecular mechanisms that regulate the initiation and the maintenance of the spatio-temporal replication program in higher eukaryotes, and in particular their links to epigenetic modifications, still remain elusive. By integrative analysis of the genome-wide distributions of thirteen epigenetic marks in the human cell line K562, at the 100 kb resolution of corresponding mean replication timing (MRT) data, we identify four major groups of chromatin marks with shared features. These states have different MRT: from early to late replicating, replication proceeds through a transcriptionally active euchromatin state (C1), a repressive type of chromatin (C2) associated with polycomb complexes, a silent state (C3) not enriched in any available marks, and a gene-poor HP1-associated heterochromatin state (C4). When mapping these chromatin states inside the megabase-sized U-domains (U-shaped MRT profile) covering about 50% of the human genome, we reveal that the associated replication fork polarity gradient corresponds to a directional path across the four chromatin states, from C1 at U-domain borders followed by C2, C3 and C4 at centers. Analysis of the other genome half is consistent with early and late replication loci occurring in separate compartments: the former correspond to gene-rich, high-GC domains of intermingled chromatin states C1 and C2, whereas the latter correspond to gene-poor, low-GC domains of alternating chromatin states C3 and C4 or long C4 domains.
This new segmentation sheds new light on the epigenetic regulation of the spatio-temporal replication program in humans and provides a framework for further studies in different cell types, in both health and disease. Previous studies revealed spatially coherent and biologically meaningful chromatin mark combinations in human cells. Here, we analyze thirteen epigenetic mark maps in the human cell line K562 at 100 kb resolution of MRT data. The complexity of epigenetic data is reduced to four chromatin states that display remarkable similarities with those reported in fly, worm and plants. These states have different MRT: (C1) is transcriptionally active, early replicating, enriched in CTCF; (C2) is Polycomb repressed, mid-S replicating; (C3) lacks marks and replicates late; and (C4) is a late-replicating, gene-poor, HP1-repressed heterochromatin state. When mapping these states inside the 876 replication U-domains of K562, the replication fork polarity gradient observed in these U-domains comes along with a remarkable epigenetic organization from C1 at U-domain borders to C2, C3 and ultimately C4 at centers. The remaining genome half displays early replicating, gene-rich and high-GC domains of intermingled C1 and C2 states segregating from late replicating, gene-poor and low-GC domains of concatenated C3 and/or C4 states. This constitutes the first evidence of epigenetic compartmentalization of the human genome into replication domains likely corresponding to autonomous units in the 3D chromatin architecture.
Affiliation(s)
- Hanna Julienne
- Université de Lyon, Lyon, France
- Laboratoire de Physique, CNRS UMR 5672, Ecole Normale Supérieure de Lyon, Lyon, France
- Azedine Zoufir
- Université de Lyon, Lyon, France
- Laboratoire de Physique, CNRS UMR 5672, Ecole Normale Supérieure de Lyon, Lyon, France
- Benjamin Audit
- Université de Lyon, Lyon, France
- Laboratoire de Physique, CNRS UMR 5672, Ecole Normale Supérieure de Lyon, Lyon, France
- Alain Arneodo
- Université de Lyon, Lyon, France
- Laboratoire de Physique, CNRS UMR 5672, Ecole Normale Supérieure de Lyon, Lyon, France
|
78
|
Mujahid H, Tan F, Zhang J, Nallamilli BRR, Pendarvis K, Peng Z. Nuclear proteome response to cell wall removal in rice (Oryza sativa). Proteome Sci 2013; 11:26. [PMID: 23777608 PMCID: PMC3695858 DOI: 10.1186/1477-5956-11-26] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 02/05/2013] [Accepted: 06/13/2013] [Indexed: 01/31/2023] Open
Abstract
Plant cells are routinely exposed to various pathogens and environmental stresses that cause cell wall perturbations. Little is known of the mechanisms that plant cells use to sense these disturbances and transduce corresponding signals to regulate cellular responses to maintain cell wall integrity. Previous studies in rice have shown that removal of the cell wall leads to substantial chromatin reorganization and histone modification changes concomitant with cell wall re-synthesis. But the genes and proteins that regulate these cellular responses are still largely unknown. Here we present an examination of the nuclear proteome differential expression in response to removal of the cell wall in rice suspension cells using multiple nuclear proteome extraction methods. A total of 382 nuclear proteins were identified with two or more peptides, including 26 transcription factors. Upon removal of the cell wall, 142 nuclear proteins were up regulated and 112 were down regulated. The differentially expressed proteins included transcription factors, histones, histone domain containing proteins, and histone modification enzymes. Gene ontology analysis of the differentially expressed proteins indicates that chromatin & nucleosome assembly, protein-DNA complex assembly, and DNA packaging are tightly associated with cell wall removal. Our results indicate that removal of the cell wall imposes a tremendous challenge to the cells. Consequently, plant cells respond to the removal of the cell wall in the nucleus at every level of the regulatory hierarchy.
Affiliation(s)
- Hana Mujahid
- Department of Biochemistry, Molecular Biology, Entomology and Plant Pathology, Mississippi State University, Starkville, MS 39762, USA.
|
79
|
Primmer CR, Papakostas S, Leder EH, Davis MJ, Ragan MA. Annotated genes and nonannotated genomes: cross-species use of Gene Ontology in ecology and evolution research. Mol Ecol 2013; 22:3216-41. [DOI: 10.1111/mec.12309] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 11/07/2012] [Revised: 02/22/2013] [Accepted: 02/26/2013] [Indexed: 02/01/2023]
Affiliation(s)
- C. R. Primmer
- Department of Biology; University of Turku; 20014 Turku Finland
- S. Papakostas
- Department of Biology; University of Turku; 20014 Turku Finland
- E. H. Leder
- Department of Biology; University of Turku; 20014 Turku Finland
- M. J. Davis
- Institute for Molecular Bioscience; The University of Queensland; Brisbane Qld 4072 Australia
- M. A. Ragan
- Institute for Molecular Bioscience; The University of Queensland; Brisbane Qld 4072 Australia
|
80
|
Wu X, Pang E, Lin K, Pei ZM. Improving the measurement of semantic similarity between gene ontology terms and gene products: insights from an edge- and IC-based hybrid method. PLoS One 2013; 8:e66745. [PMID: 23741529 PMCID: PMC3669204 DOI: 10.1371/journal.pone.0066745] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 02/06/2013] [Accepted: 05/10/2013] [Indexed: 12/30/2022] Open
Abstract
BACKGROUND Explicit comparisons based on the semantic similarity of Gene Ontology terms provide a quantitative way to measure the functional similarity between gene products and are widely applied in large-scale genomic research via integration with other models. Previously, we presented an edge-based method, Relative Specificity Similarity (RSS), which takes the global position of relevant terms into account. However, edge-based semantic similarity metrics are sensitive to the intrinsic structure of GO and simply consider terms at the same level in the ontology to be equally specific nodes, revealing the weaknesses that could be complemented using information content (IC). RESULTS AND CONCLUSIONS Here, we used the IC-based nodes to improve RSS and proposed a new method, Hybrid Relative Specificity Similarity (HRSS). HRSS outperformed other methods in distinguishing true protein-protein interactions from false. HRSS values were divided into four different levels of confidence for protein interactions. In addition, HRSS was statistically the best at obtaining the highest average functional similarity among human-mouse orthologs. Both HRSS and the groupwise measure, simGIC, are superior in correlation with sequence and Pfam similarities. Because different measures are best suited for different circumstances, we compared two pairwise strategies, the maximum and the best-match average, in the evaluation. The former was more effective at inferring physical protein-protein interactions, and the latter at estimating the functional conservation of orthologs and analyzing the CESSM datasets. In conclusion, HRSS can be applied to different biological problems by quantifying the functional similarity between gene products. The algorithm HRSS was implemented in the C programming language, which is freely available from http://cmb.bnu.edu.cn/hrss.
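The two pairwise strategies compared in this abstract, the maximum and the best-match average, can be sketched as follows. This is a generic illustration and not the HRSS implementation: the term-to-term similarity function `sim` is supplied by the caller, the toy similarity table is invented, and the best-match average shown here pools the best matches from both directions, which is one common variant and an assumption about the exact definition used.

```python
def max_strategy(terms_a, terms_b, sim):
    """Gene-product similarity as the single highest term-pair similarity."""
    return max(sim(a, b) for a in terms_a for b in terms_b)


def best_match_average(terms_a, terms_b, sim):
    """Gene-product similarity as the mean of each term's best match.

    Best matches are collected in both directions (A to B and B to A)
    and averaged together.
    """
    best_a = [max(sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


# Toy symmetric similarity table with placeholder term names.
_TOY = {frozenset(("t1", "t3")): 0.8, frozenset(("t2", "t3")): 0.2}


def toy_sim(a, b):
    """Look up an invented similarity; identical terms score 1.0."""
    return 1.0 if a == b else _TOY.get(frozenset((a, b)), 0.0)


print(max_strategy(["t1", "t2"], ["t3"], toy_sim))
print(best_match_average(["t1", "t2"], ["t3"], toy_sim))
```

The maximum rewards a single strongly shared term, whereas the best-match average is pulled down by terms that have no good counterpart.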
Affiliation(s)
- Xiaomei Wu
- College of Life and Environmental Sciences, Hangzhou Normal University, Hangzhou, People's Republic of China.
|
81
|
Wolf YI, Koonin EV. A tight link between orthologs and bidirectional best hits in bacterial and archaeal genomes. Genome Biol Evol 2013; 4:1286-94. [PMID: 23160176 PMCID: PMC3542571 DOI: 10.1093/gbe/evs100] [Citation(s) in RCA: 79] [Impact Index Per Article: 6.6] [Indexed: 11/29/2022] Open
Abstract
Orthologous relationships between genes are routinely inferred from bidirectional best hits (BBH) in pairwise genome comparisons. However, to our knowledge, it has never been quantitatively demonstrated that orthologs form BBH. To test this “BBH-orthology conjecture,” we take advantage of the operon organization of bacterial and archaeal genomes and assume that, when two genes in compared genomes that are flanked by two BBH show statistically significant sequence similarity to one another, these genes are bona fide orthologs. Under this assumption, we tested whether middle genes in “syntenic orthologous gene triplets” form BBH. We found that this was the case in more than 95% of the syntenic gene triplets in all genome comparisons. A detailed examination of the exceptions to this pattern, including maximum likelihood phylogenetic tree analysis, showed that some of these deviations involved artifacts of genome annotation, whereas very small fractions represented random assignment of the best hit to one of closely related in-paralogs, paralogous displacement in situ, or even less frequent genuine violations of the BBH-orthology conjecture caused by acceleration of evolution in one of the orthologs. We conclude that, at least in prokaryotes, genes for which independent evidence of orthology is available typically form BBH and, conversely, BBH can serve as a strong indication of gene orthology.
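For readers unfamiliar with the term, a bidirectional best hit is a gene pair in which each gene is the other's top-scoring match in the opposite genome. The sketch below shows that definition only; it is not the authors' pipeline, the input format (a dictionary of pairwise scores per search direction) is an assumption, and the gene names and scores are invented.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring target.

    `scores` maps (query, target) pairs to a similarity score,
    where a higher score means a better hit.
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Invented scores for two small genomes A and B.
a_vs_b = {("a1", "b1"): 90, ("a1", "b2"): 50, ("a2", "b2"): 70, ("a3", "b1"): 60}
b_vs_a = {("b1", "a1"): 88, ("b2", "a2"): 65, ("b2", "a1"): 40}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here `a3` has `b1` as its best hit, but `b1` prefers `a1`, so the pair is not bidirectional and is left out.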
|
82
|
Clarke EL, Loguercio S, Good BM, Su AI. A task-based approach for Gene Ontology evaluation. J Biomed Semantics 2013; 4 Suppl 1:S4. [PMID: 23734599 PMCID: PMC3633003 DOI: 10.1186/2041-1480-4-s1-s4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 01/08/2023] Open
Abstract
Background The Gene Ontology and its associated annotations are critical tools for interpreting lists of genes. Here, we introduce a method for evaluating the Gene Ontology annotations and structure based on the impact they have on gene set enrichment analysis, along with an example implementation. This task-based approach yields quantitative assessments grounded in experimental data and anchored tightly to the primary use of the annotations. Results Applied to specific areas of biological interest, our framework allowed us to understand the progress of annotation and structural ontology changes from 2004 to 2012. Our framework was also able to determine that the quality of annotations and structure in the area under test have been improving in their ability to recall underlying biological traits. Furthermore, we were able to distinguish between the impact of changes to the annotation sets and ontology structure. Conclusion Our framework and implementation lay the groundwork for a powerful tool in evaluating the usefulness of the Gene Ontology. We demonstrate both the flexibility and the power of this approach in evaluating the current and past state of the Gene Ontology as well as its applicability in developing new methods for creating gene annotations.
|
83
|
Abstract
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
Affiliation(s)
- Robert Rentzsch
- Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany.
|
84
|
Gillis J, Pavlidis P. Assessing identity, redundancy and confounds in Gene Ontology annotations over time. Bioinformatics 2013; 29:476-82. [PMID: 23297035 DOI: 10.1093/bioinformatics/bts727] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Indexed: 11/14/2022]
Abstract
MOTIVATION The Gene Ontology (GO) is heavily used in systems biology, but the potential for redundancy, confounds with other data sources and problems with stability over time have been little explored. RESULTS We report that GO annotations are stable over short periods, with 3% of genes not being most semantically similar to themselves between monthly GO editions. However, we find that genes can alter their 'functional identity' over time, with 20% of genes not matching to themselves (by semantic similarity) after 2 years. We further find that annotation bias in GO, in which some genes are more characterized than others, has declined in yeast, but generally increased in humans. Finally, we discovered that many entries in protein interaction databases are owing to the same published reports that are used for GO annotations, with 66% of assessed GO groups exhibiting this confound. We provide a case study to illustrate how this information can be used in analyses of gene sets and networks. AVAILABILITY Data available at http://chibi.ubc.ca/assessGO.
Affiliation(s)
- Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, 192B Genome Research Center, 500 Sunnyside Boulevard, Woodbury, NY 11797, USA
|
85
|
Phyletic profiling with cliques of orthologs is enhanced by signatures of paralogy relationships. PLoS Comput Biol 2013; 9:e1002852. [PMID: 23308060 PMCID: PMC3536626 DOI: 10.1371/journal.pcbi.1002852] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 07/16/2012] [Accepted: 11/05/2012] [Indexed: 11/19/2022] Open
Abstract
New microbial genomes are sequenced at a high pace, allowing insight into the genetics of not only cultured microbes, but a wide range of metagenomic collections such as the human microbiome. To understand the deluge of genomic data we face, computational approaches for gene functional annotation are invaluable. We introduce a novel model for computational annotation that refines two established concepts: annotation based on homology and annotation based on phyletic profiling. The phyletic profiling-based model that includes both inferred orthologs and paralogs (homologs separated by a speciation and a duplication event, respectively) provides more annotations at the same average Precision than the model that includes only inferred orthologs. For experimental validation, we selected 38 poorly annotated Escherichia coli genes for which the model assigned one of three GO terms with high confidence: involvement in DNA repair, protein translation, or cell wall synthesis. Results of antibiotic stress survival assays on E. coli knockout mutants showed high agreement with our model's estimates of accuracy: out of 38 predictions obtained at the reported Precision of 60%, we confirmed 25 predictions, indicating that our confidence estimates can be used to make informed decisions on experimental validation. Our work will contribute to making experimental validation of computational predictions more approachable, both in cost and time. Our predictions for 998 prokaryotic genomes include ~400000 specific annotations with the estimated Precision of 90%, ~19000 of which are highly specific (e.g. "penicillin binding," "tRNA aminoacylation for protein translation," or "pathogenesis"), and are freely available at http://gorbi.irb.hr/.
|
86
|
Carrascosa MC, Massaguer OL, Mestres J. PharmaTrek: A Semantic Web Explorer for Open Innovation in Multitarget Drug Discovery. Mol Inform 2012; 31:537-541. [PMID: 23548981 PMCID: PMC3573647 DOI: 10.1002/minf.201200070] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 07/05/2012] [Accepted: 07/13/2012] [Indexed: 11/10/2022]
Affiliation(s)
- Maria C Carrascosa
- Research Programme on Biomedical Informatics (GRIB), IMIM Hospital del Mar Research Institute and University Pompeu Fabra , Parc de Recerca Biomèdica, Doctor Aiguader 88, 08003 Barcelona, Catalonia, Spain
|