1
Feuermann M, Mi H, Gaudet P, Muruganujan A, Lewis SE, Ebert D, Mushayahama T, Thomas PD. A compendium of human gene functions derived from evolutionary modelling. Nature 2025; 640:146-154. [PMID: 40011791 PMCID: PMC11964926 DOI: 10.1038/s41586-025-08592-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2023] [Accepted: 01/03/2025] [Indexed: 02/28/2025]
Abstract
A comprehensive, computable representation of the functional repertoire of all macromolecules encoded within the human genome is a foundational resource for biology and biomedical research. The Gene Ontology Consortium has been working towards this goal by generating a structured body of information about gene functions, which now includes experimental findings reported in more than 175,000 publications for human genes and genes in experimentally tractable model organisms [1,2]. Here, we describe the results of a large, international effort to integrate all of these findings to create a representation of human gene functions that is as complete and accurate as possible. Specifically, we apply an expert-curated, explicit evolutionary modelling approach to all human protein-coding genes. This approach integrates available experimental information across families of related genes into models that reconstruct the gain and loss of functional characteristics over evolutionary time. The models and the resulting set of 68,667 integrated gene functions cover approximately 82% of human protein-coding genes. The functional repertoire reveals a marked preponderance of molecular regulatory functions, and the models provide insights into the evolutionary origins of human gene functions. We show that our set of descriptions of functions can improve the widely used genomic technique of Gene Ontology enrichment analysis. The experimental evidence for each functional characteristic is recorded, thereby enabling the scientific community to help review and improve the resource, which we have made publicly available.
Affiliation(s)
- Marc Feuermann
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Huaiyu Mi
  - Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California Los Angeles, Los Angeles, CA, USA
- Pascale Gaudet
  - Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, Geneva, Switzerland
- Anushya Muruganujan
  - Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California Los Angeles, Los Angeles, CA, USA
- Suzanna E Lewis
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Dustin Ebert
  - Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California Los Angeles, Los Angeles, CA, USA
- Tremayne Mushayahama
  - Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California Los Angeles, Los Angeles, CA, USA
- Paul D Thomas
  - Division of Bioinformatics, Department of Population and Public Health Sciences, University of Southern California Los Angeles, Los Angeles, CA, USA
2
Chen J, Goudey B, Geard N, Verspoor K. Integration of background knowledge for automatic detection of inconsistencies in gene ontology annotation. Bioinformatics 2024; 40:i390-i400. [PMID: 38940182 PMCID: PMC11256942 DOI: 10.1093/bioinformatics/btae246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GO annotations based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. In addition, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process. RESULTS We have extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We have proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies. This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We establish a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge. AVAILABILITY AND IMPLEMENTATION https://github.com/jiyuc/de-inconsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
  - Data61, The Commonwealth Scientific and Industrial Research Organisation, Marsfield 2122, NSW, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville 3010, VIC, Australia
- Karin Verspoor
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3000, Australia
3
Gay SM, Chartampila E, Lord JS, Grizzard S, Maisashvili T, Ye M, Barker NK, Mordant AL, Mills CA, Herring LE, Diering GH. Developing forebrain synapses are uniquely vulnerable to sleep loss. bioRxiv: the preprint server for biology 2024:2023.11.06.565853. [PMID: 37986967 PMCID: PMC10659326 DOI: 10.1101/2023.11.06.565853] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
Sleep is an essential behavior that supports lifelong brain health and cognition. Neuronal synapses are a major target for restorative sleep function and a locus of dysfunction in response to sleep deprivation (SD). Synapse density is highly dynamic during development, becoming stabilized with maturation to adulthood, suggesting sleep exerts distinct synaptic functions between development and adulthood. Importantly, problems with sleep are common in neurodevelopmental disorders including autism spectrum disorder (ASD). Moreover, early life sleep disruption in animal models causes long lasting changes in adult behavior. Different plasticity engaged during sleep necessarily implies that developing and adult synapses will show differential vulnerability to SD. To investigate distinct sleep functions and mechanisms of vulnerability to SD across development, we systematically examined the behavioral and molecular responses to acute SD between juvenile (P21-28), adolescent (P42-49) and adult (P70-100) mice of both sexes. Compared to adults, juveniles lack robust adaptations to SD, precipitating cognitive deficits in the novel object recognition test. Subcellular fractionation, combined with proteome and phosphoproteome analysis revealed the developing synapse is profoundly vulnerable to SD, whereas adults exhibit comparative resilience. SD in juveniles, and not older mice, aberrantly drives induction of synapse potentiation, synaptogenesis, and expression of peri-neuronal nets. Our analysis further reveals the developing synapse as a convergent node between vulnerability to SD and ASD genetic risk. Together, our systematic analysis supports a distinct developmental function of sleep and reveals how sleep disruption impacts key aspects of brain development, providing mechanistic insights for ASD susceptibility.
4
Feuermann M, Gaudet P. Interpreting Gene Ontology Annotations Derived from Sequence Homology Methods. Methods Mol Biol 2024; 2836:285-298. [PMID: 38995546 DOI: 10.1007/978-1-0716-4007-4_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/13/2024]
Abstract
The Gene Ontology (GO) project describes the functions of the gene products of organisms from all kingdoms of life in a standardized way, enabling powerful analyses of experiments involving genome-wide analysis. The scientific literature is used to convert experimental results into GO annotations that systematically classify gene products' functions. However, to address the fact that only a minor fraction of all genes has been characterized experimentally, multiple predictive methods to assign GO annotations have been developed since the inception of GO. Sequence homologies between novel genes and genes with known functions help to approximate the roles of these non-characterized genes. Here we describe the main sequence homology methods to produce annotations: pairwise comparison (BLAST), protein profile models (InterPro), and phylogenetic-based annotation (PAINT). Some of these methods can be implemented with genome analysis pipelines (BLAST and InterPro2GO), while PAINT is curated by the GO consortium.
Affiliation(s)
- Marc Feuermann
  - SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Pascale Gaudet
  - SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
5
Deng CH, Naithani S, Kumari S, Cobo-Simón I, Quezada-Rodríguez EH, Skrabisova M, Gladman N, Correll MJ, Sikiru AB, Afuwape OO, Marrano A, Rebollo I, Zhang W, Jung S. Genotype and phenotype data standardization, utilization and integration in the big data era for agricultural sciences. Database (Oxford) 2023; 2023:baad088. [PMID: 38079567 PMCID: PMC10712715 DOI: 10.1093/database/baad088] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/13/2023] [Revised: 10/17/2023] [Accepted: 11/28/2023] [Indexed: 12/18/2023]
Abstract
Large-scale genotype and phenotype data have been increasingly generated to identify genetic markers, understand gene function and evolution and facilitate genomic selection. These datasets hold immense value for both current and future studies, as they are vital for crop breeding, yield improvement and overall agricultural sustainability. However, integrating these datasets from heterogeneous sources presents significant challenges and hinders their effective utilization. We established the Genotype-Phenotype Working Group in November 2021 as a part of the AgBioData Consortium (https://www.agbiodata.org) to review current data types and resources that support archiving, analysis and visualization of genotype and phenotype data to understand the needs and challenges of the plant genomic research community. For 2021-22, we identified different types of datasets and examined metadata annotations related to experimental design/methods/sample collection, etc. Furthermore, we thoroughly reviewed publicly funded repositories for raw and processed data as well as secondary databases and knowledgebases that enable the integration of heterogeneous data in the context of the genome browser, pathway networks and tissue-specific gene expression. Based on our survey, we recommend a need for (i) additional infrastructural support for archiving many new data types, (ii) development of community standards for data annotation and formatting, (iii) resources for biocuration and (iv) analysis and visualization tools to connect genotype data with phenotype data to enhance knowledge synthesis and to foster translational research. Although this paper only covers the data and resources relevant to the plant research community, we expect that similar issues and needs are shared by researchers working on animals. Database URL: https://www.agbiodata.org.
Affiliation(s)
- Cecilia H Deng
  - Molecular and Digital Breeding, New Cultivar Innovation, The New Zealand Institute for Plant and Food Research Limited, 120 Mt Albert Road, Auckland 1025, New Zealand
- Sushma Naithani
  - Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
- Sunita Kumari
  - Cold Spring Harbor Laboratory, 1 Bungtown Rd, Cold Spring Harbor, New York, NY 11724, USA
- Irene Cobo-Simón
  - Department of Ecology and Evolutionary Biology, University of Connecticut, Storrs, CT, USA
  - Institute of Forest Science (ICIFOR-INIA, CSIC), Madrid, Spain
- Elsa H Quezada-Rodríguez
  - Departamento de Producción Agrícola y Animal, Universidad Autónoma Metropolitana-Xochimilco, Ciudad de México, México
  - Centro de Ciencias de la Complejidad, Universidad Nacional Autónoma de México, Ciudad de México, México
- Maria Skrabisova
  - Department of Biochemistry, Faculty of Science, Palacky University, Olomouc, Czech Republic
- Nick Gladman
  - Cold Spring Harbor Laboratory, 1 Bungtown Rd, Cold Spring Harbor, New York, NY 11724, USA
  - U.S. Department of Agriculture-Agricultural Research Service, NEA Robert W. Holley Center for Agriculture and Health, Cornell University, Ithaca, NY 14853, USA
- Melanie J Correll
  - Agricultural and Biological Engineering Department, University of Florida, 1741 Museum Rd, Gainesville, FL 32611, USA
- Annarita Marrano
  - Phoenix Bioinformatics, 39899 Balentine Drive, Suite 200, Newark, CA 94560, USA
- Wentao Zhang
  - National Research Council Canada, 110 Gymnasium Pl, Saskatoon, Saskatchewan S7N 0W9, Canada
- Sook Jung
  - Department of Horticulture, Washington State University, 303c Plant Sciences Building, Pullman, WA 99164-6414, USA
6
Chen J, Goudey B, Zobel J, Geard N, Verspoor K. Exploring automatic inconsistency detection for literature-based gene ontology annotation. Bioinformatics 2022; 38:i273-i281. [PMID: 35758780 PMCID: PMC9235499 DOI: 10.1093/bioinformatics/btac230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/08/2022] [Indexed: 11/12/2022] Open
Abstract
Motivation Literature-based gene ontology annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in the primary literature. Assurance of the quality of GOA is crucial for supporting biological research. However, a range of different kinds of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at record level. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance. This article presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection. Results We have created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detect inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in the context of several specific application settings are reported. This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows. The data underlying this article are available on GitHub at https://github.com/jiyuc/AutoGOAConsistency.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Benjamin Goudey
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Justin Zobel
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Nicholas Geard
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville, VIC 3010, Australia
  - School of Computer Technologies, RMIT University, Melbourne, VIC 3000, Australia
7
Chen J, Geard N, Zobel J, Verspoor K. Automatic consistency assurance for literature-based gene ontology annotation. BMC Bioinformatics 2021; 22:565. [PMID: 34823464 PMCID: PMC8620237 DOI: 10.1186/s12859-021-04479-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/12/2021] [Accepted: 11/15/2021] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Literature-based gene ontology (GO) annotation is a process where expert curators use uniform expressions to describe gene functions reported in research papers, creating computable representations of information about biological systems. Manual assurance of consistency between GO annotations and the associated evidence texts identified by expert curators is reliable but time-consuming, and is infeasible in the context of rapidly growing biological literature. A key challenge is maintaining consistency of existing GO annotations as new studies are published and the GO vocabulary is updated. RESULTS In this work, we introduce a formalisation of biological database annotation inconsistencies, identifying four distinct types of inconsistency. We propose a novel and efficient method using state-of-the-art text mining models to automatically distinguish between consistent GO annotation and the different types of inconsistent GO annotation. We evaluate this method using a synthetic dataset generated by directed manipulation of instances in an existing corpus, BC4GO. We provide detailed error analysis for demonstrating that the method achieves high precision on more confident predictions. CONCLUSIONS Two models built using our method for distinct annotation consistency identification tasks achieved high precision and were robust to updates in the GO vocabulary. Our approach demonstrates clear value for human-in-the-loop curation scenarios.
Affiliation(s)
- Jiyu Chen
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Nicholas Geard
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010, Australia
  - School of Computing Technologies, RMIT University, Melbourne, VIC, 3000, Australia
8
Gaudet P, Logie C, Lovering RC, Kuiper M, Lægreid A, Thomas PD. Gene Ontology representation for transcription factor functions. Biochimica et Biophysica Acta - Gene Regulatory Mechanisms 2021; 1864:194752. [PMID: 34461313 DOI: 10.1016/j.bbagrm.2021.194752] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/06/2021] [Revised: 08/24/2021] [Accepted: 08/25/2021] [Indexed: 12/31/2022]
Abstract
Transcription plays a central role in defining the identity and functionalities of cells, as well as in their responses to changes in the cellular environment. The Gene Ontology (GO) provides a rigorously defined set of concepts that describe the functions of gene products. A GO annotation is a statement about the function of a particular gene product, represented as an association between a gene product and the biological concept a GO term defines. Critically, each GO annotation is based on traceable scientific evidence. Here, we describe the different GO terms that are associated with proteins involved in transcription and its regulation, focusing on the standard of evidence required to support these associations. This article is intended to help users of GO annotations understand how to interpret the annotations and can contribute to the consistency of GO annotations. We distinguish between three classes of activities involved in transcription or directly regulating it - general transcription factors, DNA-binding transcription factors, and transcription co-regulators.
Affiliation(s)
- Pascale Gaudet
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, 1 Rue Michel-Servet, 1211 Genève, Switzerland
- Colin Logie
  - Molecular Biology Department, Faculty of Science, Radboud University, PO box 9101, 6500HB Nijmegen, the Netherlands
- Ruth C Lovering
  - Functional Gene Annotation, Preclinical and Fundamental Science, UCL Institute of Cardiovascular Science, University College London, London, UK
- Martin Kuiper
  - Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
- Astrid Lægreid
  - Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Paul D Thomas
  - Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, CA, USA
9
Chen Y, Verbeek FJ, Wolstencroft K. Establishing a consensus for the hallmarks of cancer based on gene ontology and pathway annotations. BMC Bioinformatics 2021; 22:178. [PMID: 33823788 PMCID: PMC8025515 DOI: 10.1186/s12859-021-04105-8] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 12/01/2020] [Accepted: 03/22/2021] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND The hallmarks of cancer provide a highly cited and well-used conceptual framework for describing the processes involved in cancer cell development and tumourigenesis. However, methods for translating these high-level concepts into data-level associations between hallmarks and genes (for high throughput analysis), vary widely between studies. The examination of different strategies to associate and map cancer hallmarks reveals significant differences, but also consensus. RESULTS Here we present the results of a comparative analysis of cancer hallmark mapping strategies, based on Gene Ontology and biological pathway annotation, from different studies. By analysing the semantic similarity between annotations, and the resulting gene set overlap, we identify emerging consensus knowledge. In addition, we analyse the differences between hallmark and gene set associations using Weighted Gene Co-expression Network Analysis and enrichment analysis. CONCLUSIONS Reaching a community-wide consensus on how to identify cancer hallmark activity from research data would enable more systematic data integration and comparison between studies. These results highlight the current state of the consensus and offer a starting point for further convergence. In addition, we show how a lack of consensus can lead to large differences in the biological interpretation of downstream analyses and discuss the challenges of annotating changing and accumulating biological data, using intermediate knowledge resources that are also changing over time.
Affiliation(s)
- Yi Chen
  - The Leiden Institute of Advanced Computer Science (LIACS), Snellius Gebouw, Niels Bohrweg 1, Leiden, The Netherlands
- Fons J. Verbeek
  - The Leiden Institute of Advanced Computer Science (LIACS), Snellius Gebouw, Niels Bohrweg 1, Leiden, The Netherlands
- Katherine Wolstencroft
  - The Leiden Institute of Advanced Computer Science (LIACS), Snellius Gebouw, Niels Bohrweg 1, Leiden, The Netherlands
10
Whaley P, Edwards SW, Kraft A, Nyhan K, Shapiro A, Watford S, Wattam S, Wolffe T, Angrish M. Knowledge Organization Systems for Systematic Chemical Assessments. Environmental Health Perspectives 2020; 128:125001. [PMID: 33356525 PMCID: PMC7759237 DOI: 10.1289/ehp6994] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 03/03/2020] [Revised: 12/02/2020] [Accepted: 12/04/2020] [Indexed: 05/04/2023]
Abstract
BACKGROUND Although the implementation of systematic review and evidence mapping methods stands to improve the transparency and accuracy of chemical assessments, they also accentuate the challenges that assessors face in ensuring they have located and included all the evidence that is relevant to evaluating the potential health effects an exposure might be causing. This challenge of information retrieval can be characterized in terms of "semantic" and "conceptual" factors that render chemical assessments vulnerable to the streetlight effect. OBJECTIVES This commentary presents how controlled vocabularies, thesauruses, and ontologies contribute to overcoming the streetlight effect in information retrieval, making up the key components of Knowledge Organization Systems (KOSs) that enable more systematic access to assessment-relevant information than is currently achievable. The concept of Adverse Outcome Pathways is used to illustrate what a general KOS for use in chemical assessment could look like. DISCUSSION Ontologies are an underexploited element of effective knowledge organization in the environmental health sciences. Agreeing on and implementing ontologies in chemical assessment is a complex but tractable process with four fundamental steps. Successful implementation of ontologies would not only make currently fragmented information about health risks from chemical exposures vastly more accessible, it could ultimately enable computational methods for chemical assessment that can take advantage of the full richness of data described in natural language in primary studies. https://doi.org/10.1289/EHP6994.
Affiliation(s)
- Paul Whaley
  - Evidence Based Toxicology Collaboration, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
  - Lancaster Environment Centre, Lancaster University, Lancaster, UK
- Stephen W. Edwards
  - GenOmics, Bioinformatics, and Translational Research Center, RTI International, Research Triangle Park, North Carolina, USA
- Andrew Kraft
  - Chemical Pollutant Assessment Division, Center for Public Health and Environmental Assessment, U.S. Environmental Protection Agency (U.S. EPA), Washington, DC, USA
- Kate Nyhan
  - Environmental Health Sciences, Yale School of Public Health and Harvey Cushing/John Hay Whitney Medical Library, Yale University, New Haven, Connecticut, USA
- Andrew Shapiro
  - Chemical Pollutant Assessment Division, Center for Public Health and Environmental Assessment, U.S. Environmental Protection Agency (U.S. EPA), Washington, DC, USA
- Sean Watford
  - National Center for Computational Toxicology, U.S. EPA, Durham, North Carolina, USA
- Steve Wattam
  - WAP Academy Consultancy Ltd, Thirsk, Yorkshire, UK
- Taylor Wolffe
  - Lancaster Environment Centre, Lancaster University, Lancaster, UK
- Michelle Angrish
  - Chemical Pollutant Assessment Division, Center for Public Health and Environmental Assessment, U.S. EPA, Durham, North Carolina, USA
11
Wood V, Carbon S, Harris MA, Lock A, Engel SR, Hill DP, Van Auken K, Attrill H, Feuermann M, Gaudet P, Lovering RC, Poux S, Rutherford KM, Mungall CJ. Term Matrix: a novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns. Open Biol 2020; 10:200149. [PMID: 32875947 PMCID: PMC7536087 DOI: 10.1098/rsob.200149] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 06/22/2020] [Accepted: 08/06/2020] [Indexed: 12/11/2022] Open
Abstract
Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes probably reflects errors in literature curation, ontology structure or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 52 700 automatically propagated annotations across all taxa.
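The rule-based check described in this abstract can be illustrated with a minimal sketch. It is not the Term Matrix implementation: the function name, the gene identifiers and the "translation" term are hypothetical, and the sketch compares term labels directly, ignoring the ontology hierarchy that a real workflow would need to take into account. Only the example rule (amino acid metabolism versus cytokinesis) comes from the abstract.

```python
def find_coannotation_violations(annotations, exclusive_pairs):
    """Flag genes annotated to both terms of a 'mutually exclusive' rule.

    annotations: dict mapping a gene id to the set of process terms it carries.
    exclusive_pairs: list of (term_a, term_b) rules (hypothetical rule format).
    Returns a list of (gene, term_a, term_b) tuples, sorted by gene id.
    """
    violations = []
    for gene in sorted(annotations):
        terms = annotations[gene]
        for term_a, term_b in exclusive_pairs:
            # A rule fires only when the same gene carries both terms.
            if term_a in terms and term_b in terms:
                violations.append((gene, term_a, term_b))
    return violations


# Hypothetical toy data; only the rule itself is taken from the abstract.
annotations = {
    "geneA": {"amino acid metabolism", "translation"},
    "geneB": {"amino acid metabolism", "cytokinesis"},
    "geneC": {"cytokinesis"},
}
exclusive_pairs = [("amino acid metabolism", "cytokinesis")]

violations = find_coannotation_violations(annotations, exclusive_pairs)
print(violations)
```

In this toy data only geneB carries both terms of the rule, so it is the only gene flagged for curator review.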
Collapse
Affiliation(s)
- Valerie Wood
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Seth Carbon
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Midori A. Harris
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Antonia Lock
  - Department of Genetics, Evolution and Environment, University College London, London WC1E 6B, UK
- Stacia R. Engel
  - Department of Genetics, Stanford University, Palo Alto, CA 94304-5477, USA
- David P. Hill
  - Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Kimberly Van Auken
  - Division of Biology and Biological Engineering, California Institute of Technology, 1200 East California Boulevard, Pasadena, CA 91125, USA
- Helen Attrill
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK
- Marc Feuermann
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Pascale Gaudet
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Ruth C. Lovering
  - Functional Gene Annotation, Preclinical and Fundamental Science, Institute of Cardiovascular Science, University College London, London WC1E 6JF, UK
- Sylvain Poux
  - Swiss Institute of Bioinformatics, 1 Michel-Servet, 1204 Geneva, Switzerland
- Kim M. Rutherford
  - Cambridge Systems Biology Centre, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
  - Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK
- Christopher J. Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
|
12
|
Quality Matters: Biocuration Experts on the Impact of Duplication and Other Data Quality Issues in Biological Databases. Genomics, Proteomics & Bioinformatics 2020; 18:91-103. [PMID: 32652120 PMCID: PMC7646089 DOI: 10.1016/j.gpb.2018.11.006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 12/08/2017] [Revised: 10/24/2018] [Accepted: 12/14/2018] [Indexed: 11/27/2022]
|
13
|
Peng Y, Jiang Y, Radivojac P. Enumerating consistent sub-graphs of directed acyclic graphs: an insight into biomedical ontologies. Bioinformatics 2019; 34:i313-i322. [PMID: 29949985 PMCID: PMC6022688 DOI: 10.1093/bioinformatics/bty268] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/29/2022] Open
Abstract
Motivation: Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontologies can be seen as directed acyclic graphs (DAGs), where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent sub-graph; that is, a sub-graph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent sub-graphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology. Results: We propose an algorithm for enumerating consistent sub-graphs of DAGs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation. Availability and implementation: https://github.com/shawn-peng/counting-consistent-sub-DAG Supplementary information: Supplementary data are available at Bioinformatics online.
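To make the notion of a consistent sub-graph concrete, the following Python sketch counts them in two simple ways. It is an illustration under stated assumptions, not the algorithm from the paper: `count_tree` covers only the rooted-tree base case mentioned in the abstract, `count_brute_force` is an exponential check suitable only for tiny graphs, and all function and node names are hypothetical.

```python
from itertools import combinations


def count_tree(children, root):
    """Count consistent sub-graphs containing `root` in a rooted tree.

    Each child subtree is either left out (1 way) or included together
    with its own root, so the counts multiply across children.
    """
    total = 1
    for child in children.get(root, []):
        total *= 1 + count_tree(children, child)
    return total


def count_brute_force(parents):
    """Count non-empty node sets in which every member's parents are present."""
    nodes = sorted(parents)
    count = 0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            if all(set(parents[n]) <= chosen for n in chosen):
                count += 1
    return count


# A small tree, written once as child lists and once as parent lists.
children = {"root": ["a", "b"], "a": ["c"]}
parents = {"root": [], "a": ["root"], "b": ["root"], "c": ["a"]}
tree_count = count_tree(children, "root")
brute_count = count_brute_force(parents)

# A diamond-shaped DAG, where node "d" has two parents.
diamond = {"r": [], "a": ["r"], "b": ["r"], "d": ["a", "b"]}
diamond_count = count_brute_force(diamond)
```

On the tree, both functions agree because every non-empty consistent sub-graph must contain the root. The diamond shows why a tree recursion alone is not enough for DAGs: "d" may only be included when both of its parents are.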
Affiliation(s)
- Yisu Peng
  - Department of Computer Science, Indiana University, Bloomington, USA
- Yuxiang Jiang
  - Department of Computer Science, Indiana University, Bloomington, USA
- Predrag Radivojac
  - Department of Computer Science, Indiana University, Bloomington, USA
|
14
|
Attrill H, Gaudet P, Huntley RP, Lovering RC, Engel SR, Poux S, Van Auken KM, Georghiou G, Chibucos MC, Berardini TZ, Wood V, Drabkin H, Fey P, Garmiri P, Harris MA, Sawford T, Reiser L, Tauber R, Toro S. Annotation of gene product function from high-throughput studies using the Gene Ontology. Database: The Journal of Biological Databases and Curation 2019; 2019:5304975. [PMID: 30715275 PMCID: PMC6355445 DOI: 10.1093/database/baz007] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 10/30/2018] [Accepted: 01/08/2019] [Indexed: 11/17/2022]
Abstract
High-throughput studies constitute an essential and valued source of information for researchers. However, high-throughput experimental workflows are often complex, with multiple data sets that may contain large numbers of false positives. The representation of high-throughput data in the Gene Ontology (GO) therefore presents a challenging annotation problem, when the overarching goal of GO curation is to provide the most precise view of a gene's role in biology. To address this, representatives from annotation teams within the GO Consortium reviewed high-throughput data annotation practices. We present an annotation framework for high-throughput studies that will facilitate good standards in GO curation and, through the use of new high-throughput evidence codes, increase the visibility of these annotations to the research community.
Affiliation(s)
- Helen Attrill
  - FlyBase, Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, UK
- Pascale Gaudet
  - CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, rue Michel Servet, CH Geneva, Switzerland
- Rachael P Huntley
  - Institute of Cardiovascular Science, University College London, London, UK
- Ruth C Lovering
  - Institute of Cardiovascular Science, University College London, London, UK
- Stacia R Engel
  - Saccharomyces Genome Database, Department of Genetics, Stanford University, Porter Drive, Palo Alto, CA, USA
- Sylvain Poux
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, rue Michel Servet, CH Geneva, Switzerland
- Kimberly M Van Auken
  - WormBase, Division of Biology and Biological Engineering, California Institute of Technology, E California Blvd, Pasadena, CA, USA
- George Georghiou
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Marcus C Chibucos
  - Evidence and Conclusion Ontology, University of Maryland School of Medicine, W Baltimore St., Baltimore, MD, USA
- Tanya Z Berardini
  - The Arabidopsis Information Resource, Phoenix Bioinformatics, Redwood City, CA, USA
- Valerie Wood
  - PomBase, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Sanger Building, Tennis Court Road, Cambridge, UK
- Harold Drabkin
  - Mouse Genome Informatics, Department of Computational Biology and Bioinformatics, The Jackson Laboratory, Main St., Bar Harbor, ME, USA
- Petra Fey
  - dictyBase, Biomedical Informatics Center and Center for Genetic Medicine, Northwestern University, Feinberg School of Medicine, North Lake Shore Drive, Chicago, IL, USA
- Penelope Garmiri
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Midori A Harris
  - PomBase, Cambridge Systems Biology Centre and Department of Biochemistry, University of Cambridge, Sanger Building, Tennis Court Road, Cambridge, UK
- Tony Sawford
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
- Leonore Reiser
  - The Arabidopsis Information Resource, Phoenix Bioinformatics, Redwood City, CA, USA
- Rebecca Tauber
  - Evidence and Conclusion Ontology, University of Maryland School of Medicine, W Baltimore St., Baltimore, MD, USA
- Sabrina Toro
  - Zebrafish Information Network, University of Oregon, Eugene, OR, USA
|
15
|
Braun DM, Chung I, Kepper N, Deeg KI, Rippe K. TelNet - a database for human and yeast genes involved in telomere maintenance. BMC Genet 2018; 19:32. [PMID: 29776332 PMCID: PMC5960154 DOI: 10.1186/s12863-018-0617-8] [Citation(s) in RCA: 31] [Impact Index Per Article: 4.4] [Received: 01/07/2018] [Accepted: 04/30/2018] [Indexed: 02/05/2023] Open
Abstract
Background: The ends of linear chromosomes, the telomeres, comprise repetitive DNA sequences in complex with proteins that protect them from being processed by the DNA repair machinery. To proliferate without limit, cancer cells need to counteract the shortening of telomere repeats during replication, either by reactivating the reverse transcriptase telomerase or by using the alternative lengthening of telomeres (ALT) pathway. The different telomere maintenance (TM) mechanisms appear to involve hundreds of proteins, but their activities related to telomere repeat length are only partly understood. Currently, a database that integrates information on TM-relevant genes is missing. Description: To provide a resource for studies that dissect TM features, we here introduce the TelNet database at http://www.cancertelsys.org/telnet/. It offers a comprehensive compilation of more than 2000 human and 1100 yeast genes linked to telomere maintenance. These genes were annotated in terms of TM mechanism, associated specific functions and orthologous genes, a TM significance score and information from peer-reviewed literature. This TM information can be retrieved via different search and view modes and evaluated for a set of genes, as demonstrated for an exemplary application. Conclusion: TelNet supports the annotation of genes identified from bioinformatics analysis pipelines to reveal possible connections with TM networks. We anticipate that TelNet will be a helpful resource for researchers who study telomeres. Electronic supplementary material: The online version of this article (10.1186/s12863-018-0617-8) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Delia M Braun
  - Division of Chromatin Networks, German Cancer Research Center (DKFZ) & Bioquant, 69120, Heidelberg, Germany
- Inn Chung
  - Division of Chromatin Networks, German Cancer Research Center (DKFZ) & Bioquant, 69120, Heidelberg, Germany
- Nick Kepper
  - Division of Chromatin Networks, German Cancer Research Center (DKFZ) & Bioquant, 69120, Heidelberg, Germany
- Katharina I Deeg
  - Division of Chromatin Networks, German Cancer Research Center (DKFZ) & Bioquant, 69120, Heidelberg, Germany
- Karsten Rippe
  - Division of Chromatin Networks, German Cancer Research Center (DKFZ) & Bioquant, 69120, Heidelberg, Germany
|
16
|
Abstract
The Gene Ontology Consortium (GOC) produces a wealth of resources widely used throughout the scientific community. In this chapter, we discuss the different ways in which researchers can access the resources of the GOC. We share details about the mechanics of obtaining GO annotations, both by manually browsing, querying, and downloading data from the GO website, and by computationally accessing the resources from the command line, including the ability to restrict the data being retrieved to subsets with only certain attributes.
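As a rough illustration of restricting annotations to a subset with certain attributes, the Python sketch below filters tab-delimited annotation lines by evidence code and ontology aspect. It assumes the GO Annotation File (GAF) 2.x column order (object ID, symbol, GO ID, evidence code and aspect in columns 2, 3, 5, 7 and 9) and skips `!` comment lines. The function name `filter_gaf` and the sample rows are hypothetical, and the rows are truncated to nine columns for brevity, whereas real GAF lines have more.

```python
import io

# 0-based positions of the GAF 2.x columns used here (assumed layout).
COL_ID, COL_SYMBOL, COL_GO, COL_EVIDENCE, COL_ASPECT = 1, 2, 4, 6, 8


def filter_gaf(handle, evidence_codes=None, aspect=None):
    """Yield (id, symbol, go_id, evidence, aspect) for lines passing the filters."""
    for line in handle:
        if line.startswith("!") or not line.strip():
            continue  # header/comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= COL_ASPECT:
            continue  # too short to interpret
        if evidence_codes is not None and fields[COL_EVIDENCE] not in evidence_codes:
            continue
        if aspect is not None and fields[COL_ASPECT] != aspect:
            continue
        yield (fields[COL_ID], fields[COL_SYMBOL], fields[COL_GO],
               fields[COL_EVIDENCE], fields[COL_ASPECT])


# Made-up sample rows; identifiers and references are placeholders.
SAMPLE = "\n".join([
    "!gaf-version: 2.2",
    "\t".join(["ExampleDB", "X0001", "abc1", "", "GO:0000001", "REF:placeholder", "IDA", "", "P"]),
    "\t".join(["ExampleDB", "X0002", "abc2", "", "GO:0000002", "REF:placeholder", "IEA", "", "P"]),
    "\t".join(["ExampleDB", "X0003", "abc3", "", "GO:0000003", "REF:placeholder", "IDA", "", "F"]),
]) + "\n"

experimental_process = list(
    filter_gaf(io.StringIO(SAMPLE), evidence_codes={"IDA"}, aspect="P")
)
```

The same generator can be pointed at an open file handle for a downloaded annotation file, which avoids loading the whole file into memory.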
|
17
|
Chibucos MC, Siegele DA, Hu JC, Giglio M. The Evidence and Conclusion Ontology (ECO): Supporting GO Annotations. Methods Mol Biol 2017; 1446:245-259. [PMID: 27812948 DOI: 10.1007/978-1-4939-3743-1_18] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Indexed: 12/04/2022]
Abstract
The Evidence and Conclusion Ontology (ECO) is a community resource for describing the various types of evidence that are generated during the course of a scientific study and which are typically used to support assertions made by researchers. ECO describes multiple evidence types, including evidence resulting from experimental (i.e., wet lab) techniques, evidence arising from computational methods, statements made by authors (whether or not supported by evidence), and inferences drawn by researchers curating the literature. In addition to summarizing the evidence that supports a particular assertion, ECO also offers a means to document whether a computer or a human performed the process of making the annotation. Incorporating ECO into an annotation system makes it possible to leverage the structure of the ontology such that associated data can be grouped hierarchically, users can select data associated with particular evidence types, and quality control pipelines can be optimized. Today, over 30 resources, including the Gene Ontology, use the Evidence and Conclusion Ontology to represent both evidence and how annotations are made.
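The idea of selecting data by evidence type through the ontology's hierarchy can be sketched in a few lines of Python. The hierarchy below is a simplified toy and should not be read as the actual ECO structure; the annotation tuples and the names `is_a` and `select_by_evidence` are likewise hypothetical.

```python
# Toy is_a hierarchy (child -> parent); simplified, not the actual ECO structure.
TOY_PARENT = {
    "evidence": None,
    "experimental evidence": "evidence",
    "direct assay evidence": "experimental evidence",
    "mutant phenotype evidence": "experimental evidence",
    "similarity evidence": "evidence",
    "sequence similarity evidence": "similarity evidence",
}


def is_a(term, ancestor, parent=TOY_PARENT):
    """Return True if `term` equals `ancestor` or descends from it."""
    while term is not None:
        if term == ancestor:
            return True
        term = parent.get(term)
    return False


def select_by_evidence(annotations, evidence_type):
    """Keep annotations whose evidence class falls under `evidence_type`."""
    return [a for a in annotations if is_a(a[2], evidence_type)]


# Made-up annotations: (gene, annotated term, evidence class).
annotations = [
    ("geneA", "term1", "direct assay evidence"),
    ("geneB", "term2", "sequence similarity evidence"),
    ("geneC", "term3", "mutant phenotype evidence"),
]
experimental = select_by_evidence(annotations, "experimental evidence")
```

Querying for a general class returns annotations tagged with any of its more specific subclasses, which is the hierarchical grouping the abstract refers to.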
Affiliation(s)
- Marcus C Chibucos
  - Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, 801 W. Baltimore Street, Baltimore, MD, 21201, USA
- Deborah A Siegele
  - Department of Biology, Texas A&M University, College Station, TX, 77843, USA
- James C Hu
  - Department of Biochemistry and Biophysics, Texas A&M University and Texas AgriLife Research, College Station, TX, 77843, USA
- Michelle Giglio
  - Department of Medicine, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA
|
18
|
Abstract
The Gene Ontology (GO) project is the largest resource for cataloguing gene function. The combination of solid conceptual underpinnings and a practical set of features has made the GO a widely adopted resource in the research community and an essential resource for data analysis. In this chapter, we provide a concise primer for all users of the GO. We briefly introduce the structure of the ontology and explain how to interpret annotations associated with the GO.
Affiliation(s)
- Pascale Gaudet
  - CALIPHO group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Michel-Servet, 1211, Geneva, Switzerland
  - Department of Human Protein Sciences, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Nives Škunca
  - Department of Computer Science, ETH Zurich, Universitätstrasse 19, 8092, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, Universitätstr. 19, 8092, Zurich, Switzerland
  - University College London, Gower St, London, WC1E 6BT, UK
- James C Hu
  - Department of Biochemistry and Biophysics, Texas A&M University and Texas AgriLife Research, College Station, TX, USA
- Christophe Dessimoz
  - Department of Genetics, Evolution & Environment, University College London, Gower St, London, WC1E 6BT, UK
  - Swiss Institute of Bioinformatics, Biophore, 1015, Lausanne, Switzerland
  - Department of Ecology and Evolution, University of Lausanne, Street Biophore, 1015, Lausanne, Switzerland
  - Center of Integrative Genomics, University of Lausanne, Biophore, 1015, Lausanne, Switzerland
  - Department of Computer Science, University College London, Gower St, Lausanne, WC1E 6BT, UK
|
19
|
Abstract
The overarching goal of the Gene Ontology (GO) Consortium is to provide researchers in biology and biomedicine with all current functional information concerning genes and the cellular context in which these functions occur. When the GO was started in the 1990s, surprisingly little attention had been given to how functional information about genes was to be uniformly captured, structured in a computable form, and made accessible to biologists. Because knowledge of gene, protein, ncRNA, and molecular complex roles is continuously accumulating and changing, the GO needed to be a dynamic resource, accurately tracking ongoing research results over time. Here I describe the progress that has been made over the years towards this goal, and the work that still remains to be done for the GO Consortium to realize its goal of offering the most comprehensive and up-to-date resource for information on gene function.
Affiliation(s)
- Suzanna E Lewis
  - Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, 94720, USA
|