1
|
James K, Alsobhe A, Cockell SJ, Wipat A, Pocock M. Integration of probabilistic functional networks without an external Gold Standard. BMC Bioinformatics 2022; 23:302. [PMID: 35879662 PMCID: PMC9316706 DOI: 10.1186/s12859-022-04834-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2021] [Accepted: 07/11/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Probabilistic functional integrated networks (PFINs) are designed to aid our understanding of cellular biology and can be used to generate testable hypotheses about protein function. PFINs are generally created by scoring the quality of interaction datasets against a Gold Standard dataset, usually chosen from a separate high-quality data source, prior to their integration. Use of an external Gold Standard has several drawbacks, including data redundancy, data loss and the need for identifier mapping, which can complicate the network build and impact on PFIN performance. Additionally, there typically are no Gold Standard data for non-model organisms. RESULTS We describe the development of an integration technique, ssNet, that scores and integrates both high-throughput and low-throughout data from a single source database in a consistent manner without the need for an external Gold Standard dataset. Using data from Saccharomyces cerevisiae we show that ssNet is easier and faster, overcoming the challenges of data redundancy, Gold Standard bias and ID mapping. In addition ssNet results in less loss of data and produces a more complete network. CONCLUSIONS The ssNet method allows PFINs to be built successfully from a single database, while producing comparable network performance to networks scored using an external Gold Standard source and with reduced data loss.
Collapse
Affiliation(s)
- Katherine James
- Department of Applied Sciences, Northumbria University, Sandyford Rd, Newcastle upon Tyne, NE1 8ST, UK. .,Interdisciplinary Computing and Complex BioSystems Group, Newcastle University, Science Square, Newcastle upon Tyne, NE4 5TG, UK.
| | - Aoesha Alsobhe
- Interdisciplinary Computing and Complex BioSystems Group, Newcastle University, Science Square, Newcastle upon Tyne, NE4 5TG, UK.,Saudi Electronic University, Abi Bakr As Siddiq Branch Rd, Riyadh, 1332, Saudi Arabia
| | - Simon J Cockell
- School of Biomedical, Nutritional and Sports Science, Faculty of Medical Sciences, Newcastle University, Framlington Place, Newcastle upon Tyne, NE2 4HH, UK
| | - Anil Wipat
- Interdisciplinary Computing and Complex BioSystems Group, Newcastle University, Science Square, Newcastle upon Tyne, NE4 5TG, UK
| | - Matthew Pocock
- Interdisciplinary Computing and Complex BioSystems Group, Newcastle University, Science Square, Newcastle upon Tyne, NE4 5TG, UK
| |
Collapse
|
2
|
OUP accepted manuscript. Brief Funct Genomics 2022; 21:243-269. [DOI: 10.1093/bfgp/elac007] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2021] [Revised: 03/17/2022] [Accepted: 03/18/2022] [Indexed: 11/14/2022] Open
|
3
|
Nazarieh M, Hoeppner M, Helms V. Identification of Biomarkers Controlling Cell Fate In Blood Cell Development. FRONTIERS IN BIOINFORMATICS 2021; 1:653054. [PMID: 36303754 PMCID: PMC9581055 DOI: 10.3389/fbinf.2021.653054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2021] [Accepted: 07/01/2021] [Indexed: 11/13/2022] Open
Abstract
A blood cell lineage consists of several consecutive developmental stages starting from the pluri- or multipotent stem cell to a state of terminal differentiation. Despite their importance for human biology, the regulatory pathways and gene networks that govern these differentiation processes are not yet fully understood. This is in part due to challenges associated with delineating the interactions between transcription factors (TFs) and their corresponding target genes. A possible step forward in this case is provided by the increasing amount of expression data, as a basis for linking differentiation stages and gene activities. Here, we present a novel hierarchical approach to identify characteristic expression peak patterns that global regulators excert along the differentiation path of cell lineages. Based on such simple patterns, we identified cell state-specific marker genes and extracted TFs that likely drive their differentiation. Integration of the mean expression values of stage-specific “key player” genes yielded a distinct peaking pattern for each lineage that was used to identify further genes in the dataset which behave similarly. Incorporating the set of TFs that regulate these genes led to a set of stage-specific regulators that control the biological process of cell fate. As proof of concept, we considered two expression datasets covering key differentiation events in blood cell formation of mice.
Collapse
Affiliation(s)
- Maryam Nazarieh
- Institute of Clinical Molecular Biology, Christian-Albrecht-University of Kiel, Kiel, Germany
- Center for Bioinformatics, Saarland University, Saarbruecken, Germany
| | - Marc Hoeppner
- Institute of Clinical Molecular Biology, Christian-Albrecht-University of Kiel, Kiel, Germany
| | - Volkhard Helms
- Center for Bioinformatics, Saarland University, Saarbruecken, Germany
- *Correspondence: Volkhard Helms,
| |
Collapse
|
4
|
Faisal A, Peltonen J, Georgii E, Rung J, Kaski S. Toward computational cumulative biology by combining models of biological datasets. PLoS One 2014; 9:e113053. [PMID: 25427176 PMCID: PMC4245117 DOI: 10.1371/journal.pone.0113053] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2014] [Accepted: 10/17/2014] [Indexed: 11/21/2022] Open
Abstract
A main challenge of data-driven sciences is how to make maximal use of the progressively expanding databases of experimental datasets in order to keep research cumulative. We introduce the idea of a modeling-based dataset retrieval engine designed for relating a researcher's experimental dataset to earlier work in the field. The search is (i) data-driven to enable new findings, going beyond the state of the art of keyword searches in annotations, (ii) modeling-driven, to include both biological knowledge and insights learned from data, and (iii) scalable, as it is accomplished without building one unified grand model of all data. Assuming each dataset has been modeled beforehand, by the researchers or automatically by database managers, we apply a rapidly computable and optimizable combination model to decompose a new dataset into contributions from earlier relevant models. By using the data-driven decomposition, we identify a network of interrelated datasets from a large annotated human gene expression atlas. While tissue type and disease were major driving forces for determining relevant datasets, the found relationships were richer, and the model-based search was more accurate than the keyword search; moreover, it recovered biologically meaningful relationships that are not straightforwardly visible from annotations—for instance, between cells in different developmental stages such as thymocytes and T-cells. Data-driven links and citations matched to a large extent; the data-driven links even uncovered corrections to the publication data, as two of the most linked datasets were not highly cited and turned out to have wrong publication entries in the database.
Collapse
Affiliation(s)
- Ali Faisal
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Jaakko Peltonen
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Elisabeth Georgii
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
| | - Johan Rung
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, United Kingdom
| | - Samuel Kaski
- Helsinki Institute for Information Technology HIIT, Department of Information and Computer Science, Aalto University, Espoo, Finland
- Helsinki Institute for Information Technology HIIT, Department of Computer Science, University of Helsinki, Helsinki, Finland
- * E-mail:
| |
Collapse
|
5
|
Christadore LM, Pham L, Kolaczyk ED, Schaus SE. Improvement of experimental testing and network training conditions with genome-wide microarrays for more accurate predictions of drug gene targets. BMC SYSTEMS BIOLOGY 2014; 8:7. [PMID: 24444313 PMCID: PMC3911882 DOI: 10.1186/1752-0509-8-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/29/2012] [Accepted: 11/21/2013] [Indexed: 11/10/2022]
Abstract
Background Genome-wide microarrays have been useful for predicting chemical-genetic interactions at the gene level. However, interpreting genome-wide microarray results can be overwhelming due to the vast output of gene expression data combined with off-target transcriptional responses many times induced by a drug treatment. This study demonstrates how experimental and computational methods can interact with each other, to arrive at more accurate predictions of drug-induced perturbations. We present a two-stage strategy that links microarray experimental testing and network training conditions to predict gene perturbations for a drug with a known mechanism of action in a well-studied organism. Results S. cerevisiae cells were treated with the antifungal, fluconazole, and expression profiling was conducted under different biological conditions using Affymetrix genome-wide microarrays. Transcripts were filtered with a formal network-based method, sparse simultaneous equation models and Lasso regression (SSEM-Lasso), under different network training conditions. Gene expression results were evaluated using both gene set and single gene target analyses, and the drug’s transcriptional effects were narrowed first by pathway and then by individual genes. Variables included: (i) Testing conditions – exposure time and concentration and (ii) Network training conditions – training compendium modifications. Two analyses of SSEM-Lasso output – gene set and single gene – were conducted to gain a better understanding of how SSEM-Lasso predicts perturbation targets. Conclusions This study demonstrates that genome-wide microarrays can be optimized using a two-stage strategy for a more in-depth understanding of how a cell manifests biological reactions to a drug treatment at the transcription level. Additionally, a more detailed understanding of how the statistical model, SSEM-Lasso, propagates perturbations through a network of gene regulatory interactions is achieved.
Collapse
Affiliation(s)
| | | | | | - Scott E Schaus
- Department of Chemistry, Boston University, Boston, MA, USA.
| |
Collapse
|
6
|
Abstract
Whole-genome sequencing, particularly in fungi, has progressed at a tremendous rate. More difficult, however, is experimental testing of the inferences about gene function that can be drawn from comparative sequence analysis alone. We present a genome-wide functional characterization of a sequenced but experimentally understudied budding yeast, Saccharomyces bayanus var. uvarum (henceforth referred to as S. bayanus), allowing us to map changes over the 20 million years that separate this organism from S. cerevisiae. We first created a suite of genetic tools to facilitate work in S. bayanus. Next, we measured the gene-expression response of S. bayanus to a diverse set of perturbations optimized using a computational approach to cover a diverse array of functionally relevant biological responses. The resulting data set reveals that gene-expression patterns are largely conserved, but significant changes may exist in regulatory networks such as carbohydrate utilization and meiosis. In addition to regulatory changes, our approach identified gene functions that have diverged. The functions of genes in core pathways are highly conserved, but we observed many changes in which genes are involved in osmotic stress, peroxisome biogenesis, and autophagy. A surprising number of genes specific to S. bayanus respond to oxidative stress, suggesting the organism may have evolved under different selection pressures than S. cerevisiae. This work expands the scope of genome-scale evolutionary studies from sequence-based analysis to rapid experimental characterization and could be adopted for functional mapping in any lineage of interest. Furthermore, our detailed characterization of S. bayanus provides a valuable resource for comparative functional genomics studies in yeast.
Collapse
|
7
|
Georgii E, Salojärvi J, Brosché M, Kangasjärvi J, Kaski S. Targeted retrieval of gene expression measurements using regulatory models. Bioinformatics 2012; 28:2349-56. [DOI: 10.1093/bioinformatics/bts361] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/21/2023] Open
|
8
|
Park J, Costanzo MC, Balakrishnan R, Cherry JM, Hong EL. CvManGO, a method for leveraging computational predictions to improve literature-based Gene Ontology annotations. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2012; 2012:bas001. [PMID: 22434836 PMCID: PMC3308158 DOI: 10.1093/database/bas001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/18/2023]
Abstract
The set of annotations at the Saccharomyces Genome Database (SGD) that classifies the cellular function of S. cerevisiae gene products using Gene Ontology (GO) terms has become an important resource for facilitating experimental analysis. In addition to capturing and summarizing experimental results, the structured nature of GO annotations allows for functional comparison across organisms as well as propagation of functional predictions between related gene products. Due to their relevance to many areas of research, ensuring the accuracy and quality of these annotations is a priority at SGD. GO annotations are assigned either manually, by biocurators extracting experimental evidence from the scientific literature, or through automated methods that leverage computational algorithms to predict functional information. Here, we discuss the relationship between literature-based and computationally predicted GO annotations in SGD and extend a strategy whereby comparison of these two types of annotation identifies genes whose annotations need review. Our method, CvManGO (Computational versus Manual GO annotations), pairs literature-based GO annotations with computational GO predictions and evaluates the relationship of the two terms within GO, looking for instances of discrepancy. We found that this method will identify genes that require annotation updates, taking an important step towards finding ways to prioritize literature review. Additionally, we explored factors that may influence the effectiveness of CvManGO in identifying relevant gene targets to find in particular those genes that are missing literature-supported annotations, but our survey found that there are no immediately identifiable criteria by which one could enrich for these under-annotated genes. Finally, we discuss possible ways to improve this strategy, and the applicability of this method to other projects that use the GO for curation. Database URL:http://www.yeastgenome.org
Collapse
Affiliation(s)
- Julie Park
- Department of Genetics, Stanford University, Stanford, CA 94305-5120, USA
| | | | | | | | | |
Collapse
|
9
|
Costanzo MC, Park J, Balakrishnan R, Cherry JM, Hong EL. Using computational predictions to improve literature-based Gene Ontology annotations: a feasibility study. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2011; 2011:bar004. [PMID: 21411447 PMCID: PMC3067894 DOI: 10.1093/database/bar004] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
Abstract
Annotation using Gene Ontology (GO) terms is one of the most important ways in which biological information about specific gene products can be expressed in a searchable, computable form that may be compared across genomes and organisms. Because literature-based GO annotations are often used to propagate functional predictions between related proteins, their accuracy is critically important. We present a strategy that employs a comparison of literature-based annotations with computational predictions to identify and prioritize genes whose annotations need review. Using this method, we show that comparison of manually assigned ‘unknown’ annotations in the Saccharomyces Genome Database (SGD) with InterPro-based predictions can identify annotations that need to be updated. A survey of literature-based annotations and computational predictions made by the Gene Ontology Annotation (GOA) project at the European Bioinformatics Institute (EBI) across several other databases shows that this comparison strategy could be used to maintain and improve the quality of GO annotations for other organisms besides yeast. The survey also shows that although GOA-assigned predictions are the most comprehensive source of functional information for many genomes, a large proportion of genes in a variety of different organisms entirely lack these predictions but do have manual annotations. This underscores the critical need for manually performed, literature-based curation to provide functional information about genes that are outside the scope of widely used computational methods. Thus, the combination of manual and computational methods is essential to provide the most accurate and complete functional annotation of a genome. Database URL:http://www.yeastgenome.org
Collapse
Affiliation(s)
- Maria C Costanzo
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305-5120, USA
| | | | | | | | | |
Collapse
|
10
|
Huttenhower C, Mutungu KT, Indik N, Yang W, Schroeder M, Forman JJ, Troyanskaya OG, Coller HA. Detailing regulatory networks through large scale data integration. Bioinformatics 2009; 25:3267-74. [PMID: 19825796 DOI: 10.1093/bioinformatics/btp588] [Citation(s) in RCA: 47] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Much of a cell's regulatory response to changing environments occurs at the transcriptional level. Particularly in higher organisms, transcription factors (TFs), microRNAs and epigenetic modifications can combine to form a complex regulatory network. Part of this system can be modeled as a collection of regulatory modules: co-regulated genes, the conditions under which they are co-regulated and sequence-level regulatory motifs. RESULTS We present the Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction (COALESCE) system for regulatory module prediction. The algorithm is efficient enough to discover expression biclusters and putative regulatory motifs in metazoan genomes (>20,000 genes) and very large microarray compendia (>10,000 conditions). Using Bayesian data integration, it can also include diverse supporting data types such as evolutionary conservation or nucleosome placement. We validate its performance using a functional evaluation of co-clustered genes, known yeast and Escherichea coli TF targets, synthetic data and various metazoan data compendia. In all cases, COALESCE performs as well or better than current biclustering and motif prediction tools, with high accuracy in functional and TF/target assignments and zero false positives on synthetic data. COALESCE provides an efficient and flexible platform within which large, diverse data collections can be integrated to predict metazoan regulatory networks. AVAILABILITY Source code (C++) is available at http://function.princeton.edu/sleipnir, and supporting data and a web interface are provided at http://function.princeton.edu/coalesce. CONTACT ogt@cs.princeton.edu; hcoller@princeton.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA
| | | | | | | | | | | | | | | |
Collapse
|
11
|
Christie KR, Hong EL, Cherry JM. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol 2009; 17:286-94. [PMID: 19577472 PMCID: PMC3057094 DOI: 10.1016/j.tim.2009.04.005] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2008] [Revised: 04/20/2009] [Accepted: 04/24/2009] [Indexed: 11/27/2022]
Abstract
The quest to characterize each of the genes of the yeast Saccharomyces cerevisiae has propelled the development and application of novel high-throughput (HTP) experimental techniques. To handle the enormous amount of information generated by these techniques, new bioinformatics tools and resources are needed. Gene Ontology (GO) annotations curated by the Saccharomyces Genome Database (SGD) have facilitated the development of algorithms that analyze HTP data and help predict functions for poorly characterized genes in S. cerevisiae and other organisms. Here, we describe how published results are incorporated into GO annotations at SGD and why researchers can benefit from using these resources wisely to analyze their HTP data and predict gene functions.
Collapse
Affiliation(s)
- Karen R Christie
- Department of Genetics, Stanford University Medical School, Stanford, CA 94305-5120, USA
| | | | | |
Collapse
|
12
|
Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG. Exploring the human genome with functional maps. Genes Dev 2009; 19:1093-106. [PMID: 19246570 PMCID: PMC2694471 DOI: 10.1101/gr.082214.108] [Citation(s) in RCA: 147] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2008] [Accepted: 02/09/2009] [Indexed: 11/24/2022]
Abstract
Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it from a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from approximately 30,000 genome-scale experiments pertaining to approximately 25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each data set. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp (Human Experimental/Functional Mapper), a web interface allowing interactive visualization and investigation of this large body of information.
Collapse
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
| | - Erin M. Haley
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA
| | | | - Vanessa Dumeaux
- Institute of Community Medicine, Tromsø University, Tromsø, Norway
| | - Daniel R. Barrett
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
| | - Hilary A. Coller
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA
| | - Olga G. Troyanskaya
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
| |
Collapse
|