51
He F, Chen H, Probst-Kepper M, Geffers R, Eifes S, Del Sol A, Schughart K, Zeng AP, Balling R. PLAU inferred from a correlation network is critical for suppressor function of regulatory T cells. Mol Syst Biol 2013; 8:624. [PMID: 23169000 PMCID: PMC3531908 DOI: 10.1038/msb.2012.56] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.8] [Received: 07/19/2012] [Accepted: 10/05/2012] [Indexed: 02/07/2023] Open
Abstract
Human FOXP3(+)CD25(+)CD4(+) regulatory T cells (Tregs) are essential to the maintenance of immune homeostasis. Several genes are known to be important for murine Tregs, but for human Tregs the genes and underlying molecular networks controlling the suppressor function still largely remain unclear. Here, we describe a strategy to identify the key genes directly from an undirected correlation network which we reconstruct from a very high time-resolution (HTR) transcriptome during the activation of human Tregs/CD4(+) T-effector cells. We show that a predicted top-ranked new key gene PLAU (the plasminogen activator urokinase) is important for the suppressor function of both human and murine Tregs. Further analysis unveils that PLAU is particularly important for memory Tregs and that PLAU mediates Treg suppressor function via STAT5 and ERK signaling pathways. Our study demonstrates the potential for identifying novel key genes for complex dynamic biological processes using a network strategy based on HTR data, and reveals a critical role for PLAU in Treg suppressor function.
Affiliation(s)
- Feng He
- Department of Infection Genetics, Helmholtz Centre for Infection Research (HZI), University of Veterinary Medicine Hannover, Braunschweig, Germany
52
Abstract
Modern experimental strategies often generate genome-scale measurements of human tissues or cell lines in various physiological states. Investigators often use these datasets individually to help elucidate molecular mechanisms of human diseases. Here we discuss approaches that effectively weight and integrate hundreds of heterogeneous datasets to gene-gene networks that focus on a specific process or disease. Diverse and systematic genome-scale measurements provide such approaches both a great deal of power and a number of challenges. We discuss some such challenges as well as methods to address them. We also raise important considerations for the assessment and evaluation of such approaches. When carefully applied, these integrative data-driven methods can make novel high-quality predictions that can transform our understanding of the molecular-basis of human disease.
Affiliation(s)
- Casey S. Greene
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- * E-mail:
53
Abstract
Advanced statistical methods used to analyze high-throughput data such as gene-expression assays result in long lists of “significant genes.” One way to gain insight into the significance of altered expression levels is to determine whether Gene Ontology (GO) terms associated with a particular biological process, molecular function, or cellular component are over- or under-represented in the set of genes deemed significant. This process, referred to as enrichment analysis, profiles a gene set and is widely used to make sense of the results of high-throughput experiments. The canonical example of enrichment analysis is when the output dataset is a list of genes differentially expressed in some condition. To determine the biological relevance of a lengthy gene list, the usual solution is to perform enrichment analysis with the GO. We can aggregate the annotating GO concepts for each gene in this list, and arrive at a profile of the biological processes or mechanisms affected by the condition under study. While GO has been the principal target for enrichment analysis, the methods of enrichment analysis are generalizable. We can conduct the same sort of profiling along other ontologies of interest. Just as scientists can ask “Which biological process is over-represented in my set of interesting genes or proteins?” we can also ask “Which disease (or class of diseases) is over-represented in my set of interesting genes or proteins?” For example, by annotating known protein mutations with disease terms from the ontologies in BioPortal, Mort et al. recently identified a class of diseases (blood coagulation disorders) that were associated with a 14-fold depletion in substitutions at O-linked glycosylation sites. With the availability of tools for automatic annotation of datasets with terms from disease ontologies, there is no reason to restrict enrichment analyses to the GO.
In this chapter, we will discuss methods to perform enrichment analysis using any ontology available in the biomedical domain. We will review the general methodology of enrichment analysis, the associated challenges, and discuss the novel translational analyses enabled by the existence of public, national computational infrastructure and by the use of disease ontologies in such analyses.
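The over-representation question described in this abstract is commonly answered with a one-sided hypergeometric test. The sketch below works under that assumption and is not the chapter's own procedure: the function names, the term labels, and the gene identifiers are all hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_selected, n_overlap):
    """Return P(X >= n_overlap) when n_selected genes are drawn without
    replacement from n_universe genes, n_annotated of which carry the term."""
    total = comb(n_universe, n_selected)
    upper = min(n_annotated, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def term_enrichment(gene_list, annotations, universe):
    """Map each ontology term to its over-representation p-value.

    annotations: dict of term -> set of genes annotated with that term.
    universe: set of all genes that could have appeared in gene_list.
    """
    universe = set(universe)
    selected = set(gene_list) & universe
    p_values = {}
    for term, genes in annotations.items():
        annotated = set(genes) & universe
        overlap = len(selected & annotated)
        p_values[term] = hypergeom_enrichment_p(
            len(universe), len(annotated), len(selected), overlap
        )
    return p_values


# Toy data (hypothetical): 20 background genes, two disease terms.
universe = {f"g{i}" for i in range(1, 21)}
annotations = {
    "coagulation disorder": {"g1", "g2", "g3", "g4"},
    "unrelated term": {"g10", "g11", "g12", "g13"},
}
gene_list = ["g1", "g2", "g3", "g15"]

for term, p in term_enrichment(gene_list, annotations, universe).items():
    print(f"{term}: p = {p:.4g}")
```

The same function applies to any ontology, since it only needs a mapping from terms to annotated genes. The choice of background universe strongly affects the p-values, so it should match the set of genes that could actually have been selected.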
Affiliation(s)
- Nigam H Shah
- Center for Biomedical Informatics Research, Stanford University, Stanford, California, United States of America.
54
Guan Y, Gorenshteyn D, Burmeister M, Wong AK, Schimenti JC, Handel MA, Bult CJ, Hibbs MA, Troyanskaya OG. Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol 2012; 8:e1002694. [PMID: 23028291 PMCID: PMC3459891 DOI: 10.1371/journal.pcbi.1002694] [Citation(s) in RCA: 88] [Impact Index Per Article: 6.8] [Received: 02/01/2012] [Accepted: 08/02/2012] [Indexed: 12/16/2022] Open
Abstract
Integrated analyses of functional genomics data have enormous potential for identifying phenotype-associated genes. Tissue-specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. Accounting for tissue specificity in global integration of functional genomics data is challenging, as “functionality” and “functional relationships” are often not resolved for specific tissue types. We address this challenge by generating tissue-specific functional networks, which can effectively represent the diversity of protein function for more accurate identification of phenotype-associated genes in the laboratory mouse. Specifically, we created 107 tissue-specific functional relationship networks through integration of genomic data utilizing knowledge of tissue-specific gene expression patterns. Cross-network comparison revealed significantly changed genes enriched for functions related to specific tissue development. We then utilized these tissue-specific networks to predict genes associated with different phenotypes. Our results demonstrate that prediction performance is significantly improved through using the tissue-specific networks as compared to the global functional network. We used a testis-specific functional relationship network to predict genes associated with male fertility and spermatogenesis phenotypes, and experimentally confirmed one top prediction, Mybl1. We then focused on a less-common genetic disease, ataxia, and identified candidates uniquely predicted by the cerebellum network, which are supported by both literature and experimental evidence. Our systems-level, tissue-specific scheme advances over traditional global integration and analyses and establishes a prototype to address the tissue-specific effects of genetic perturbations, diseases and drugs.
Tissue specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. We propose an effective strategy to model tissue-specific functional relationship networks in the laboratory mouse. We integrated large-scale genomics datasets as well as low-throughput tissue-specific expression profiles to estimate the probability that two proteins are co-functioning in the tissue under study. These networks can accurately reflect the diversity of protein functions across different organs and tissue compartments. By computationally exploring the tissue-specific networks, we can accurately predict novel phenotype-related gene candidates. We experimentally confirmed that a top candidate gene, Mybl1, predicted from male-reproductive system-specific networks, affects several male fertility phenotypes, and we predicted candidates related to a rare genetic disease, ataxia, which are supported by experimental and literature evidence. The above results demonstrate the power of modeling tissue-specific dynamics of co-functionality through computational approaches.
Affiliation(s)
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Dmitriy Gorenshteyn
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Margit Burmeister
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Molecular & Behavioral Neuroscience Institution, Department of Psychiatry, and Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Aaron K. Wong
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- John C. Schimenti
- Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, New York, United States of America
- Mary Ann Handel
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Carol J. Bult
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Matthew A. Hibbs
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Trinity University, Computer Science Department, San Antonio, Texas, United States of America
- * E-mail: (MAH); (OGT)
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (MAH); (OGT)
55
Klomp JA, Furge KA. Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis. BMC Res Notes 2012; 5:370. [PMID: 22824328 PMCID: PMC3599284 DOI: 10.1186/1756-0500-5-370] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 02/02/2012] [Accepted: 07/23/2012] [Indexed: 11/10/2022] Open
Abstract
Background High-throughput methods that ascribe a cellular or physiological function for each gene product are useful to understand the roles of genes that have not been extensively characterized by molecular or genetic approaches. One method to infer gene function is "guilt-by-association", in which the expression pattern of a poorly characterized gene is shown to co-vary with the expression of better-characterized genes. The function of the poorly characterized gene is inferred from the known function(s) of the well-described genes. For example, genes co-expressed with transcripts that vary during the cell cycle, development, environmental stresses, and with oncogenesis have been implicated in those processes. Findings While examining the expression characteristics of several poorly characterized genes, we noted that we could associate each of the genes with a cellular phenotype by correlating individual gene expression changes with gene set enrichment scores from individual samples. We evaluated the effectiveness of this approach using a modest sized gene expression data set (expO) and a compendium of gene expression phenotypes (MSigDBv3.0). We found the transcripts that correlated best with enrichment in mitochondrial and lysosomal gene sets were mostly related to those processes (89/100 and 44/50, respectively). The reciprocal evaluation, ranking gene sets according to correlation of enrichment with an individual gene’s expression, also reflected known associations for prominent genes in the biomedical literature (16/19). In evaluating the model, we also found that 4% of the genome encodes proteins that are associated with small molecule and small peptide signal transduction gene sets, implicating a large number of genes in both internal and external environmental sensing. Conclusions Our results show that this approach is useful to infer functions of disparate sets of genes. 
This method mirrors the biological experimental approaches used by others to associate individual genes with defined gene expression changes. Moreover, the approach can be used beyond discovering genes related to a cellular process to discover meaningful expression phenotypes from a compendium that are associated with a given gene. The effectiveness, versatility, and breadth of this approach make possible its application in a variety of contexts and with a variety of downstream analyses.
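The reciprocal evaluation described in this abstract (ranking gene sets by how well their per-sample enrichment scores track one gene's expression) can be illustrated with a plain Pearson correlation. This is a minimal sketch, assuming the enrichment scores have already been computed per sample by some other tool; the function names, gene-set labels, and numbers are hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_gene_sets(gene_expression, set_scores):
    """Rank gene sets by correlation with one gene's expression.

    gene_expression: expression of the gene of interest, one value per sample.
    set_scores: dict of gene-set name -> per-sample enrichment scores,
                in the same sample order as gene_expression.
    Returns (name, correlation) pairs, most positively correlated first.
    """
    correlations = {
        name: pearson(gene_expression, scores)
        for name, scores in set_scores.items()
    }
    return sorted(correlations.items(), key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): one poorly characterized gene across five samples.
gene_expression = [1.0, 2.1, 2.9, 4.2, 5.0]
set_scores = {
    "mitochondrial": [0.2, 0.4, 0.5, 0.9, 1.0],
    "lysosomal": [0.9, 0.7, 0.8, 0.3, 0.1],
    "cell cycle": [0.5, 0.1, 0.6, 0.2, 0.5],
}

for name, r in rank_gene_sets(gene_expression, set_scores):
    print(f"{name}: r = {r:+.2f}")
```

Swapping the roles of the inputs (one gene set's scores against every gene's expression) gives the forward direction, that is, the transcripts that correlate best with enrichment in a given gene set.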
Affiliation(s)
- Jeff A Klomp
- Center for Cancer Genomics and Computational Biology, Van Andel Research Institute, Grand Rapids, MI, USA
56
Abstract
Through domestication, humans have substantially altered the morphology of Zea mays ssp. parviglumis (teosinte) into the currently recognizable maize. This system serves as a model for studying adaptation, genome evolution, and the genetics and evolution of complex traits. To examine how domestication has reshaped the transcriptome of maize seedlings, we used expression profiling of 18,242 genes for 38 diverse maize genotypes and 24 teosinte genotypes. We detected evidence for more than 600 genes having significantly different expression levels in maize compared with teosinte. Moreover, more than 1,100 genes showed significantly altered coexpression profiles, reflective of substantial rewiring of the transcriptome since domestication. The genes with altered expression show a significant enrichment for genes previously identified through population genetic analyses as likely targets of selection during maize domestication and improvement; 46 genes previously identified as putative targets of selection also exhibit altered expression levels and coexpression relationships. We also identified 45 genes with altered, primarily higher, expression in inbred relative to outcrossed teosinte. These genes are enriched for functions related to biotic stress and may reflect responses to the effects of inbreeding. This study not only documents alterations in the maize transcriptome following domestication, identifying several genes that may have contributed to the evolution of maize, but highlights the complementary information that can be gained by combining gene expression with population genetic analyses.
57
Koch EN, Costanzo M, Bellay J, Deshpande R, Chatfield-Reed K, Chua G, D'Urso G, Andrews BJ, Boone C, Myers CL. Conserved rules govern genetic interaction degree across species. Genome Biol 2012; 13:R57. [PMID: 22747640 PMCID: PMC3491379 DOI: 10.1186/gb-2012-13-7-r57] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 12/01/2011] [Accepted: 07/02/2012] [Indexed: 11/10/2022] Open
Abstract
Background Synthetic genetic interactions have recently been mapped on a genome scale in the budding yeast Saccharomyces cerevisiae, providing a functional view of the central processes of eukaryotic life. Currently, comprehensive genetic interaction networks have not been determined for other species, and we therefore sought to model conserved aspects of genetic interaction networks in order to enable the transfer of knowledge between species. Results Using a combination of physiological and evolutionary properties of genes, we built models that successfully predicted the genetic interaction degree of S. cerevisiae genes. Importantly, a model trained on S. cerevisiae gene features and degree also accurately predicted interaction degree in the fission yeast Schizosaccharomyces pombe, suggesting that many of the predictive relationships discovered in S. cerevisiae also hold in this evolutionarily distant yeast. In both species, high single mutant fitness defect, protein disorder, pleiotropy, protein-protein interaction network degree, and low expression variation were significantly predictive of genetic interaction degree. A comparison of the predicted genetic interaction degrees of S. pombe genes to the degrees of S. cerevisiae orthologs revealed functional rewiring of specific biological processes that distinguish these two species. Finally, predicted differences in genetic interaction degree were independently supported by differences in co-expression relationships of the two species. Conclusions Our findings show that there are common relationships between gene properties and genetic interaction network topology in two evolutionarily distant species. This conservation allows use of the extensively mapped S. cerevisiae genetic interaction network as an orthology-independent reference to guide the study of more complex species.
58
Tseng GC, Ghosh D, Feingold E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Res 2012; 40:3785-99. [PMID: 22262733 PMCID: PMC3351145 DOI: 10.1093/nar/gkr1265] [Citation(s) in RCA: 281] [Impact Index Per Article: 21.6] [Indexed: 01/18/2023] Open
Abstract
With the rapid advances of various high-throughput technologies, generation of ‘-omics’ data is commonplace in almost every biomedical field. Effective data management and analytical approaches are essential to fully decipher the biological knowledge contained in the tremendous amount of experimental data. Meta-analysis, a set of statistical tools for combining multiple studies of a related hypothesis, has become popular in genomic research. Here, we perform a systematic search from PubMed and manual collection to obtain 620 genomic meta-analysis papers, of which 333 microarray meta-analysis papers are summarized as the basis of this paper and the other 249 GWAS meta-analysis papers are discussed in the next companion paper. The review in the present paper focuses on various biological purposes of microarray meta-analysis, databases and software and related statistical procedures. Statistical considerations of such an analysis are further scrutinized and illustrated by a case study. Finally, several open questions are listed and discussed.
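The abstract does not say which statistical procedures the review covers. As one widely used illustration of combining multiple studies of a related hypothesis, the sketch below implements Fisher's combined probability test for a single gene's p-values across independent studies. The choice of Fisher's method, the function name, and the example p-values are assumptions made for illustration only.

```python
from math import exp, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For even degrees of
    freedom the upper tail has the closed form
        P(chi2_{2k} >= X) = exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    Each p-value must lie in (0, 1]; the studies are assumed independent.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    half_stat = -sum(log(p) for p in p_values)  # equals X / 2
    term = 1.0   # current series term (X/2)^i / i!, starting at i = 0
    total = 1.0  # running sum of the series
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return exp(-half_stat) * total


# Toy data (hypothetical): p-values for one gene in three microarray studies.
study_p_values = [0.04, 0.10, 0.03]
print(f"combined p = {fisher_combined_p(study_p_values):.4g}")
```

Fisher's method rewards any single very small p-value, so it suits the question "is the gene differentially expressed in at least one study" better than "is it differentially expressed in all studies".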
Affiliation(s)
- George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA.
59
Abstract
Microarrays were one of the first technologies of the genomic revolution to gain widespread adoption, rapidly expanding from a cottage industry to the source of thousands of experimental results. They were one of the first assays for which data repositories and metadata were standardized and researchers were required by many journals to make published data publicly available. Microarrays provide high-throughput insights into the biological functions of genes and gene products; however, they also present a "curse of dimensionality," whereby the availability of many gene expression measurements in few samples makes it challenging to distinguish noise from true biological signal. All of these factors argue for integrative approaches to microarray data analysis, which combine data from multiple experiments to increase sample size, avoid laboratory-specific bias, and enable new biological insights not possible from a single experiment. Here, we discuss several approaches to integrative microarray analysis for a diverse range of applications, including biomarker discovery, gene function and interaction prediction, and regulatory network inference. We also show how, by integrating large microarray compendia with diverse genomic data types, more nuanced biological hypotheses can be explored computationally. This chapter provides overviews and brief descriptions of each of these approaches to microarray integration.
Affiliation(s)
- Levi Waldron
- Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
60
Tsiliki G, Kossida S. Fusion methodologies for biomedical data. J Proteomics 2011; 74:2774-85. [PMID: 21767675 DOI: 10.1016/j.jprot.2011.07.001] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 03/23/2011] [Revised: 06/13/2011] [Accepted: 07/01/2011] [Indexed: 12/12/2022]
Abstract
Data fusion methods are powerful tools for integrating the different views of an organism provided by various types of experimental data. We describe various methodologies for integrating and drawing inferences from a collection of biomedical data, primarily focusing on protein and gene expression data. Computational experiments performed using biomedical data, including known protein-protein interactions, hydropathy profiles, gene expression data and amino acid sequences, demonstrate the utility of this approach. Overall, studies agree in that methodologies using carefully selected data of various types to predict particular classes, groups and interactions, perform better than when applied to a single type of data.
Affiliation(s)
- Georgia Tsiliki
- Bioinformatics and Medical Informatics Group, Biomedical Research Foundation, Academy of Athens, 4 Soranou Ephessiou, 115 27, Athens, Greece.
61
Bellay J, Atluri G, Sing TL, Toufighi K, Costanzo M, Ribeiro PSM, Pandey G, Baller J, VanderSluis B, Michaut M, Han S, Kim P, Brown GW, Andrews BJ, Boone C, Kumar V, Myers CL. Putting genetic interactions in context through a global modular decomposition. Genome Res 2011; 21:1375-87. [PMID: 21715556 DOI: 10.1101/gr.117176.110] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.3] [Indexed: 11/24/2022]
Abstract
Genetic interactions provide a powerful perspective into gene function, but our knowledge of the specific mechanisms that give rise to these interactions is still relatively limited. The availability of a global genetic interaction map in Saccharomyces cerevisiae, covering ∼30% of all possible double mutant combinations, provides an unprecedented opportunity for an unbiased assessment of the native structure within genetic interaction networks and how it relates to gene function and modular organization. Toward this end, we developed a data mining approach to exhaustively discover all block structures within this network, which allowed for its complete modular decomposition. The resulting modular structures revealed the importance of the context of individual genetic interactions in their interpretation and revealed distinct trends among genetic interaction hubs as well as insights into the evolution of duplicate genes. Block membership also revealed a surprising degree of multifunctionality across the yeast genome and enabled a novel association of VIP1 and IPK1 with DNA replication and repair, which is supported by experimental evidence. Our modular decomposition also provided a basis for testing the between-pathway model of negative genetic interactions and within-pathway model of positive genetic interactions. While we find that most modular structures involving negative genetic interactions fit the between-pathway model, we found that current models for positive genetic interactions fail to explain 80% of the modular structures detected. We also find differences between the modular structures of essential and nonessential genes.
Affiliation(s)
- Jeremy Bellay
- Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, USA
62
Bellay J, Han S, Michaut M, Kim T, Costanzo M, Andrews BJ, Boone C, Bader GD, Myers CL, Kim PM. Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol 2011. [PMID: 21324131 DOI: 10.1186/gb-2011-12-2-r14] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/09/2023] Open
Abstract
BACKGROUND Intrinsically disordered regions are widespread, especially in proteomes of higher eukaryotes. Recently, protein disorder has been associated with a wide variety of cellular processes and has been implicated in several human diseases. Despite its apparent functional importance, the sheer range of different roles played by protein disorder often makes its exact contribution difficult to interpret. RESULTS We attempt to better understand the different roles of disorder using a novel analysis that leverages both comparative genomics and genetic interactions. Strikingly, we find that disorder can be partitioned into three biologically distinct phenomena: regions where disorder is conserved but with quickly evolving amino acid sequences (flexible disorder); regions of conserved disorder with also highly conserved amino acid sequences (constrained disorder); and, lastly, non-conserved disorder. Flexible disorder bears many of the characteristics commonly attributed to disorder and is associated with signaling pathways and multi-functionality. Conversely, constrained disorder has markedly different functional attributes and is involved in RNA binding and protein chaperones. Finally, non-conserved disorder lacks clear functional hallmarks based on our analysis. CONCLUSIONS Our new perspective on protein disorder clarifies a variety of previous results by putting them into a systematic framework. Moreover, the clear and distinct functional association of flexible and constrained disorder will allow for new approaches and more specific algorithms for disorder detection in a functional context. Finally, in flexible disordered regions, we demonstrate clear evolutionary selection of protein disorder with little selection on primary structure, which has important implications for sequence-based studies of protein structure and evolution.
Affiliation(s)
- Jeremy Bellay
- Department of Computer Science and Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, USA
63
Bellay J, Han S, Michaut M, Kim T, Costanzo M, Andrews BJ, Boone C, Bader GD, Myers CL, Kim PM. Bringing order to protein disorder through comparative genomics and genetic interactions. Genome Biol 2011; 12:R14. [PMID: 21324131 PMCID: PMC3188796 DOI: 10.1186/gb-2011-12-2-r14] [Citation(s) in RCA: 110] [Impact Index Per Article: 7.9] [Received: 09/29/2010] [Revised: 02/01/2011] [Accepted: 02/16/2011] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Intrinsically disordered regions are widespread, especially in proteomes of higher eukaryotes. Recently, protein disorder has been associated with a wide variety of cellular processes and has been implicated in several human diseases. Despite its apparent functional importance, the sheer range of different roles played by protein disorder often makes its exact contribution difficult to interpret. RESULTS We attempt to better understand the different roles of disorder using a novel analysis that leverages both comparative genomics and genetic interactions. Strikingly, we find that disorder can be partitioned into three biologically distinct phenomena: regions where disorder is conserved but with quickly evolving amino acid sequences (flexible disorder); regions of conserved disorder with also highly conserved amino acid sequences (constrained disorder); and, lastly, non-conserved disorder. Flexible disorder bears many of the characteristics commonly attributed to disorder and is associated with signaling pathways and multi-functionality. Conversely, constrained disorder has markedly different functional attributes and is involved in RNA binding and protein chaperones. Finally, non-conserved disorder lacks clear functional hallmarks based on our analysis. CONCLUSIONS Our new perspective on protein disorder clarifies a variety of previous results by putting them into a systematic framework. Moreover, the clear and distinct functional association of flexible and constrained disorder will allow for new approaches and more specific algorithms for disorder detection in a functional context. Finally, in flexible disordered regions, we demonstrate clear evolutionary selection of protein disorder with little selection on primary structure, which has important implications for sequence-based studies of protein structure and evolution.
Affiliation(s)
- Jeremy Bellay
- Department of Computer Science and Engineering, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455, USA
64
Chikina MD, Troyanskaya OG. Accurate quantification of functional analogy among close homologs. PLoS Comput Biol 2011; 7:e1001074. [PMID: 21304936 PMCID: PMC3033368 DOI: 10.1371/journal.pcbi.1001074] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.9] [Received: 06/04/2010] [Accepted: 01/02/2011] [Indexed: 11/18/2022] Open
Abstract
Correctly evaluating functional similarities among homologous proteins is necessary for accurate transfer of experimental knowledge from one organism to another, and is of particular importance for the development of animal models of human disease. While the fact that sequence similarity implies functional similarity is a fundamental paradigm of molecular biology, sequence comparison does not directly assess the extent to which two proteins participate in the same biological processes, and has limited utility for analyzing families with several paralogous members. Nevertheless, we show that it is possible to provide a cross-organism functional similarity measure in an unbiased way through the exclusive use of high-throughput gene-expression data. Our methodology is based on probabilistic cross-species mapping of functionally analogous proteins based on Bayesian integrative analysis of gene expression compendia. We demonstrate that even among closely related genes, our method is able to predict functionally analogous homolog pairs better than relying on sequence comparison alone. We also demonstrate that the landscape of functional similarity is often complex and that definitive “functional orthologs” do not always exist. Even in these cases, our method and the online interface we provide are designed to allow detailed exploration of sources of inferred functional similarity that can be evaluated by the user. Common ancestry is a central tenet of modern biology, as genes from different species often show a high degree of sequence similarity, making it possible to study analogous processes across model organisms. However, many genes belong to large families with several duplicates and the relationship between genes from different species is often not one-to-one, complicating the transfer of experimental knowledge.
We present a method that uses a large compendium of high-throughput expression data, covering many genes that have not been analyzed in any other way, to systematically predict which genes are most likely to participate in the same biological process and thus have analogous function in different organisms. We show that our method agrees well with current experimental knowledge and we use it to investigate several families of genes that demonstrate the complexity of functional analogy.
Affiliation(s)
- Maria D. Chikina
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America

- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail:
|
65
|
Pop A, Huttenhower C, Iyer-Pascuzzi A, Benfey PN, Troyanskaya OG. Integrated functional networks of process, tissue, and developmental stage specific interactions in Arabidopsis thaliana. BMC Syst Biol 2010; 4:180. [PMID: 21194434 PMCID: PMC3023688 DOI: 10.1186/1752-0509-4-180] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Received: 04/14/2010] [Accepted: 12/31/2010] [Indexed: 11/21/2022]
Abstract
Background: Recent years have seen an explosion in plant genomics, as the difficulties inherent in sequencing and functionally analyzing these biologically and economically significant organisms have been overcome. Arabidopsis thaliana, a versatile model organism, represents an opportunity to evaluate the predictive power of biological network inference for plant functional genomics. Results: Here, we provide a compendium of functional relationship networks for Arabidopsis thaliana leveraging data integration based on over 60 microarray, physical and genetic interaction, and literature curation datasets. These include tissue, biological process, and development stage specific networks, each predicting relationships specific to an individual biological context. These biological networks enable the rapid investigation of uncharacterized genes in specific tissues and developmental stages of interest and summarize a very large collection of A. thaliana data for biological examination. We found validation in the literature for many of our predicted networks, including those involved in disease resistance, root hair patterning, and auxin homeostasis. Conclusions: These context-specific networks demonstrate that highly specific biological hypotheses can be generated for a diversity of individual processes, developmental stages, and plant tissues in A. thaliana. All predicted functional networks are available online at http://function.princeton.edu/arathGraphle.
Affiliation(s)
- Ana Pop
- Computer Science Department, Princeton University, Princeton, NJ, USA
|
66
|
Narayanan M, Vetta A, Schadt EE, Zhu J. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Comput Biol 2010; 6:e1000742. [PMID: 20419151 PMCID: PMC2855327 DOI: 10.1371/journal.pcbi.1000742] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.2] [Received: 09/17/2009] [Accepted: 03/15/2010] [Indexed: 12/04/2022] Open
Abstract
Many genome-wide datasets are routinely generated to study different aspects of biological systems, but integrating them to obtain a coherent view of the underlying biology remains a challenge. We propose simultaneous clustering of multiple networks as a framework to integrate large-scale datasets on the interactions among and activities of cellular components. Specifically, we develop an algorithm JointCluster that finds sets of genes that cluster well in multiple networks of interest, such as coexpression networks summarizing correlations among the expression profiles of genes and physical networks describing protein-protein and protein-DNA interactions among genes or gene-products. Our algorithm provides an efficient solution to a well-defined problem of jointly clustering networks, using techniques that permit certain theoretical guarantees on the quality of the detected clustering relative to the optimal clustering. These guarantees coupled with an effective scaling heuristic and the flexibility to handle multiple heterogeneous networks make our method JointCluster an advance over earlier approaches. Simulation results showed JointCluster to be more robust than alternate methods in recovering clusters implanted in networks with high false positive rates. In systematic evaluation of JointCluster and some earlier approaches for combined analysis of the yeast physical network and two gene expression datasets under glucose and ethanol growth conditions, JointCluster discovers clusters that are more consistently enriched for various reference classes capturing different aspects of yeast biology or yield better coverage of the analysed genes. These robust clusters, which are supported across multiple genomic datasets and diverse reference classes, agree with known biology of yeast under these growth conditions, elucidate the genetic control of coordinated transcription, and enable functional predictions for a number of uncharacterized genes. 
The generation of high-dimensional datasets in the biological sciences has become routine (protein interaction, gene expression, and DNA/RNA sequence data, to name a few), stretching our ability to derive novel biological insights from them, with even less effort focused on integrating these disparate datasets available in the public domain. Hence a most pressing problem in the life sciences today is the development of algorithms to combine large-scale data on different biological dimensions to maximize our understanding of living systems. We present an algorithm for simultaneously clustering multiple biological networks to identify coherent sets of genes (clusters) underlying cellular processes. The algorithm allows theoretical guarantees on the quality of the detected clusters relative to the optimal clusters that are computationally infeasible to find, and could be applied to coexpression, protein interaction, protein-DNA networks, and other network types. When combining multiple physical and gene expression based networks in yeast, the clusters we identify are consistently enriched for reference classes capturing diverse aspects of biology, yield good coverage of the analysed genes, and highlight novel members in well-studied cellular processes.
Affiliation(s)
- Manikandan Narayanan
- Department of Genetics, Rosetta Inpharmatics (Merck), Seattle, Washington, United States of America
- * E-mail: (MN); (JZ)

- Adrian Vetta
- Department of Mathematics and Statistics, and School of Computer Science, McGill University, Montreal, Quebec, Canada

- Eric E. Schadt
- Department of Genetics, Rosetta Inpharmatics (Merck), Seattle, Washington, United States of America

- Jun Zhu
- Department of Genetics, Rosetta Inpharmatics (Merck), Seattle, Washington, United States of America
- * E-mail: (MN); (JZ)
|
67
|
Transcriptional profiling of growth perturbations of the human malaria parasite Plasmodium falciparum. Nat Biotechnol 2009; 28:91-8. [PMID: 20037583 DOI: 10.1038/nbt.1597] [Citation(s) in RCA: 152] [Impact Index Per Article: 9.5] [Received: 09/08/2009] [Accepted: 12/06/2009] [Indexed: 12/25/2022]
Abstract
Functions have yet to be defined for the majority of genes of Plasmodium falciparum, the agent responsible for the most serious form of human malaria. Here we report changes in P. falciparum gene expression induced by 20 compounds that inhibit growth of the schizont stage of the intraerythrocytic development cycle. In contrast with previous studies, which reported only minimal changes in response to chemically induced perturbations of P. falciparum growth, we find that approximately 59% of its coding genes display over three-fold changes in expression in response to at least one of the chemicals we tested. We use this compendium for guilt-by-association prediction of protein function using an interaction network constructed from gene co-expression, sequence homology, domain-domain and yeast two-hybrid data. The subcellular localizations of 31 of 42 proteins linked with merozoite invasion are consistent with their role in this process, a key target for malaria control. Our network may facilitate identification of novel antimalarial drugs and vaccines.
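The guilt-by-association step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything below is an assumption made for illustration: the function name `guilt_by_association`, the weighted-vote score, and the toy gene and function labels are all hypothetical, and the edge weights stand in for whatever combined evidence the real network carries.

```python
from collections import defaultdict

def guilt_by_association(edges, annotations):
    """Predict one function per unannotated gene by a weighted neighbor vote.

    edges: iterable of (gene_a, gene_b, weight) for an undirected network.
    annotations: {gene: set of known function labels}.
    Returns {gene: (best_function, score)}, where score is the fraction of
    the gene's total edge weight that leads to neighbors with that function.
    NOTE: this voting rule is an illustrative assumption, not the paper's method.
    """
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = w
        neighbors[b][a] = w

    predictions = {}
    for gene, nbrs in neighbors.items():
        if gene in annotations:
            continue  # only score genes without a known function
        total = sum(nbrs.values())
        if total <= 0:
            continue
        votes = defaultdict(float)
        for nbr, w in nbrs.items():
            for func in annotations.get(nbr, ()):
                votes[func] += w
        if votes:
            # break ties deterministically by function name
            best = max(votes, key=lambda f: (votes[f], f))
            predictions[gene] = (best, votes[best] / total)
    return predictions

# Hypothetical toy network: g1 and g5 have no known function.
toy_edges = [("g1", "g2", 0.9), ("g1", "g3", 0.6), ("g1", "g4", 0.5), ("g5", "g4", 0.8)]
toy_annotations = {"g2": {"invasion"}, "g3": {"invasion"}, "g4": {"transport"}}
toy_predictions = guilt_by_association(toy_edges, toy_annotations)
```

In this toy case most of the edge weight around `g1` leads to genes labeled "invasion", so that label wins the vote. A real analysis would also need a significance cutoff, which this sketch leaves out.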
|
68
|
Huttenhower C, Mehmood SO, Troyanskaya OG. Graphle: Interactive exploration of large, dense graphs. BMC Bioinformatics 2009; 10:417. [PMID: 20003429 PMCID: PMC2803856 DOI: 10.1186/1471-2105-10-417] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 04/06/2009] [Accepted: 12/14/2009] [Indexed: 01/24/2023] Open
Abstract
Background: A wide variety of biological data can be modeled as network structures, including experimental results (e.g. protein-protein interactions), computational predictions (e.g. functional interaction networks), or curated structures (e.g. the Gene Ontology). While several tools exist for visualizing large graphs at a global level or small graphs in detail, previous systems have generally not allowed interactive analysis of dense networks containing thousands of vertices at a level of detail useful for biologists. Investigators often wish to explore specific portions of such networks from a detailed, gene-specific perspective, and balancing this requirement with the networks' large size, complex structure, and rich metadata is a substantial computational challenge. Results: Graphle is an online interface to large collections of arbitrary undirected, weighted graphs, each possibly containing tens of thousands of vertices (e.g. genes) and hundreds of millions of edges (e.g. interactions). These are stored on a centralized server and accessed efficiently through an interactive Java applet. The Graphle applet allows a user to examine specific portions of a graph, retrieving the relevant neighborhood around a set of query vertices (genes). This neighborhood can then be refined and modified interactively, and the results can be saved either as publication-quality images or as raw data for further analysis. The Graphle web site currently includes several hundred biological networks representing predicted functional relationships from three heterogeneous data integration systems: S. cerevisiae data from bioPIXIE, E. coli data using MEFIT, and H. sapiens data from HEFalMp. Conclusions: Graphle serves as a search and visualization engine for biological networks, which can be managed locally (simplifying collaborative data sharing) and investigated remotely. The Graphle framework is freely downloadable and easily installed on new servers, allowing any lab to quickly set up a Graphle site from which their own biological network data can be shared online.
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, NJ 08540, USA.
|
69
|
Adler P, Kolde R, Kull M, Tkachenko A, Peterson H, Reimand J, Vilo J. Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biol 2009; 10:R139. [PMID: 19961599 PMCID: PMC2812946 DOI: 10.1186/gb-2009-10-12-r139] [Citation(s) in RCA: 117] [Impact Index Per Article: 7.3] [Received: 08/13/2009] [Revised: 10/25/2009] [Accepted: 12/04/2009] [Indexed: 11/25/2022] Open
Abstract
The MEM web resource allows users to search for co-expressed genes across all microarray datasets in the ArrayExpress database. We present a web resource MEM (Multi-Experiment Matrix) for gene expression similarity searches across many datasets. MEM features large collections of microarray datasets and utilizes rank aggregation to merge information from different datasets into a single global ordering with simultaneous statistical significance estimation. Unique features of MEM include automatic detection, characterization and visualization of datasets that include the strongest coexpression patterns. MEM is freely available at http://biit.cs.ut.ee/mem/.
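To make the idea of merging per-dataset orderings into one global ordering concrete, here is a minimal sketch. It is not MEM's algorithm: it swaps in a plain mean of normalized ranks and omits the significance estimation the abstract mentions. The function names (`pearson`, `rank_in_dataset`, `aggregate_ranks`) and the toy data are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_in_dataset(dataset, query):
    """Rank all other genes by decreasing correlation with the query gene.

    dataset: {gene: list of expression values}. Returns {gene: rank in (0, 1]}.
    """
    others = [g for g in dataset if g != query]
    others.sort(key=lambda g: pearson(dataset[query], dataset[g]), reverse=True)
    return {g: (i + 1) / len(others) for i, g in enumerate(others)}

def aggregate_ranks(datasets, query):
    """Order genes by their mean normalized rank over datasets containing the query.

    Simplification (assumption): a plain mean of ranks, with no significance test.
    """
    totals, counts = {}, {}
    for dataset in datasets:
        if query not in dataset:
            continue
        for gene, rank in rank_in_dataset(dataset, query).items():
            totals[gene] = totals.get(gene, 0.0) + rank
            counts[gene] = counts.get(gene, 0) + 1
    mean_rank = {g: totals[g] / counts[g] for g in totals}
    return sorted(mean_rank, key=lambda g: (mean_rank[g], g))

# Hypothetical toy datasets; gene "C" is measured only in the first one.
toy_ds1 = {"Q": [1, 2, 3, 4], "A": [2, 4, 6, 8], "B": [4, 3, 2, 1], "C": [1, 3, 2, 4]}
toy_ds2 = {"Q": [1, 2, 3], "A": [1, 2, 4], "B": [3, 2, 1]}
toy_order = aggregate_ranks([toy_ds1, toy_ds2], "Q")
```

Normalizing each rank by the number of genes in its dataset keeps datasets of different sizes comparable, which matters when genes are missing from some platforms.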
Affiliation(s)
- Priit Adler
- Institute of Molecular and Cell Biology, Riia 23, 51010 Tartu, Estonia.
|
70
|
Sinha A, Hripcsak G, Markatou M. Large datasets in biomedicine: a discussion of salient analytic issues. J Am Med Inform Assoc 2009; 16:759-67. [PMID: 19717808 PMCID: PMC3002128 DOI: 10.1197/jamia.m2780] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Received: 03/03/2008] [Accepted: 08/02/2009] [Indexed: 11/10/2022] Open
Abstract
Advances in high-throughput and mass-storage technologies have led to an information explosion in both biology and medicine, presenting novel challenges for analysis and modeling. With regard to multivariate analysis techniques such as clustering, classification, and regression, large datasets present unique and often misunderstood challenges. The authors' goal is to provide a discussion of the salient problems encountered in the analysis of large datasets as they relate to modeling and inference to inform a principled and generalizable analysis and highlight the interdisciplinary nature of these challenges. The authors present a detailed study of germane issues including high dimensionality, multiple testing, scientific significance, dependence, information measurement, and information management with a focus on appropriate methodologies available to address these concerns. A firm understanding of the challenges and statistical technology involved ultimately contributes to better science. The authors further suggest that the community consider facilitating discussion through interdisciplinary panels, invited papers and curriculum enhancement to establish guidelines for analysis and reporting.
Affiliation(s)
- Anshu Sinha
- Department of Biomedical Informatics, Columbia University, New York, NY

- George Hripcsak
- Department of Biomedical Informatics, Columbia University, New York, NY
|
71
|
Dudley JT, Tibshirani R, Deshpande T, Butte AJ. Disease signatures are robust across tissues and experiments. Mol Syst Biol 2009; 5:307. [PMID: 19756046 PMCID: PMC2758720 DOI: 10.1038/msb.2009.66] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.8] [Received: 03/27/2009] [Accepted: 08/17/2009] [Indexed: 11/09/2022] Open
Abstract
Meta-analyses combining gene expression microarray experiments offer new insights into the molecular pathophysiology of disease not evident from individual experiments. Although the established technical reproducibility of microarrays serves as a basis for meta-analysis, pathophysiological reproducibility across experiments is not well established. In this study, we carried out a large-scale analysis of disease-associated experiments obtained from NCBI GEO, and evaluated their concordance across a broad range of diseases and tissue types. On evaluating 429 experiments, representing 238 diseases and 122 tissues from 8435 microarrays, we find evidence for a general, pathophysiological concordance between experiments measuring the same disease condition. Furthermore, we find that the molecular signature of disease across tissues is overall more prominent than the signature of tissue expression across diseases. The results offer new insight into the quality of public microarray data using pathophysiological metrics, and support new directions in meta-analysis that include characterization of the commonalities of disease irrespective of tissue, as well as the creation of multi-tissue systems models of disease pathology using public data.
Affiliation(s)
- Joel T Dudley
- Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University School of Medicine, Stanford, CA, USA
|
72
|
Baughman JM, Nilsson R, Gohil VM, Arlow DH, Gauhar Z, Mootha VK. A computational screen for regulators of oxidative phosphorylation implicates SLIRP in mitochondrial RNA homeostasis. PLoS Genet 2009; 5:e1000590. [PMID: 19680543 PMCID: PMC2721412 DOI: 10.1371/journal.pgen.1000590] [Citation(s) in RCA: 118] [Impact Index Per Article: 7.4] [Received: 03/03/2009] [Accepted: 07/09/2009] [Indexed: 11/18/2022] Open
Abstract
The human oxidative phosphorylation (OxPhos) system consists of approximately 90 proteins encoded by nuclear and mitochondrial genomes and serves as the primary cellular pathway for ATP biosynthesis. While the core protein machinery for OxPhos is well characterized, many of its assembly, maturation, and regulatory factors remain unknown. We exploited the tight transcriptional control of the genes encoding the core OxPhos machinery to identify novel regulators. We developed a computational procedure, which we call expression screening, which integrates information from thousands of microarray data sets in a principled manner to identify genes that are consistently co-expressed with a target pathway across biological contexts. We applied expression screening to predict dozens of novel regulators of OxPhos. For two candidate genes, CHCHD2 and SLIRP, we show that silencing with RNAi results in destabilization of OxPhos complexes and a marked loss of OxPhos enzymatic activity. Moreover, we show that SLIRP plays an essential role in maintaining mitochondrial-localized mRNA transcripts that encode OxPhos protein subunits. Our findings provide a catalogue of potential novel OxPhos regulators that advance our understanding of the coordination between nuclear and mitochondrial genomes for the regulation of cellular energy metabolism. Respiratory chain disorders represent the largest class of inborn errors in metabolism affecting 1 in every 5,000 individuals. Biochemically, these disorders are characterized by a breakdown in the cellular process called oxidative phosphorylation (OxPhos), which is responsible for generating most of the cell's energy in the form of ATP. Sadly, for approximately 50% of patients diagnosed, we do not know the molecular cause behind these disorders. One possible reason for our limited diagnostic capability is that these patients harbor a mutation in a gene that is not known to act in the OxPhos pathway. 
We therefore designed a computational strategy called expression screening that integrates publicly available genome-wide gene expression data to predict new genes that may play a role in OxPhos biology. We identified several uncharacterized genes that were strongly predicted by our procedure to function in the OxPhos pathway and experimentally validated two genes, SLIRP and CHCHD2, as being essential for OxPhos function. These genes, as well as others predicted by expression screening to regulate OxPhos, represent a valuable resource for identifying the molecular underpinnings of respiratory chain disorders.
Affiliation(s)
- Joshua M. Baughman
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America

- Roland Nilsson
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America

- Vishal M. Gohil
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America

- Daniel H. Arlow
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America

- Zareen Gauhar
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America

- Vamsi K. Mootha
- Center for Human Genetic Research, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America
- Department of Systems Biology, Harvard Medical School, Boston, Massachusetts, United States of America
- * E-mail:
|
73
|
Huttenhower C, Hibbs MA, Myers CL, Caudy AA, Hess DC, Troyanskaya OG. The impact of incomplete knowledge on evaluation: an experimental benchmark for protein function prediction. Bioinformatics 2009; 25:2404-10. [PMID: 19561015 DOI: 10.1093/bioinformatics/btp397] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Rapidly expanding repositories of highly informative genomic data have generated increasing interest in methods for protein function prediction and inference of biological networks. The successful application of supervised machine learning to these tasks requires a gold standard for protein function: a trusted set of correct examples, which can be used to assess performance through cross-validation or other statistical approaches. Since gene annotation is incomplete for even the best studied model organisms, the biological reliability of such evaluations may be called into question. RESULTS: We address this concern by constructing and analyzing an experimentally based gold standard through comprehensive validation of protein function predictions for mitochondrion biogenesis in Saccharomyces cerevisiae. Specifically, we determine that (i) current machine learning approaches are able to generalize and predict novel biology from an incomplete gold standard and (ii) incomplete functional annotations adversely affect the evaluation of machine learning performance. While computational approaches performed better than predicted in the face of incomplete data, relative comparison of competing approaches, even those employing the same training data, is problematic with a sparse gold standard. Incomplete knowledge causes individual methods' performances to be differentially underestimated, resulting in misleading performance evaluations. We provide a benchmark gold standard for yeast mitochondria to complement current databases and an analysis of our experimental results in the hopes of mitigating these effects in future comparative evaluations. AVAILABILITY: The mitochondrial benchmark gold standard, as well as experimental results and additional data, is available at http://function.princeton.edu/mitochondria.
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, NJ 08540-5233, USA
|
74
|
Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG. Exploring the human genome with functional maps. Genome Res 2009; 19:1093-106. [PMID: 19246570 PMCID: PMC2694471 DOI: 10.1101/gr.082214.108] [Citation(s) in RCA: 147] [Impact Index Per Article: 9.2] [Received: 07/18/2008] [Accepted: 02/09/2009] [Indexed: 11/24/2022]
Abstract
Human genomic data of many types are readily available, but the complexity and scale of human molecular biology make it difficult to integrate this body of data, understand it from a systems level, and apply it to the study of specific pathways or genetic disorders. An investigator could best explore a particular protein, pathway, or disease if given a functional map summarizing the data and interactions most relevant to his or her area of interest. Using a regularized Bayesian integration system, we provide maps of functional activity and interaction networks in over 200 areas of human cellular biology, each including information from approximately 30,000 genome-scale experiments pertaining to approximately 25,000 human genes. Key to these analyses is the ability to efficiently summarize this large data collection from a variety of biologically informative perspectives: prediction of protein function and functional modules, cross-talk among biological processes, and association of novel genes and pathways with known genetic disorders. In addition to providing maps of each of these areas, we also identify biological processes active in each data set. Experimental investigation of five specific genes, AP3B1, ATP6AP1, BLOC1S1, LAMP2, and RAB11A, has confirmed novel roles for these proteins in the proper initiation of macroautophagy in amino acid-starved human fibroblasts. Our functional maps can be explored using HEFalMp (Human Experimental/Functional Mapper), a web interface allowing interactive visualization and investigation of this large body of information.
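As a rough illustration of how evidence from many datasets can be combined into one probability that two genes are functionally related, the sketch below uses plain naive Bayes. This is an assumption made for illustration only: the abstract describes a regularized Bayesian integration system whose details are not given here, and the function name, prior, evidence bins, and likelihood values below are all hypothetical.

```python
from math import exp, log

def posterior_related(evidence, likelihoods, prior=0.01):
    """Posterior probability that a gene pair is functionally related.

    evidence: {dataset_name: observed_bin} for one gene pair.
    likelihoods: {dataset_name: {bin: (P(bin | related), P(bin | unrelated))}}.
    Assumes datasets are conditionally independent (naive Bayes); no
    regularization is applied, unlike the system described in the abstract.
    """
    log_odds = log(prior / (1.0 - prior))
    for dataset, observed_bin in evidence.items():
        p_related, p_unrelated = likelihoods[dataset][observed_bin]
        log_odds += log(p_related / p_unrelated)
    return 1.0 / (1.0 + exp(-log_odds))

# Hypothetical likelihood tables for two datasets with "high"/"low" correlation bins.
toy_likelihoods = {
    "array_a": {"high": (0.8, 0.2), "low": (0.2, 0.8)},
    "array_b": {"high": (0.8, 0.2), "low": (0.2, 0.8)},
}
toy_posterior = posterior_related(
    {"array_a": "high", "array_b": "high"}, toy_likelihoods, prior=0.5
)
```

Working in log-odds keeps the update additive, so each dataset contributes one term. With thousands of partially redundant datasets the independence assumption overstates confidence, which is one motivation for regularizing the integration.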
Affiliation(s)
- Curtis Huttenhower
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA

- Erin M. Haley
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA

- Vanessa Dumeaux
- Institute of Community Medicine, Tromsø University, Tromsø, Norway

- Daniel R. Barrett
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA

- Hilary A. Coller
- Department of Molecular Biology, Princeton University, Princeton, New Jersey 08544, USA

- Olga G. Troyanskaya
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08544, USA
|
75
|
Barrett AB, Phan JH, Wang MD. Combining multiple microarray studies using bootstrap meta-analysis. Annu Int Conf IEEE Eng Med Biol Soc 2009; 2008:5660-3. [PMID: 19164001 DOI: 10.1109/iembs.2008.4650498] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022]
Abstract
Microarray technology has enabled us to simultaneously measure the expression of thousands of genes. Using this high-throughput data collection, we can examine subtle genetic changes between biological samples and build predictive models for clinical applications. Although microarrays have dramatically increased the rate of data collection, sample size is still a major issue in feature selection. Previous methods show that microarray data combination is successful in improving selection when using z-scores and fold change. We propose a wrapper based gene selection technique that combines bootstrap estimated classification errors for individual genes across multiple datasets. The bootstrap is an unbiased estimator of classification error and has been shown to be effective for small sample data. Coupled with data combination across multiple data sets, we show that this meta-analytic approach improves gene selection.
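The approach described in this abstract, bootstrap-estimated classification error for individual genes combined across datasets, can be sketched as follows. The abstract does not specify the classifier or the combination rule, so both are assumptions here: a midpoint threshold between class means, out-of-bag error per bootstrap replicate, and a plain average over datasets. All function names and toy values are hypothetical.

```python
import random

def threshold_error(train, test):
    """Fit a midpoint threshold between class means on train; return test error.

    Samples are (expression_value, label) pairs with label 0 or 1.
    Returns None when the split cannot be evaluated.
    """
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    if not pos or not neg or not test:
        return None
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    cut = (mean_pos + mean_neg) / 2.0
    high_is_pos = mean_pos >= mean_neg
    wrong = sum(1 for x, y in test if ((x > cut) == high_is_pos) != (y == 1))
    return wrong / len(test)

def bootstrap_error(samples, n_boot=200, seed=0):
    """Mean out-of-bag error of the threshold classifier over bootstrap replicates."""
    rng = random.Random(seed)
    n = len(samples)
    errors = []
    for _ in range(n_boot):
        drawn = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        in_bag = set(drawn)
        train = [samples[i] for i in drawn]
        test = [samples[i] for i in range(n) if i not in in_bag]
        err = threshold_error(train, test)
        if err is not None:
            errors.append(err)
    return sum(errors) / len(errors) if errors else float("nan")

def combined_error(per_dataset_samples, n_boot=200, seed=0):
    """Average one gene's bootstrap error over several datasets (assumed rule)."""
    errs = [bootstrap_error(s, n_boot=n_boot, seed=seed) for s in per_dataset_samples]
    return sum(errs) / len(errs)

# Hypothetical expression values for one gene in one dataset: classes separate cleanly.
toy_gene = [(0.1, 0), (0.2, 0), (0.3, 0), (0.15, 0),
            (0.9, 1), (1.0, 1), (1.1, 1), (0.95, 1)]
toy_error = bootstrap_error(toy_gene)
```

Genes would then be ranked by `combined_error`, lowest first. Averaging treats every dataset equally; weighting by sample size is an obvious alternative that this sketch does not explore.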
Affiliation(s)
- Andrea B Barrett
- Department of Biomedical Engineering at the Georgia Institute of Technology, Atlanta, 30318 USA.
|
76
|
Hess DC, Myers CL, Huttenhower C, Hibbs MA, Hayes AP, Paw J, Clore JJ, Mendoza RM, Luis BS, Nislow C, Giaever G, Costanzo M, Troyanskaya OG, Caudy AA. Computationally driven, quantitative experiments discover genes required for mitochondrial biogenesis. PLoS Genet 2009; 5:e1000407. [PMID: 19300474 PMCID: PMC2648979 DOI: 10.1371/journal.pgen.1000407] [Citation(s) in RCA: 116] [Impact Index Per Article: 7.3] [Received: 11/13/2008] [Accepted: 02/05/2009] [Indexed: 01/09/2023] Open
Abstract
Mitochondria are central to many cellular processes including respiration, ion homeostasis, and apoptosis. Using computational predictions combined with traditional quantitative experiments, we have identified 100 proteins whose deficiency alters mitochondrial biogenesis and inheritance in Saccharomyces cerevisiae. In addition, we used computational predictions to perform targeted double-mutant analysis detecting another nine genes with synthetic defects in mitochondrial biogenesis. This represents an increase of about 25% over previously known participants. Nearly half of these newly characterized proteins are conserved in mammals, including several orthologs known to be involved in human disease. Mutations in many of these genes demonstrate statistically significant mitochondrial transmission phenotypes more subtle than could be detected by traditional genetic screens or high-throughput techniques, and 47 have not been previously localized to mitochondria. We further characterized a subset of these genes using growth profiling and dual immunofluorescence, which identified genes specifically required for aerobic respiration and an uncharacterized cytoplasmic protein required for normal mitochondrial motility. Our results demonstrate that by leveraging computational analysis to direct quantitative experimental assays, we have characterized mutants with subtle mitochondrial defects whose phenotypes were undetected by high-throughput methods.
Mitochondria are the proverbial powerhouses of the cell, running the fundamental biochemical processes that produce energy from nutrients using oxygen. These processes are conserved in all eukaryotes, from humans to model organisms such as baker's yeast. In humans, mitochondrial dysfunction plays a role in a variety of diseases, including diabetes, neuromuscular disorders, and aging. In order to better understand fundamental mitochondrial biology, we studied genes involved in mitochondrial biogenesis in the yeast S. cerevisiae, discovering over 100 proteins with novel roles in this process. These experiments assigned function to 5% of the genes whose function was not known. In order to achieve this rapid rate of discovery, we developed a system incorporating highly quantitative experimental assays and an integrated, iterative process of computational protein function prediction. Beginning from relatively little prior knowledge, we found that computational predictions achieved about 60% accuracy and rapidly guided our laboratory work towards hundreds of promising candidate genes. Thus, in addition to providing a more thorough understanding of mitochondrial biology, this study establishes a framework for successfully integrating computation and experimentation to drive biological discovery. A companion manuscript, published in PLoS Computational Biology (doi:10.1371/journal.pcbi.1000322), discusses observations and conclusions important for the computational community.
Affiliation(s)
- David C. Hess
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Chad L. Myers
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Curtis Huttenhower
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Matthew A. Hibbs
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Alicia P. Hayes
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Jadine Paw
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - John J. Clore
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Rosa M. Mendoza
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Bryan San Luis
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Corey Nislow
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Guri Giaever
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Michael Costanzo
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (OGT); (AAC)
| | - Amy A. Caudy
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (OGT); (AAC)
|
77
|
Hibbs MA, Myers CL, Huttenhower C, Hess DC, Li K, Caudy AA, Troyanskaya OG. Directing experimental biology: a case study in mitochondrial biogenesis. PLoS Comput Biol 2009; 5:e1000322. [PMID: 19300515 PMCID: PMC2654405 DOI: 10.1371/journal.pcbi.1000322] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Received: 03/03/2008] [Accepted: 02/06/2009] [Indexed: 11/25/2022]
Abstract
Computational approaches have promised to organize collections of functional genomics data into testable predictions of gene and protein involvement in biological processes and pathways. However, few such predictions have been experimentally validated on a large scale, leaving many bioinformatic methods unproven and underutilized in the biology community. Further, it remains unclear what biological concerns should be taken into account when using computational methods to drive real-world experimental efforts. To investigate these concerns and to establish the utility of computational predictions of gene function, we experimentally tested hundreds of predictions generated from an ensemble of three complementary methods for the process of mitochondrial organization and biogenesis in Saccharomyces cerevisiae. The biological data with respect to the mitochondria are presented in a companion manuscript published in PLoS Genetics (doi:10.1371/journal.pgen.1000407). Here we analyze and explore the results of this study that are broadly applicable for computationalists applying gene function prediction techniques, including a new experimental comparison with 48 genes representing the genomic background. Our study leads to several conclusions that are important to consider when driving laboratory investigations using computational prediction approaches. While most genes in yeast are already known to participate in at least one biological process, we confirm that genes with known functions can still be strong candidates for annotation of additional gene functions. We find that different analysis techniques and different underlying data can both greatly affect the types of functional predictions produced by computational methods. This diversity allows an ensemble of techniques to substantially broaden the biological scope and breadth of predictions. 
We also find that performing prediction and validation steps iteratively allows us to more completely characterize a biological area of interest. While this study focused on a specific functional area in yeast, many of these observations may be useful in the contexts of other processes and organisms.
Genome sequencing has provided us with “parts lists” of genes for many organisms, but many of the biological roles of these genes are still unknown. While a great deal of functional genomic data exists, providing information about these genes and their roles, the rate at which these data are leveraged into concrete biological knowledge lags far behind the rate of data generation. Many computational approaches have been developed to generate accurate predictions of gene functions, with the goal of bridging this divide. However, as no large-scale experimental efforts have been based on such approaches, their validity and utility remain unproven. We have performed a study that experimentally evaluates predictions from a combination of three computational function prediction approaches, focusing on mitochondrion-related processes in brewer's yeast as a model system. By using computational predictions to guide our laboratory investigation, we have greatly accelerated the rate at which proteins can be assigned to biological processes. Further, our results demonstrate that in order to achieve the best results, it is important for computational biologists to consider both the underlying data and the algorithmic foundations of the methods used to predict function. Lastly, we demonstrate that iterating through phases of prediction and validation has quickly and extensively expanded our knowledge of mitochondrial biology.
Affiliation(s)
- Matthew A. Hibbs
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Chad L. Myers
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, Minnesota, United States of America
| | - Curtis Huttenhower
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - David C. Hess
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
| | - Kai Li
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Amy A. Caudy
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Carl Icahn Laboratory, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail:
|
78
|
Airoldi EM, Huttenhower C, Gresham D, Lu C, Caudy AA, Dunham MJ, Broach JR, Botstein D, Troyanskaya OG. Predicting cellular growth from gene expression signatures. PLoS Comput Biol 2009; 5:e1000257. [PMID: 19119411 PMCID: PMC2599889 DOI: 10.1371/journal.pcbi.1000257] [Citation(s) in RCA: 77] [Impact Index Per Article: 4.8] [Received: 09/02/2008] [Accepted: 11/18/2008] [Indexed: 11/18/2022]
Abstract
Maintaining balanced growth in a changing environment is a fundamental systems-level challenge for cellular physiology, particularly in microorganisms. While the complete set of regulatory and functional pathways supporting growth and cellular proliferation are not yet known, portions of them are well understood. In particular, cellular proliferation is governed by mechanisms that are highly conserved from unicellular to multicellular organisms, and the disruption of these processes in metazoans is a major factor in the development of cancer. In this paper, we develop statistical methodology to identify quantitative aspects of the regulatory mechanisms underlying cellular proliferation in Saccharomyces cerevisiae. We find that the expression levels of a small set of genes can be exploited to predict the instantaneous growth rate of any cellular culture with high accuracy. The predictions obtained in this fashion are robust to changing biological conditions, experimental methods, and technological platforms. The proposed model is also effective in predicting growth rates for the related yeast Saccharomyces bayanus and the highly diverged yeast Schizosaccharomyces pombe, suggesting that the underlying regulatory signature is conserved across a wide range of unicellular evolution. We investigate the biological significance of the gene expression signature that the predictions are based upon from multiple perspectives: by perturbing the regulatory network through the Ras/PKA pathway, observing strong upregulation of growth rate even in the absence of appropriate nutrients, and discovering putative transcription factor binding sites, observing enrichment in growth-correlated genes. More broadly, the proposed methodology enables biological insights about growth at an instantaneous time scale, inaccessible by direct experimental methods. Data and tools enabling others to apply our methods are available at http://function.princeton.edu/growthrate.
Affiliation(s)
- Edoardo M. Airoldi
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Curtis Huttenhower
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - David Gresham
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
| | - Charles Lu
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
| | - Amy A. Caudy
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
| | - Maitreya J. Dunham
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
| | - James R. Broach
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
| | - David Botstein
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (DB); (OGT)
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (DB); (OGT)
|
79
|
Blom EJ, Breitling R, Hofstede KJ, Roerdink JBTM, van Hijum SAFT, Kuipers OP. Prosecutor: parameter-free inference of gene function for prokaryotes using DNA microarray data, genomic context and multiple gene annotation sources. BMC Genomics 2008; 9:495. [PMID: 18939968 PMCID: PMC2585105 DOI: 10.1186/1471-2164-9-495] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 06/18/2008] [Accepted: 10/21/2008] [Indexed: 01/23/2023]
Abstract
Background Despite a plethora of functional genomic efforts, the function of many genes in sequenced genomes remains unknown. The increasing amount of microarray data for many species allows employing the guilt-by-association principle to predict function on a large scale: genes exhibiting similar expression patterns are more likely to participate in shared biological processes. Results We developed Prosecutor, an application that enables researchers to rapidly infer gene function based on available gene expression data and functional annotations. Our parameter-free functional prediction method uses a sensitive algorithm to achieve a high association rate of linking genes with unknown function to annotated genes. Furthermore, Prosecutor utilizes additional biological information such as genomic context and known regulatory mechanisms that are specific for prokaryotes. We analyzed publicly available transcriptome data sets and used literature sources to validate putative functions suggested by Prosecutor. We supply the complete results of our analysis for 11 prokaryotic organisms on a dedicated website. Conclusion The Prosecutor software and supplementary datasets available at allow researchers working on any of the analyzed organisms to quickly identify the putative functions of their genes of interest. A de novo analysis allows new organisms to be studied.
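The guilt-by-association principle described in this abstract can be illustrated with a minimal Python sketch. This is not Prosecutor's algorithm (which the abstract describes as parameter-free); the Pearson-based weighted vote, the neighbor count `k`, and all gene names and expression values below are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / (sx * sy)


def predict_function(gene, profiles, annotations, k=3):
    """Vote over the k annotated genes most correlated with `gene`.

    Each positively correlated neighbor votes for its annotation,
    weighted by the correlation. `k` is an illustrative assumption.
    """
    scored = sorted(
        ((pearson(profiles[gene], profiles[g]), g)
         for g in annotations if g != gene),
        reverse=True,
    )
    votes = {}
    for r, g in scored[:k]:
        if r > 0:
            votes[annotations[g]] = votes.get(annotations[g], 0.0) + r
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy data: five conditions per gene.
profiles = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 10.5],
    "geneC": [5, 4, 3, 2, 1],
    "geneD": [1, 3, 2, 5, 4],
    "unknown": [1.1, 2.0, 2.9, 4.2, 5.0],
}
annotations = {
    "geneA": "ribosome biogenesis",
    "geneB": "ribosome biogenesis",
    "geneC": "stress response",
    "geneD": "stress response",
}
print(predict_function("unknown", profiles, annotations))
```

In this toy data the unannotated gene tracks `geneA` and `geneB` closely, so their shared annotation collects the most weight.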
Affiliation(s)
- Evert Jan Blom
- Molecular Genetics, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, the Netherlands.
|
80
|
[Association between ion channel subtype and its gene co-expression]. YI CHUAN = HEREDITAS 2008; 30:1157-62. [PMID: 18779173 DOI: 10.3724/sp.j.1005.2008.01157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
Abstract
The association between an ion channel's functional subtype and the expression of its genes is important for exploring ion channel function, annotating the function of unknown subtypes, and probing the molecular mechanisms of ion channel diseases. In this study, we first reduced noise by standardizing the original microarray data, which consisted of human and mouse gene expression profiles, and then applied principal component analysis (PCA) together with the fuzzy C-means clustering algorithm to the pre-processed profiles. PCA was used to rebuild the feature space of human genes in 21 dimensions and that of mouse genes in 26 dimensions, which greatly reduced computational complexity without losing much of the information in the original data. Fuzzy C-means clustering was then used to classify the human and mouse ion channel genes in their reduced feature spaces. Four ion channel functional subtypes (potassium channels, calcium channels, chloride channels, and receptor-mediated channels) were clustered in both the human and mouse gene feature spaces. We applied two statistical approaches to test the significance of these findings. First, we randomly sampled the data for each functional subtype of the ion channel genes and recorded the true positive rate; in both human and mouse gene feature spaces, genes belonging to the same functional subtype were more likely to be clustered together than expected by chance. Second, we performed a Kappa test using the functional subtypes as the gold standard; the consistency between the ion channel gene clusters and the ion channel gene subtypes was significantly high for both human and mouse. These results indicate that ion channel genes within the same functional subtype tend to be co-expressed, at least at the mRNA level.
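The fuzzy C-means step named in this abstract can be sketched in a few lines of Python. This is a generic textbook version and not the authors' implementation: the PCA reduction is omitted, the initial centers are supplied by the caller, and the fuzzifier `m = 2`, the iteration count, and the toy points are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def memberships(points, centers, m=2.0):
    """Soft membership of each point in each cluster (each row sums to 1)."""
    exp = 2.0 / (m - 1.0)
    U = []
    for p in points:
        # Floor distances to avoid division by zero when a point sits on a center.
        d = [max(dist(p, c), 1e-12) for c in centers]
        U.append([1.0 / sum((d[j] / d[k]) ** exp for k in range(len(centers)))
                  for j in range(len(centers))])
    return U


def update_centers(points, U, m=2.0):
    """Each center is the mean of all points weighted by membership**m."""
    dims = len(points[0])
    centers = []
    for j in range(len(U[0])):
        w = [row[j] ** m for row in U]
        total = sum(w)
        centers.append(tuple(
            sum(w[i] * points[i][d] for i in range(len(points))) / total
            for d in range(dims)))
    return centers


def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    for _ in range(iters):
        U = memberships(points, centers, m)
        centers = update_centers(points, U, m)
    return centers, memberships(points, centers, m)


# Hypothetical 2-D points standing in for genes in a reduced feature space.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, U = fuzzy_c_means(points, centers=[(0, 0), (10, 10)])
for p, row in zip(points, U):
    print(p, [round(u, 3) for u in row])
```

Unlike hard clustering, every gene keeps a graded membership in every cluster, which is what makes the method suited to genes whose expression sits between subtypes.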
|
81
|
Murali TM, Rivera CG. Network Legos: Building Blocks of Cellular Wiring Diagrams. J Comput Biol 2008; 15:829-44. [DOI: 10.1089/cmb.2007.0139] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 01/08/2023]
Affiliation(s)
- T. M. Murali
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA
| | - Corban G. Rivera
- Department of Computer Science, Virginia Polytechnic Institute and State University, Blacksburg, VA
|
82
|
Abstract
MOTIVATION The availability of genome-scale data has enabled an abundance of novel analysis techniques for investigating a variety of systems-level biological relationships. As thousands of such datasets become available, they provide an opportunity to study high-level associations between cellular pathways and processes. This also allows the exploration of shared functional enrichments between diverse biological datasets, and it serves to direct experimenters to areas of low data coverage or with high probability of new discoveries. RESULTS We analyze the functional structure of Saccharomyces cerevisiae datasets from over 950 publications in the context of over 140 biological processes. This includes a coverage analysis of biological processes given current high-throughput data, a data-driven map of associations between processes, and a measure of similar functional activity between genome-scale datasets. This uncovers subtle gene expression similarities in three otherwise disparate microarray datasets due to a shared strain background. We also provide several means of predicting areas of yeast biology likely to benefit from additional high-throughput experimental screens. AVAILABILITY Predictions are provided in supplementary tables; software and additional data are available from the authors by request. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- C Huttenhower
- Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08540, USA
|
83
|
Engelmann JC, Schwarz R, Blenk S, Friedrich T, Seibel PN, Dandekar T, Müller T. Unsupervised meta-analysis on diverse gene expression datasets allows insight into gene function and regulation. Bioinform Biol Insights 2008; 2:265-80. [PMID: 19812781 PMCID: PMC2735942 DOI: 10.4137/bbi.s665] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/11/2022]
Abstract
Over the past years, microarray databases have increased rapidly in size. While they offer a wealth of data, it remains challenging to integrate data arising from different studies. Here we propose an unsupervised approach of a large-scale meta-analysis on Arabidopsis thaliana whole genome expression datasets to gain additional insights into the function and regulation of genes. Applying kernel principal component analysis and hierarchical clustering, we found three major groups of experimental contrasts sharing a common biological trait. Genes associated to two of these clusters are known to play an important role in indole-3-acetic acid (IAA) mediated plant growth and development or pathogen defense. Novel functions could be assigned to genes including a cluster of serine/threonine kinases that carry two uncharacterized domains (DUF26) in their receptor part implicated in host defense. With the approach shown here, hidden interrelations between genes regulated under different conditions can be unraveled.
Affiliation(s)
- Julia C Engelmann
- Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, D-97074 Würzburg, Germany
|
84
|
Huttenhower C, Schroeder M, Chikina MD, Troyanskaya OG. The Sleipnir library for computational functional genomics. Bioinformatics 2008; 24:1559-61. [PMID: 18499696 PMCID: PMC2718674 DOI: 10.1093/bioinformatics/btn237] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.9] [Indexed: 11/30/2022]
Abstract
Motivation: Biological data generation has accelerated to the point where hundreds or thousands of whole-genome datasets of various types are available for many model organisms. This wealth of data can lead to valuable biological insights when analyzed in an integrated manner, but the computational challenge of managing such large data collections is substantial. In order to mine these data efficiently, it is necessary to develop methods that use storage, memory and processing resources carefully. Results: The Sleipnir C++ library implements a variety of machine learning and data manipulation algorithms with a focus on heterogeneous data integration and efficiency for very large biological data collections. Sleipnir allows microarray processing, functional ontology mining, clustering, Bayesian learning and inference and support vector machine tasks to be performed for heterogeneous data on scales not previously practical. In addition to the library, which can easily be integrated into new computational systems, prebuilt tools are provided to perform a variety of common tasks. Many tools are multithreaded for parallelization in desktop or high-throughput computing environments, and most tasks can be performed in minutes for hundreds of datasets using a standard personal computer. Availability: Source code (C++) and documentation are available at http://function.princeton.edu/sleipnir and compiled binaries are available from the authors on request. Contact: ogt@princeton.edu
Affiliation(s)
- Curtis Huttenhower
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, NJ 08540, USA
|
85
|
Aguilar D, Skrabanek L, Gross SS, Oliva B, Campagne F. Beyond tissueInfo: functional prediction using tissue expression profile similarity searches. Nucleic Acids Res 2008; 36:3728-37. [PMID: 18483083 PMCID: PMC2441795 DOI: 10.1093/nar/gkn233] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
We present and validate tissue expression profile similarity searches (TEPSS), a computational approach to identify transcripts that share similar tissue expression profiles to one or more transcripts in a group of interest. We evaluated TEPSS for its ability to discriminate between pairs of transcripts coding for interacting proteins and non-interacting pairs. We found that ordering protein-protein pairs by TEPSS score produces sets significantly enriched in reported pairs of interacting proteins [interacting versus non-interacting pairs, Odds-ratio (OR) = 157.57, 95% confidence interval (CI) (36.81-375.51) at 1% coverage, employing a large dataset of about 50 000 human protein interactions]. When used with multiple transcripts as input, we find that TEPSS can predict non-obvious members of the cytosolic ribosome. We used TEPSS to predict S-nitrosylation (SNO) protein targets from a set of brain proteins that undergo SNO upon exposure to physiological levels of S-nitrosoglutathione in vitro. While some of the top TEPSS predictions have been validated independently, several of the strongest SNO TEPSS predictions await experimental validation. Our data indicate that TEPSS is an effective and flexible approach to functional prediction. Since the approach does not use sequence similarity, we expect that TEPSS will be useful for various gene discovery applications. TEPSS programs and data are distributed at http://icb.med.cornell.edu/crt/tepss/index.xml.
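The general idea of a tissue expression profile similarity search, ranking candidate transcripts by how closely their tissue profiles resemble those of a query group, can be sketched as follows. The abstract does not specify how the TEPSS score is computed, so the cosine similarity averaged over the query group used here is an assumption, as are the transcript names, tissue order, and values.

```python
from math import sqrt


def cosine(x, y):
    """Cosine similarity between two tissue expression profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = sqrt(sum(a * a for a in x))
    ny = sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0  # an all-zero profile is similar to nothing
    return dot / (nx * ny)


def rank_by_profile_similarity(query_ids, profiles):
    """Rank non-query transcripts by mean similarity to the query group."""
    ranked = []
    for tid, prof in profiles.items():
        if tid in query_ids:
            continue
        score = sum(cosine(prof, profiles[q]) for q in query_ids) / len(query_ids)
        ranked.append((tid, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical expression levels in four tissues: brain, liver, heart, muscle.
profiles = {
    "T1": [9, 1, 1, 0],           # query transcript, brain-enriched
    "T2": [8, 2, 1, 1],           # query transcript, brain-enriched
    "cand_brain": [10, 1, 2, 0],
    "cand_liver": [0, 9, 1, 1],
    "cand_flat": [3, 3, 3, 3],
}
for tid, score in rank_by_profile_similarity({"T1", "T2"}, profiles):
    print(tid, round(score, 3))
```

Because the ranking depends only on expression profiles, it needs no sequence similarity between the query and the candidates, which matches the motivation given in the abstract.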
Affiliation(s)
- Daniel Aguilar
- HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Medical College of Cornell University, 1305 York Ave, New York, NY 10021, USA
|
86
|
Linghu B, Snitkin ES, Holloway DT, Gustafson AM, Xia Y, DeLisi C. High-precision high-coverage functional inference from integrated data sources. BMC Bioinformatics 2008; 9:119. [PMID: 18298847 PMCID: PMC2292694 DOI: 10.1186/1471-2105-9-119] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.5] [Received: 06/15/2007] [Accepted: 02/25/2008] [Indexed: 11/15/2022]
Abstract
Background Information obtained from diverse data sources can be combined in a principled manner using various machine learning methods to increase the reliability and range of knowledge about protein function. The result is a weighted functional linkage network (FLN) in which linked neighbors share at least one function with high probability. Precision is, however, low. Aiming to provide precise functional annotation for as many proteins as possible, we explore and propose a two-step framework for functional annotation: (1) construction of a high-coverage and reliable FLN via machine learning techniques, and (2) development of a decision rule for the constructed FLN to optimize functional annotation. Results We first apply this framework to Saccharomyces cerevisiae. In the first step, we demonstrate that four commonly used machine learning methods, Linear SVM, Linear Discriminant Analysis, Naïve Bayes, and Neural Network, all combine heterogeneous data to produce reliable and high-coverage FLNs, in which the linkage weight more accurately estimates functional coupling of linked proteins than individual data sources alone. In the second step, empirical tuning of an adjustable decision rule on the constructed FLN reveals that basing annotation on maximum edge weight results in the most precise annotation at high coverages. In particular, at low coverage all rules evaluated perform comparably. At coverage above approximately 50%, however, they diverge rapidly. At full coverage, the maximum weight decision rule still has a precision of approximately 70%, whereas for other methods, precision ranges from a high of slightly more than 30%, down to 3%. In addition, a scoring scheme to estimate the precisions of individual predictions is also provided. Finally, tests of the robustness of the framework indicate that our framework can be successfully applied to less studied organisms. 
Conclusion We provide a general two-step function-annotation framework, and show that high coverage, high precision annotations can be achieved by constructing a high-coverage and reliable FLN via data integration followed by applying a maximum weight decision rule.
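A maximum weight decision rule of the kind this abstract describes can be sketched as below: an unannotated protein inherits the functions of the annotated neighbor to which it is linked by the heaviest edge. This is a simplified reading of the abstract, not the authors' code; the edge-list representation, the returned weight, and the protein names and weights are hypothetical.

```python
def annotate_by_max_weight(protein, edges, annotations):
    """Return (functions, weight) taken from the heaviest annotated neighbor.

    edges:        iterable of (protein_a, protein_b, weight), undirected
    annotations:  dict mapping annotated proteins to sets of functions
    """
    best = None  # (neighbor, weight) of the strongest annotated link so far
    for a, b, w in edges:
        if protein not in (a, b):
            continue
        other = b if a == protein else a
        if other in annotations and (best is None or w > best[1]):
            best = (other, w)
    if best is None:
        return set(), 0.0  # no annotated neighbor, so no prediction
    return set(annotations[best[0]]), best[1]


# Hypothetical functional linkage network around an unannotated protein P1.
edges = [
    ("P1", "A", 0.9),
    ("P1", "B", 0.5),
    ("C", "P1", 0.5),
    ("B", "C", 0.8),
]
annotations = {
    "A": {"DNA repair"},
    "B": {"translation"},
    "C": {"translation"},
}
print(annotate_by_max_weight("P1", edges, annotations))
```

In this toy network a rule that summed weights per function would favor "translation" (0.5 + 0.5), whereas the maximum weight rule follows the single strongest link to "DNA repair".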
Affiliation(s)
- Bolan Linghu
- Bioinformatics Graduate Program, Boston University, Boston, MA, 02215, USA.
|
87
|
Lee I, Li Z, Marcotte EM. An improved, bias-reduced probabilistic functional gene network of baker's yeast, Saccharomyces cerevisiae. PLoS One 2007; 2:e988. [PMID: 17912365 PMCID: PMC1991590 DOI: 10.1371/journal.pone.0000988] [Citation(s) in RCA: 162] [Impact Index Per Article: 9.0] [Received: 06/15/2007] [Accepted: 09/10/2007] [Indexed: 11/18/2022]
Abstract
BACKGROUND Probabilistic functional gene networks are powerful theoretical frameworks for integrating heterogeneous functional genomics and proteomics data into objective models of cellular systems. Such networks provide syntheses of millions of discrete experimental observations, spanning DNA microarray experiments, physical protein interactions, genetic interactions, and comparative genomics; the resulting networks can then be easily applied to generate testable hypotheses regarding specific gene functions and associations. METHODOLOGY/PRINCIPAL FINDINGS We report a significantly improved version (v. 2) of a probabilistic functional gene network of the baker's yeast, Saccharomyces cerevisiae. We describe our optimization methods and illustrate their effects in three major areas: the reduction of functional bias in network training reference sets, the application of a probabilistic model for calculating confidences in pair-wise protein physical or genetic interactions, and the introduction of simple thresholds that eliminate many false positive mRNA co-expression relationships. Using the network, we predict and experimentally verify the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. CONCLUSIONS/SIGNIFICANCE YeastNet v. 2, constructed using these optimizations together with additional data, shows significant reduction in bias and improvements in precision and recall, in total covering 102,803 linkages among 5,483 yeast proteins (95% of the validated proteome). YeastNet is available from http://www.yeastnet.org.
Affiliation(s)
- Insuk Lee
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas, United States of America
| | - Zhihua Li
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas, United States of America
| | - Edward M. Marcotte
- Center for Systems and Synthetic Biology, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas, United States of America
- Department of Chemistry and Biochemistry, Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, Texas, United States of America
- * To whom correspondence should be addressed. E-mail:
|
88
|
Shi Y, Klustein M, Simon I, Mitchell T, Bar-Joseph Z. Continuous hidden process model for time series expression experiments. Bioinformatics 2007; 23:i459-67. [PMID: 17646331 DOI: 10.1093/bioinformatics/btm218] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
MOTIVATION When analyzing expression experiments, researchers are often interested in identifying the set of biological processes that are up- or down-regulated under the experimental condition studied. Current approaches, including clustering expression profiles and averaging the expression profiles of genes known to participate in specific processes, fail to provide an accurate estimate of the activity levels of many biological processes. RESULTS We introduce a probabilistic continuous hidden process model (CHPM) for time series expression data. CHPM can simultaneously determine the most probable assignment of genes to processes and the level of activation of these processes over time. To estimate model parameters, CHPM uses multiple time series datasets and incorporates prior biological knowledge. Applying CHPM to yeast expression data, we show that our algorithm produces more accurate functional assignments for genes compared to other expression analysis methods. The inferred process activity levels can be used to study the relationships between biological processes. We also report new biological experiments confirming some of the process activity levels predicted by CHPM. AVAILABILITY A Java implementation is available at http://www.cs.cmu.edu/~yanxins/chpm. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yanxin Shi
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
|
89
|
Nearest Neighbor Networks: clustering expression data based on gene neighborhoods. BMC Bioinformatics 2007; 8:250. [PMID: 17626636 PMCID: PMC1941745 DOI: 10.1186/1471-2105-8-250] [Citation(s) in RCA: 50] [Impact Index Per Article: 2.8] [Received: 11/03/2006] [Accepted: 07/12/2007] [Indexed: 11/23/2022] Open
Abstract
Background: The availability of microarrays measuring thousands of genes simultaneously across hundreds of biological conditions represents an opportunity to understand both individual biological pathways and the integrated workings of the cell. However, translating this amount of data into biological insight remains a daunting task. An important initial step in the analysis of microarray data is clustering of genes with similar behavior. A number of classical techniques are commonly used to perform this task, particularly hierarchical and K-means clustering, and many novel approaches have been suggested recently. While these approaches are useful, they are not without drawbacks; these methods can find clusters in purely random data, and even clusters enriched for biological functions can be skewed towards a small number of processes (e.g. ribosomes). Results: We developed Nearest Neighbor Networks (NNN), a graph-based algorithm to generate clusters of genes with similar expression profiles. This method produces clusters based on overlapping cliques within an interaction network generated from mutual nearest neighborhoods. This focus on nearest neighbors rather than on absolute distance measures allows us to capture clusters with high connectivity even when they are spatially separated, and requiring mutual nearest neighbors allows genes with no sufficiently similar partners to remain unclustered. We compared the clusters generated by NNN with those generated by eight other clustering methods. NNN was particularly successful at generating functionally coherent clusters with high precision, and these clusters generally represented a much broader selection of biological processes than those recovered by other methods. Conclusion: The Nearest Neighbor Networks algorithm is a valuable clustering method that effectively groups genes that are likely to be functionally related. It is particularly attractive due to its simplicity, its success in the analysis of large datasets, and its ability to span a wide range of biological functions with high precision.
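The mutual-nearest-neighbor idea in this abstract can be illustrated with a minimal sketch. This is not the published NNN implementation: the function names, the toy expression profiles, the use of Pearson correlation as the similarity measure, and the choice of k are all assumptions made for illustration, and connected components stand in for the overlapping-clique step that the abstract describes.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def k_nearest(profiles, k):
    """Map each gene to the set of its k most correlated other genes."""
    neighbors = {}
    for gene, profile in profiles.items():
        ranked = sorted(
            (other for other in profiles if other != gene),
            key=lambda other: pearson(profile, profiles[other]),
            reverse=True,
        )
        neighbors[gene] = set(ranked[:k])
    return neighbors

def mutual_neighbor_graph(profiles, k):
    """Keep an edge g-h only if g is among h's neighbors AND h is among g's."""
    nn = k_nearest(profiles, k)
    return {g: {h for h in nn[g] if g in nn[h]} for g in profiles}

def connected_groups(graph):
    """Simplification (assumption): group genes by connected component.
    The abstract describes overlapping cliques instead."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen or not graph[start]:
            continue  # genes without mutual neighbors stay unclustered
        stack, group = [start], set()
        while stack:
            g = stack.pop()
            if g in group:
                continue
            group.add(g)
            stack.extend(graph[g] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical toy profiles: two co-expressed pairs and one gene with no mutual partner.
profiles = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.5],
    "c": [4, 3, 2, 1],
    "d": [8, 6, 4, 2.5],
    "e": [1, 5, 1, 5],
}
graph = mutual_neighbor_graph(profiles, k=1)
groups = connected_groups(graph)
unclustered = sorted(g for g in graph if not graph[g])
print(groups, unclustered)
```

In this toy example gene "e" is closest to "b", but "b" is closest to "a", so the mutuality requirement leaves "e" unclustered rather than forcing it into a group, which is the behavior the abstract highlights.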
|
90
|
Sykacek P, Clarkson R, Print C, Furlong R, Micklem G. Bayesian modelling of shared gene function. Bioinformatics 2007; 23:1936-44. [PMID: 17540682 DOI: 10.1093/bioinformatics/btm280] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: Biological assays are often carried out on tissues that contain many cell lineages and active pathways. Microarray data produced using such material therefore reflect superimpositions of biological processes. Analysing such data for shared gene function by means of well-matched assays may help to provide a better focus on specific cell types and processes. The identification of genes that behave similarly in different biological systems also has the potential to reveal new insights into preserved biological mechanisms. RESULTS: In this article, we propose a hierarchical Bayesian model allowing integrated analysis of several microarray data sets for shared gene function. Each gene is associated with an indicator variable that selects whether binary class labels are predicted from expression values or by a classifier which is common to all genes. Each indicator selects the component models for all involved data sets simultaneously. A quantitative measure of shared gene function is obtained by inferring a probability measure over these indicators. Through experiments on synthetic data, we illustrate potential advantages of this Bayesian approach over a standard method. A shared analysis of matched microarray experiments covering (a) a cycle of mouse mammary gland development and (b) the process of in vitro endothelial cell apoptosis is proposed as a biological gold standard. Several useful sanity checks are introduced during data analysis, and we confirm the prior biological belief that shared apoptosis events occur in both systems. We conclude that a Bayesian analysis for shared gene function has the potential to reveal new biological insights, unobtainable by other means. AVAILABILITY: An online supplement and MATLAB code are available at http://www.sykacek.net/research.html#mcabf
Affiliation(s)
- P Sykacek
- Department of Biotechnology, BOKU University, Vienna, Austria.
|
91
|
Shoemaker BA, Panchenko AR. Deciphering protein-protein interactions. Part II. Computational methods to predict protein and domain interaction partners. PLoS Comput Biol 2007; 3:e43. [PMID: 17465672 PMCID: PMC1857810 DOI: 10.1371/journal.pcbi.0030043] [Citation(s) in RCA: 222] [Impact Index Per Article: 12.3] [Indexed: 12/27/2022] Open
Abstract
Recent advances in high-throughput experimental methods for the identification of protein interactions have resulted in a large amount of diverse data that are somewhat incomplete and contradictory. As valuable as they are, such experimental approaches studying protein interactomes have certain limitations that can be complemented by the computational methods for predicting protein interactions. In this review we describe different approaches to predict protein interaction partners as well as highlight recent achievements in the prediction of specific domains mediating protein-protein interactions. We discuss the applicability of computational methods to different types of prediction problems and point out limitations common to all of them.
|
92
|
Markowetz F, Troyanskaya OG. Computational identification of cellular networks and pathways. Mol Biosyst 2007; 3:478-82. [PMID: 17579773 DOI: 10.1039/b617014p] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/21/2022]
Abstract
In this article we highlight recent developments in computational functional genomics to identify networks of functionally related genes and proteins based on diverse sources of genomic data. Our specific focus is on statistical methods to identify genetic networks. We discuss integrated analysis of microarray datasets, methods to combine heterogeneous data sources, and the analysis of high-dimensional phenotyping screens, and we describe efforts to establish a reliable and unbiased gold standard for method comparison and evaluation.
Affiliation(s)
- Florian Markowetz
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
|