1
Yuan H, Mancuso CA, Johnson K, Braasch I, Krishnan A. Computational strategies for cross-species knowledge transfer and translational biomedicine. ARXIV 2024:arXiv:2408.08503v1. [PMID: 39184546 PMCID: PMC11343225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Research organisms provide invaluable insights into human biology and diseases, serving as essential tools for functional experiments, disease modeling, and drug testing. However, evolutionary divergence between humans and research organisms hinders effective knowledge transfer across species. Here, we review state-of-the-art methods for computationally transferring knowledge across species, primarily focusing on methods that utilize transcriptome data and/or molecular networks. We introduce the term "agnology" to describe the functional equivalence of molecular components regardless of evolutionary origin, as this concept is becoming pervasive in integrative data-driven models where the role of evolutionary origin can become unclear. Our review addresses four key areas of information and knowledge transfer across species: (1) transferring disease and gene annotation knowledge, (2) identifying agnologous molecular components, (3) inferring equivalent perturbed genes or gene sets, and (4) identifying agnologous cell types. We conclude with an outlook on future directions and several key challenges that remain in cross-species knowledge transfer.
Affiliation(s)
- Hao Yuan
  - Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
- Christopher A. Mancuso
  - Department of Biostatistics & Informatics, University of Colorado Anschutz Medical Campus
- Kayla Johnson
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
- Ingo Braasch
  - Department of Integrative Biology; Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
2
Mancuso CA, Johnson KA, Liu R, Krishnan A. Joint representation of molecular networks from multiple species improves gene classification. PLoS Comput Biol 2024; 20:e1011773. [PMID: 38198480 PMCID: PMC10805316 DOI: 10.1371/journal.pcbi.1011773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Revised: 01/23/2024] [Accepted: 12/20/2023] [Indexed: 01/12/2024]
Abstract
Network-based machine learning (ML) has the potential for predicting novel genes associated with nearly any health and disease context. However, this approach often uses network information from only the single species under consideration even though networks for most species are noisy and incomplete. While some recent methods have begun addressing this shortcoming by using networks from more than one species, they lack one or more key desirable properties: handling networks from more than two species simultaneously, incorporating many-to-many orthology information, or generating a network representation that is reusable across different types of and newly-defined prediction tasks. Here, we present GenePlexusZoo, a framework that casts molecular networks from multiple species into a single reusable feature space for network-based ML. We demonstrate that this multi-species network representation improves both gene classification within a single species and knowledge-transfer across species, even in cases where the inter-species correspondence is undetectable based on shared orthologous genes. Thus, GenePlexusZoo enables effectively leveraging the high evolutionary molecular, functional, and phenotypic conservation across species to discover novel genes associated with diverse biological contexts.
Affiliation(s)
- Christopher A. Mancuso
  - Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
- Kayla A. Johnson
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Renming Liu
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, Colorado, United States of America
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
3
Li L, Dannenfelser R, Zhu Y, Hejduk N, Segarra S, Yao V. Joint embedding of biological networks for cross-species functional alignment. Bioinformatics 2023; 39:btad529. [PMID: 37632792 PMCID: PMC10477935 DOI: 10.1093/bioinformatics/btad529] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/23/2022] [Revised: 07/12/2023] [Accepted: 08/24/2023] [Indexed: 08/28/2023]
Abstract
MOTIVATION Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this cross-species transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem. RESULTS We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structure and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA's embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies. AVAILABILITY AND IMPLEMENTATION https://github.com/ylaboratory/ETNA.
Affiliation(s)
- Lechuan Li
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Ruth Dannenfelser
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Yu Zhu
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, United States
- Nathaniel Hejduk
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
- Santiago Segarra
  - Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, United States
- Vicky Yao
  - Department of Computer Science, Rice University, Houston, TX 77005, United States
4
Ding K, Wang S, Luo Y. Supervised biological network alignment with graph neural networks. Bioinformatics 2023; 39:i465-i474. [PMID: 37387160 PMCID: PMC10311300 DOI: 10.1093/bioinformatics/btad241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023]
Abstract
MOTIVATION Despite the advances in sequencing technology, massive proteins with known sequences remain functionally unannotated. Biological network alignment (NA), which aims to find the node correspondence between species' protein-protein interaction (PPI) networks, has been a popular strategy to uncover missing annotations by transferring functional knowledge across species. Traditional NA methods assumed that topologically similar proteins in PPIs are functionally similar. However, it was recently reported that functionally unrelated proteins can be as topologically similar as functionally related pairs, and a new data-driven or supervised NA paradigm has been proposed, which uses protein function data to discern which topological features correspond to functional relatedness. RESULTS Here, we propose GraNA, a deep learning framework for the supervised NA paradigm for the pairwise NA problem. Employing graph neural networks, GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence between across-species proteins. A major strength of GraNA is its flexibility to integrate multi-faceted non-functional relationship data, such as sequence similarity and ortholog relationships, as anchor links to guide the mapping of functionally related proteins across species. Evaluating GraNA on a benchmark dataset composed of several NA tasks between different pairs of species, we observed that GraNA accurately predicted the functional relatedness of proteins and robustly transferred functional annotations across species, outperforming a number of existing NA methods. When applied to a case study on a humanized yeast network, GraNA also successfully discovered functionally replaceable human-yeast protein pairs that were documented in previous studies. AVAILABILITY AND IMPLEMENTATION The code of GraNA is available at https://github.com/luo-group/GraNA.
Affiliation(s)
- Kerr Ding
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Sheng Wang
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA 98195, United States
- Yunan Luo
  - School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
5
Lachmann A, Rizzo KA, Bartal A, Jeon M, Clarke DJB, Ma’ayan A. PrismEXP: gene annotation prediction from stratified gene-gene co-expression matrices. PeerJ 2023; 11:e14927. [PMID: 36874981 PMCID: PMC9979837 DOI: 10.7717/peerj.14927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/06/2022] [Accepted: 01/30/2023] [Indexed: 03/03/2023]
Abstract
Background Gene-gene co-expression correlations measured by mRNA-sequencing (RNA-seq) can be used to predict gene annotations based on the co-variance structure within these data. In our prior work, we showed that uniformly aligned RNA-seq co-expression data from thousands of diverse studies is highly predictive of both gene annotations and protein-protein interactions. However, the performance of the predictions varies depending on whether the gene annotations and interactions are cell type and tissue specific or agnostic. Tissue and cell type-specific gene-gene co-expression data can be useful for making more accurate predictions because many genes perform their functions in unique ways in different cellular contexts. However, identifying the optimal tissues and cell types to partition the global gene-gene co-expression matrix is challenging. Results Here we introduce and validate an approach called PRediction of gene Insights from Stratified Mammalian gene co-EXPression (PrismEXP) for improved gene annotation predictions based on RNA-seq gene-gene co-expression data. Using uniformly aligned data from ARCHS4, we apply PrismEXP to predict a wide variety of gene annotations including pathway membership, Gene Ontology terms, as well as human and mouse phenotypes. Predictions made with PrismEXP outperform predictions made with the global cross-tissue co-expression correlation matrix approach on all tested domains, and training using one annotation domain can be used to predict annotations in other domains. Conclusions By demonstrating the utility of PrismEXP predictions in multiple use cases we show how PrismEXP can be used to enhance unsupervised machine learning methods to better understand the roles of understudied genes and proteins. To make PrismEXP accessible, it is provided via a user-friendly web interface, a Python package, and an Appyter. AVAILABILITY. 
The PrismEXP web-based application, with pre-computed PrismEXP predictions, is available from: https://maayanlab.cloud/prismexp; PrismEXP is also available as an Appyter: https://appyters.maayanlab.cloud/PrismEXP/; and as Python package: https://github.com/maayanlab/prismexp.
Affiliation(s)
- Alexander Lachmann
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
- Kaeli A. Rizzo
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
- Alon Bartal
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
- Minji Jeon
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
- Daniel J. B. Clarke
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
- Avi Ma’ayan
  - Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
6
Mancuso CA, Bills PS, Krum D, Newsted J, Liu R, Krishnan A. GenePlexus: a web-server for gene discovery using network-based machine learning. Nucleic Acids Res 2022; 50:W358-W366. [PMID: 35580053 PMCID: PMC9252732 DOI: 10.1093/nar/gkac335] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/08/2022] [Revised: 04/13/2022] [Accepted: 04/30/2022] [Indexed: 11/28/2022]
Abstract
Biomedical researchers take advantage of high-throughput, high-coverage technologies to routinely generate sets of genes of interest across a wide range of biological conditions. Although these technologies have directly shed light on the molecular underpinnings of various biological processes and diseases, the list of genes from any individual experiment is often noisy and incomplete. Additionally, interpreting these lists of genes can be challenging in terms of how they are related to each other and to other genes in the genome. In this work, we present GenePlexus (https://www.geneplexus.net/), a web-server that allows a researcher to utilize a powerful, network-based machine learning method to gain insights into their gene set of interest and additional functionally similar genes. Once a user uploads their own set of human genes and chooses between a number of different human network representations, GenePlexus provides predictions of how associated every gene in the network is to the input set. The web-server also provides interpretability through network visualization and comparison to other machine learning models trained on thousands of known process/pathway and disease gene sets. GenePlexus is free and open to all users without the need for registration.
Affiliation(s)
- Christopher A Mancuso
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Patrick S Bills
  - Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Douglas Krum
  - Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Jacob Newsted
  - Data Management and Analytics, IT Services, Michigan State University, East Lansing, MI 48824, USA
- Renming Liu
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
7
Joblin-Mills A, Wu Z, Fraser K, Jones B, Yip W, Lim JJ, Lu L, Sequeira I, Poppitt S. The impact of ethnicity and intra-pancreatic fat on the postprandial metabolome response to whey protein in overweight Asian Chinese and European Caucasian women with prediabetes. FRONTIERS IN CLINICAL DIABETES AND HEALTHCARE 2022; 3:980856. [PMID: 36992769 PMCID: PMC10012149 DOI: 10.3389/fcdhc.2022.980856] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/07/2022] [Accepted: 07/27/2022] [Indexed: 03/31/2023]
Abstract
The "Thin on the Outside Fat on the Inside" TOFI_Asia study found Asian Chinese to be more susceptible to Type 2 Diabetes (T2D) compared to European Caucasians matched for gender and body mass index (BMI). This was influenced by degree of visceral adipose deposition and ectopic fat accumulation in key organs, including liver and pancreas, leading to altered fasting plasma glucose, insulin resistance, and differences in plasma lipid and metabolite profiles. It remains unclear how intra-pancreatic fat deposition (IPFD) impacts TOFI phenotype-related T2D risk factors associated with Asian Chinese. Cow's milk whey protein isolate (WPI) is an insulin secretagogue which can suppress hyperglycemia in prediabetes. In this dietary intervention, we used untargeted metabolomics to characterize the postprandial WPI response in 24 overweight women with prediabetes. Participants were classified by ethnicity (Asian Chinese, n=12; European Caucasian, n=12) and IPFD (low IPFD < 4.66%, n=10; high IPFD ≥ 4.66%, n=10). Using a cross-over design participants were randomized to consume three WPI beverages on separate occasions; 0 g (water control), 12.5 g (low protein, LP) and 50 g (high protein, HP), consumed when fasted. An exclusion pipeline for isolating metabolites with temporal (T0-240mins) WPI responses was implemented, and a support vector machine-recursive feature elimination (SVM-RFE) algorithm was used to model relevant metabolites by ethnicity and IPFD classes. Metabolic network analysis identified glycine as a central hub in both ethnicity and IPFD WPI response networks. A depletion of glycine relative to WPI concentration was detected in Chinese and high IPFD participants independent of BMI. Urea cycle metabolites were highly represented among the ethnicity WPI metabolome model, implicating a dysregulation in ammonia and nitrogen metabolism among Chinese participants. 
Uric acid and purine synthesis pathways were enriched within the high IPFD cohort's WPI metabolome response, implicating adipogenesis and insulin resistance pathways. In conclusion, the discrimination of ethnicity from WPI metabolome profiles was a stronger prediction model than IPFD in overweight women with prediabetes. Each models' discriminatory metabolites enriched different metabolic pathways that help to further characterize prediabetes in Asian Chinese women and women with increased IPFD, independently.
Affiliation(s)
- Aidan Joblin-Mills
  - Food Chemistry and Structure Team, Agresearch, Palmerston North, New Zealand
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - *Correspondence: Aidan Joblin-Mills,
- Zhanxuan Wu
  - Food Chemistry and Structure Team, Agresearch, Palmerston North, New Zealand
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - School of Food and Nutrition, Massey University, Palmerston North, New Zealand
- Karl Fraser
  - Food Chemistry and Structure Team, Agresearch, Palmerston North, New Zealand
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
- Beatrix Jones
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Department of Statistics, University of Auckland, Auckland, New Zealand
- Wilson Yip
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Human Nutrition Unit, School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Jia Jiet Lim
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Human Nutrition Unit, School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Louise Lu
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Human Nutrition Unit, School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Ivana Sequeira
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Human Nutrition Unit, School of Biological Sciences, University of Auckland, Auckland, New Zealand
- Sally Poppitt
  - High-Value Nutrition, National Science Challenge, Auckland, New Zealand
  - Human Nutrition Unit, School of Biological Sciences, University of Auckland, Auckland, New Zealand
8
Yu G, Zhou G, Zhang X, Domeniconi C, Guo M. DMIL-IsoFun: predicting isoform function using deep multi-instance learning. Bioinformatics 2021; 37:4818-4825. [PMID: 34282449 DOI: 10.1093/bioinformatics/btab532] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/13/2021] [Revised: 06/20/2021] [Accepted: 07/16/2021] [Indexed: 11/14/2022]
Abstract
MOTIVATION Alternative splicing creates considerable proteomic diversity and complexity from a relatively limited genome. Proteoforms translated from alternatively spliced isoforms of a gene actually execute the biological functions of this gene, and so reflect the functional knowledge of genes at a finer level of granularity. Recently, some computational approaches have been proposed to differentiate isoform functions using sequence and expression data. However, their performance is far from desirable, mainly due to the imbalance and lack of annotations at the isoform level, and the difficulty of modeling gene-isoform relations. RESULTS We propose a deep multi-instance learning based framework (DMIL-IsoFun) to differentiate the functions of isoforms. DMIL-IsoFun first introduces a multi-instance learning convolution neural network trained with isoform sequences and gene-level annotations to extract the feature vectors and initialize the annotations of isoforms, and then uses a class-imbalance Graph Convolution Network to refine the annotations of individual isoforms based on the isoform co-expression network and extracted features. Extensive experimental results show that DMIL-IsoFun improves the Smin and Fmax of state-of-the-art solutions by at least 29.6% and 40.8%. The effectiveness of DMIL-IsoFun is further confirmed on a testbed of human multiple-isoform genes, and maize isoforms related to photosynthesis. AVAILABILITY The code and data are available at http://www.sdu-idea.cn/codes.php?name=DMIL-Isofun. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Guoxian Yu
  - School of Software, Shandong University, Jinan, 250101, China
  - College of Computer and Information Sciences, Southwest University, Chongqing, 400715, China
  - Computer, Electrical, and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, SA
- Guangjie Zhou
  - School of Software, Shandong University, Jinan, 250101, China
  - College of Computer and Information Sciences, Southwest University, Chongqing, 400715, China
- Xiangliang Zhang
  - Computer, Electrical, and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, SA
- Carlotta Domeniconi
  - Department of Computer Science, George Mason University, Fairfax, 22030, USA
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
9
Zhao Y, Wang J, Guo M, Zhang X, Yu G. Cross-Species Protein Function Prediction with Asynchronous-Random Walk. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1439-1450. [PMID: 31562099 DOI: 10.1109/tcbb.2019.2943342] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/10/2023]
Abstract
Protein function prediction is a fundamental task in the post-genomic era. Available functional annotations of proteins are incomplete and the annotations of two homologous species are complementary to each other. However, how to effectively leverage mutually complementary annotations of different species to further boost the prediction performance is still not well studied. In this paper, we propose a cross-species protein function prediction approach by performing Asynchronous Random Walk on a heterogeneous network (AsyRW). AsyRW first constructs a heterogeneous network to integrate multiple functional association networks derived from different biological data, established homology relationships between proteins from different species, known annotations of proteins and Gene Ontology (GO). To account for the intrinsic structures of intra- and inter-species of proteins and that of GO, AsyRW quantifies the individual walk lengths of each network node using the gravity-like theory, and then performs asynchronous-random walk with the individual length to predict associations between proteins and GO terms. Experiments on annotations archived in different years show that individual walk length and asynchronous-random walk can effectively leverage the complementary annotations of different species, and that AsyRW performs significantly better than other related and competitive methods. The code of AsyRW is available at: http://mlda.swu.edu.cn/codes.php?name=AsyRW.
10
Karbalayghareh A, Qian X, Dougherty ER. Optimal Bayesian Transfer Learning for Count Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:644-655. [PMID: 31180899 DOI: 10.1109/tcbb.2019.2920981] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/09/2023]
Abstract
There is often a limited amount of omics data to design predictive models in biomedicine. Knowing that these omics data come from underlying processes that may share common pathways and disease mechanisms, it may be beneficial for designing a more accurate and reliable predictor in a target domain of interest, where there is a lack of labeled data to leverage available data in relevant source domains. Here, we focus on developing Bayesian transfer learning methods for analyzing next-generation sequencing (NGS) data to help improve predictions in the target domain. We formulate transfer learning in a fully Bayesian framework and define the relatedness by a joint prior distribution of the model parameters of the source and target domains. Defining joint priors acts as a bridge across domains, through which the related knowledge of source data is transferred to the target domain. We focus on RNA-seq discrete count data, which are often overdispersed. To appropriately model them, we consider the Negative Binomial model and propose an Optimal Bayesian Transfer Learning (OBTL) classifier that minimizes the expected classification error in the target domain. We evaluate the performance of the OBTL classifier via both synthetic and cancer data from The Cancer Genome Atlas (TCGA).
11
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. Bioinformatics 2020; 36:3457-3465. [PMID: 32129827 PMCID: PMC7267831 DOI: 10.1093/bioinformatics/btaa150] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 12/22/2022]
Abstract
Background Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. Results In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene’s full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation’s appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. Availability and implementation The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available.
Contact arjun@msu.edu Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Renming Liu
  - Department of Computational Mathematics, Science and Engineering
- Kayla A Johnson
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
- Arjun Krishnan
  - Department of Computational Mathematics, Science and Engineering
  - Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
  - To whom correspondence should be addressed.
12
Lack of a site-specific phosphorylation of Presenilin 1 disrupts microglial gene networks and progenitors during development. PLoS One 2020; 15:e0237773. [PMID: 32822378 PMCID: PMC7444478 DOI: 10.1371/journal.pone.0237773] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/19/2020] [Accepted: 08/03/2020] [Indexed: 12/27/2022]
Abstract
Microglial cells play a key role in brain homeostasis from development to adulthood. Here we show the involvement of a site-specific phosphorylation of Presenilin 1 (PS1) in microglial development. Profiles of microglia-specific transcripts in different temporal stages of development, combined with multiple systematic transcriptomic analysis and quantitative determination of microglia progenitors, indicate that the phosphorylation of PS1 at serine 367 is involved in the temporal dynamics of microglial development, specifically in the developing brain rudiment during embryonic microgliogenesis. We constructed a developing brain-specific microglial network to identify transcription factors linked to PS1 during development. Our data showed that PS1 functional connections appear through interaction hubs at Pu.1, Irf8 and Rela-p65 transcription factors. Finally, we showed that the total number of microglia progenitors was markedly reduced in the developing brain rudiment of embryos lacking PS1 phosphorylation compared to WT. Our work identifies a novel role for PS1 in microglial development.
|
13
|
Selective Neuronal Vulnerability in Alzheimer's Disease: A Network-Based Analysis. Neuron 2020; 107:821-835.e12. [PMID: 32603655 DOI: 10.1016/j.neuron.2020.06.010] [Citation(s) in RCA: 115] [Impact Index Per Article: 23.0] [Received: 08/13/2019] [Revised: 04/23/2020] [Accepted: 06/05/2020] [Indexed: 12/17/2022]
Abstract
A major obstacle to treating Alzheimer's disease (AD) is our lack of understanding of the molecular mechanisms underlying selective neuronal vulnerability, a key characteristic of the disease. Here, we present a framework integrating high-quality neuron-type-specific molecular profiles across the lifetime of the healthy mouse, which we generated using bacTRAP, with postmortem human functional genomics and quantitative genetics data. We demonstrate human-mouse conservation of cellular taxonomy at the molecular level for neurons vulnerable and resistant in AD, identify specific genes and pathways associated with AD neuropathology, and pinpoint a specific functional gene module underlying selective vulnerability, enriched in processes associated with axonal remodeling, and affected by amyloid accumulation and aging. We have made all cell-type-specific profiles and functional networks available at http://alz.princeton.edu. Overall, our study provides a molecular framework for understanding the complex interplay between Aβ, aging, and neurodegeneration within the most vulnerable neurons in AD.
|
14
|
Liu R, Mancuso CA, Yannakopoulos A, Johnson KA, Krishnan A. Supervised learning is an accurate method for network-based gene classification. BIOINFORMATICS (OXFORD, ENGLAND) 2020; 36:3457-3465. [PMID: 32129827 DOI: 10.1101/721423] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 08/26/2019] [Revised: 12/01/2019] [Accepted: 02/27/2020] [Indexed: 05/26/2023]
Abstract
BACKGROUND: Assigning every human gene to specific functions, diseases and traits is a grand challenge in modern genetics. Key to addressing this challenge are computational methods, such as supervised learning and label propagation, that can leverage molecular interaction networks to predict gene attributes. In spite of being a popular machine-learning technique across fields, supervised learning has been applied only in a few network-based studies for predicting pathway-, phenotype- or disease-associated genes. It is unknown how supervised learning broadly performs across different networks and diverse gene classification tasks, and how it compares to label propagation, the widely benchmarked canonical approach for this problem. RESULTS: In this study, we present a comprehensive benchmarking of supervised learning for network-based gene classification, evaluating this approach and a classic label propagation technique on hundreds of diverse prediction tasks and multiple networks using stringent evaluation schemes. We demonstrate that supervised learning on a gene's full network connectivity outperforms label propagation and achieves high prediction accuracy by efficiently capturing local network properties, rivaling label propagation's appeal for naturally using network topology. We further show that supervised learning on the full network is also superior to learning on node embeddings (derived using node2vec), an increasingly popular approach for concisely representing network connectivity. These results show that supervised learning is an accurate approach for prioritizing genes associated with diverse functions, diseases and traits and should be considered a staple of network-based gene classification workflows. AVAILABILITY AND IMPLEMENTATION: The datasets and the code used to reproduce the results and add new gene classification methods have been made freely available. CONTACT: arjun@msu.edu.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Renming Liu
- Department of Computational Mathematics, Science and Engineering
| | | | | | - Kayla A Johnson
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| | - Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| |
|
15
|
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Science, Southwest University, Chongqing, China
| | - Jian Chen
- State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
| | - Xiangliang Zhang
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Maozu Guo
- School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
| | - Guoxian Yu
- College of Computer and Information Science, Southwest University, Chongqing, China
- CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
|
16
|
Zhou J, Schor IE, Yao V, Theesfeld CL, Marco-Ferreres R, Tadych A, Furlong EEM, Troyanskaya OG. Accurate genome-wide predictions of spatio-temporal gene expression during embryonic development. PLoS Genet 2019; 15:e1008382. [PMID: 31553718 PMCID: PMC6779412 DOI: 10.1371/journal.pgen.1008382] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/30/2018] [Revised: 10/07/2019] [Accepted: 08/22/2019] [Indexed: 11/18/2022] Open
Abstract
Comprehensive information on the timing and location of gene expression is fundamental to our understanding of embryonic development and tissue formation. While high-throughput in situ hybridization projects provide invaluable information about developmental gene expression patterns for model organisms like Drosophila, the output of these experiments is primarily qualitative, and a high proportion of protein coding genes and most non-coding genes lack any annotation. Accurate data-centric predictions of spatio-temporal gene expression will therefore complement current in situ hybridization efforts. Here, we applied a machine learning approach by training models on all public gene expression and chromatin data, even from whole-organism experiments, to provide genome-wide, quantitative spatio-temporal predictions for all genes. We developed structured in silico nano-dissection, a computational approach that predicts gene expression in >200 tissue-developmental stages. The algorithm integrates expression signals from a compendium of 6,378 genome-wide expression and chromatin profiling experiments in a cell lineage-aware fashion. We systematically evaluated our performance via cross-validation and experimentally confirmed 22 new predictions for four different embryonic tissues. The model also predicts complex, multi-tissue expression and developmental regulation with high accuracy. We further show the potential of applying these genome-wide predictions to extract tissue specificity signals from non-tissue-dissected experiments, and to prioritize tissues and stages for disease modeling. This resource, together with the exploratory tools are freely available at our webserver http://find.princeton.edu, which provides a valuable tool for a range of applications, from predicting spatio-temporal expression patterns to recognizing tissue signatures from differential gene expression profiles.
Affiliation(s)
- Jian Zhou
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, New Jersey, United States of America
- Center for Computational Biology, Flatiron Institute, New York, New York, United States of America
| | - Ignacio E. Schor
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | - Victoria Yao
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
| | - Chandra L. Theesfeld
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Raquel Marco-Ferreres
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
| | - Alicja Tadych
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
| | - Eileen E. M. Furlong
- Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
- * E-mail: (EEMF); (OGT)
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Center for Computational Biology, Flatiron Institute, New York, New York, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (EEMF); (OGT)
| |
|
17
|
Proost S, Mutwil M. CoNekT: an open-source framework for comparative genomic and transcriptomic network analyses. Nucleic Acids Res 2019; 46:W133-W140. [PMID: 29718322 PMCID: PMC6030989 DOI: 10.1093/nar/gky336] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 01/27/2018] [Accepted: 04/18/2018] [Indexed: 12/22/2022] Open
Abstract
The recent accumulation of gene expression data in the form of RNA sequencing creates unprecedented opportunities to study gene regulation and function. Furthermore, comparative analysis of the expression data from multiple species can elucidate which functional gene modules are conserved across species, allowing the study of the evolution of these modules. However, performing such comparative analyses on raw data is not feasible for many biologists. Here, we present CoNekT (Co-expression Network Toolkit), an open source web server, that contains user-friendly tools and interactive visualizations for comparative analyses of gene expression data and co-expression networks. These tools allow analysis and cross-species comparison of (i) gene expression profiles; (ii) co-expression networks; (iii) co-expressed clusters involved in specific biological processes; (iv) tissue-specific gene expression; and (v) expression profiles of gene families. To demonstrate these features, we constructed CoNekT-Plants for green alga, seed plants and flowering plants (Picea abies, Chlamydomonas reinhardtii, Vitis vinifera, Arabidopsis thaliana, Oryza sativa, Zea mays and Solanum lycopersicum) and thus provide a web-tool with the broadest available collection of plant phyla. CoNekT-Plants is freely available from http://conekt.plant.tools, while the CoNekT source code and documentation can be found at https://github.molgen.mpg.de/proost/CoNekT/.
Affiliation(s)
- Sebastian Proost
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Marek Mutwil
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany.,School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
| |
|
18
|
Wong AK, Krishnan A, Troyanskaya OG. GIANT 2.0: genome-scale integrated analysis of gene networks in tissues. Nucleic Acids Res 2019; 46:W65-W70. [PMID: 29800226 PMCID: PMC6030827 DOI: 10.1093/nar/gky408] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 02/23/2018] [Accepted: 05/07/2018] [Indexed: 01/09/2023] Open
Abstract
GIANT2 (Genome-wide Integrated Analysis of gene Networks in Tissues) is an interactive web server that enables biomedical researchers to analyze their proteins and pathways of interest and generate hypotheses in the context of genome-scale functional maps of human tissues. The precise actions of genes are frequently dependent on their tissue context, yet direct assay of tissue-specific protein function and interactions remains infeasible in many normal human tissues and cell-types. With GIANT2, researchers can explore predicted tissue-specific functional roles of genes and reveal changes in those roles across tissues, all through interactive multi-network visualizations and analyses. Additionally, the NetWAS approach available through the server uses tissue-specific/cell-type networks predicted by GIANT2 to re-prioritize statistical associations from GWAS studies and identify disease-associated genes. GIANT2 predicts tissue-specific interactions by integrating diverse functional genomics data from now over 61 400 experiments for 283 diverse tissues and cell-types. GIANT2 does not require any registration or installation and is freely available for use at http://giant-v2.princeton.edu.
Affiliation(s)
- Aaron K Wong
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY 10010, USA
| | - Arjun Krishnan
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.,Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824, USA
| | - Olga G Troyanskaya
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY 10010, USA.,Department of Computer Science, Princeton University, Princeton, NJ 08544, USA.,Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| |
|
19
|
Ferrari C, Proost S, Ruprecht C, Mutwil M. PhytoNet: comparative co-expression network analyses across phytoplankton and land plants. Nucleic Acids Res 2019; 46:W76-W83. [PMID: 29718316 PMCID: PMC6030924 DOI: 10.1093/nar/gky298] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 01/26/2018] [Accepted: 04/11/2018] [Indexed: 11/15/2022] Open
Abstract
Phytoplankton consists of autotrophic, photosynthesizing microorganisms that are a crucial component of freshwater and ocean ecosystems. However, despite being the major primary producers of organic compounds, accounting for half of the photosynthetic activity worldwide and serving as the entry point to the food chain, functions of most of the genes of the model phytoplankton organisms remain unknown. To remedy this, we have gathered publicly available expression data for one chlorophyte, one rhodophyte, one haptophyte, two heterokonts and four cyanobacteria and integrated it into our PlaNet (Plant Networks) database, which now allows mining gene expression profiles and identification of co-expressed genes of 19 species. We exemplify how the co-expressed gene networks can be used to reveal functionally related genes and how the comparative features of PhytoNet allow detection of conserved transcriptional programs between cyanobacteria, green algae, and land plants. Additionally, we illustrate how the database allows detection of duplicated transcriptional programs within an organism, as exemplified by two putative DNA repair programs within Chlamydomonas reinhardtii. PhytoNet is available from www.gene2function.de.
Affiliation(s)
- Camilla Ferrari
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Sebastian Proost
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Colin Ruprecht
- Max-Planck Institute of Colloids and Interfaces, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Marek Mutwil
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany.,School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore 637551, Singapore
| |
|
20
|
Guala D, Ogris C, Müller N, Sonnhammer ELL. Genome-wide functional association networks: background, data & state-of-the-art resources. Brief Bioinform 2019; 21:1224-1237. [PMID: 31281921 PMCID: PMC7373183 DOI: 10.1093/bib/bbz064] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 02/05/2019] [Revised: 04/29/2019] [Accepted: 05/04/2019] [Indexed: 02/06/2023] Open
Abstract
The vast amount of experimental data from recent advances in the field of high-throughput biology begs for integration into more complex data structures such as genome-wide functional association networks. Such networks have been used for elucidation of the interplay of intra-cellular molecules to make advances ranging from the basic science understanding of evolutionary processes to the more translational field of precision medicine. The allure of the field has resulted in rapid growth of the number of available network resources, each with unique attributes exploitable to answer different biological questions. Unfortunately, the high volume of network resources makes it impossible for the intended user to select an appropriate tool for their particular research question. The aim of this paper is to provide an overview of the underlying data and representative network resources as well as to mention methods of integration, allowing a customized approach to resource selection. Additionally, this report will provide a primer for researchers venturing into the field of network integration.
Affiliation(s)
- Dimitri Guala
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| | - Christoph Ogris
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Nikola Müller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Center Munich, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
| | - Erik L L Sonnhammer
- Science for Life Laboratory, Stockholm Bioinformatics Center, Department of Biochemistry and Biophysics, Stockholm University, Box 1031, 17121 Solna, Sweden
| |
|
21
|
Lee YS, Wong AK, Tadych A, Hartmann BM, Park CY, DeJesus VA, Ramos I, Zaslavsky E, Sealfon SC, Troyanskaya OG. Interpretation of an individual functional genomics experiment guided by massive public data. Nat Methods 2018; 15:1049-1052. [PMID: 30478325 PMCID: PMC6941785 DOI: 10.1038/s41592-018-0218-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 03/26/2018] [Accepted: 09/27/2018] [Indexed: 12/11/2022]
Abstract
A key unmet challenge in interpreting omics experiments is inferring biological meaning in the context of public functional genomics data. We developed a computational framework, Your Evidence Tailored Integration (YETI; http://yeti.princeton.edu/ ), which creates specialized functional interaction maps from large public datasets relevant to an individual omics experiment. Using this tailored integration, we predicted and experimentally confirmed an unexpected divergence in viral replication after seasonal or pandemic human influenza virus infection.
Affiliation(s)
- Young-suk Lee
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Department of Computer Science, Princeton University, Princeton, NJ, USA
- Present address: School of Biological Sciences, Seoul National University, Seoul, Korea
| | - Aaron K. Wong
- Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Alicja Tadych
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Boris M. Hartmann
- Department of Neurology and Center for Advanced Research on Diagnostic Assays, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | | | - Veronica A. DeJesus
- Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Irene Ramos
- Department of Microbiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Elena Zaslavsky
- Department of Neurology and Center for Advanced Research on Diagnostic Assays, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Stuart C. Sealfon
- Department of Neurology and Center for Advanced Research on Diagnostic Assays, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Department of Computer Science, Princeton University, Princeton, NJ, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA
| |
|
22
|
Yao V, Kaletsky R, Keyes W, Mor DE, Wong AK, Sohrabi S, Murphy CT, Troyanskaya OG. An integrative tissue-network approach to identify and test human disease genes. Nat Biotechnol 2018; 36:nbt.4246. [PMID: 30346941 PMCID: PMC7021177 DOI: 10.1038/nbt.4246] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 01/10/2018] [Accepted: 08/08/2018] [Indexed: 01/09/2023]
Abstract
Effective discovery of causal disease genes must overcome the statistical challenges of quantitative genetics studies and the practical limitations of human biology experiments. Here we developed diseaseQUEST, an integrative approach that combines data from human genome-wide disease studies with in silico network models of tissue- and cell-type-specific function in model organisms to prioritize candidates within functionally conserved processes and pathways. We used diseaseQUEST to predict candidate genes for 25 different diseases and traits, including cancer, longevity, and neurodegenerative diseases. Focusing on Parkinson's disease (PD), a diseaseQUEST-directed Caenorhabditis elegans behavioral screen identified several candidate genes, which we experimentally verified and found to be associated with age-dependent motility defects mirroring PD clinical symptoms. Furthermore, knockdown of the top candidate gene, bcat-1, encoding a branched chain amino acid transferase, caused spasm-like 'curling' and neurodegeneration in C. elegans, paralleling decreased BCAT1 expression in PD patient brains. diseaseQUEST is modular and generalizable to other model organisms and human diseases of interest.
Affiliation(s)
- Victoria Yao
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
| | - Rachel Kaletsky
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - William Keyes
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - Danielle E Mor
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - Aaron K Wong
- Flatiron Institute, Simons Foundation, New York, New York, USA
| | - Salman Sohrabi
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - Coleen T Murphy
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, USA
| | - Olga G Troyanskaya
- Department of Computer Science, Princeton University, Princeton, New Jersey, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, New York, USA
| |
|
23
|
Enabling Precision Medicine through Integrative Network Models. J Mol Biol 2018; 430:2913-2923. [DOI: 10.1016/j.jmb.2018.07.004] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 04/01/2018] [Revised: 06/15/2018] [Accepted: 07/03/2018] [Indexed: 11/17/2022]
|
24
|
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15:20170387. [PMID: 29618526 PMCID: PMC5938574 DOI: 10.1098/rsif.2017.0387] [Citation(s) in RCA: 905] [Impact Index Per Article: 129.3] [Received: 05/26/2017] [Accepted: 03/07/2018] [Indexed: 11/12/2022] Open
Abstract
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Affiliation(s)
- Travers Ching
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA
| | - Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Alexandr A Kalinin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
| | | | - Gregory P Way
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Enrico Ferrero
- Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, UK
| | | | - Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Michael M Hoffman
- Princess Margaret Cancer Centre, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | - Wei Xie
- Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
| | - Gail L Rosen
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Benjamin J Lengerich
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Johnny Israeli
- Biophysics Program, Stanford University, Stanford, CA, USA
| | - Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Stephen Woloszynek
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Anne E Carpenter
- Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL, USA
| | - Evan M Cofer
- Department of Computer Science, Trinity University, San Antonio, TX, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Christopher A Lavender
- Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA
| | - Srinivas C Turaga
- Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA, USA
| | - Amr M Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David J Harris
- Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL, USA
| | | | - Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Yifan Peng
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Laura K Wiley
- Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
| | - Marwin H S Segler
- Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
| | - Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
| | - S Joshua Swamidass
- Department of Pathology and Immunology, Washington University in Saint Louis, St Louis, MO, USA
| | - Austin Huang
- Department of Medicine, Brown University, Providence, RI, USA
| | - Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
| | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
|
25
|
Sibout R, Proost S, Hansen BO, Vaid N, Giorgi FM, Ho-Yue-Kuang S, Legée F, Cézart L, Bouchabké-Coussa O, Soulhat C, Provart N, Pasha A, Le Bris P, Roujol D, Hofte H, Jamet E, Lapierre C, Persson S, Mutwil M. Expression atlas and comparative coexpression network analyses reveal important genes involved in the formation of lignified cell wall in Brachypodium distachyon. THE NEW PHYTOLOGIST 2017; 215:1009-1025. [PMID: 28617955 DOI: 10.1111/nph.14635] [Citation(s) in RCA: 30] [Impact Index Per Article: 3.8] [Received: 02/28/2017] [Accepted: 04/26/2017] [Indexed: 05/08/2023]
Abstract
While Brachypodium distachyon (Brachypodium) is an emerging model for grasses, no expression atlas or gene coexpression network is available. Such tools are of high importance to provide insights into the function of Brachypodium genes. We present a detailed Brachypodium expression atlas, capturing gene expression in its major organs at different developmental stages. The data were integrated into a large-scale coexpression database (www.gene2function.de), enabling identification of duplicated pathways and conserved processes across 10 plant species, thus allowing genome-wide inference of gene function. We highlight the importance of the atlas and the platform through the identification of duplicated cell wall modules, and show that a lignin biosynthesis module is conserved across angiosperms. We identified and functionally characterised a putative ferulate 5-hydroxylase gene through overexpression of it in Brachypodium, which resulted in an increase in lignin syringyl units and reduced lignin content of mature stems, and led to improved saccharification of the stem biomass. Our Brachypodium expression atlas thus provides a powerful resource to reveal functionally related genes, which may advance our understanding of important biological processes in grasses.
Affiliation(s)
- Richard Sibout
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Sebastian Proost
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
| | - Bjoern Oest Hansen
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
| | - Neha Vaid
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
| | - Federico M Giorgi
- Cancer Research UK, Cambridge Institute, Robinson Way, Cambridge, CB2 0RE, UK
| | - Severine Ho-Yue-Kuang
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Frédéric Legée
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Laurent Cézart
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Oumaya Bouchabké-Coussa
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Camille Soulhat
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Nicholas Provart
- Department of Cell and Systems Biology, Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks St., Toronto, ON, M5S 3B2, Canada
| | - Asher Pasha
- Department of Cell and Systems Biology, Centre for the Analysis of Genome Evolution and Function, University of Toronto, 25 Willcocks St., Toronto, ON, M5S 3B2, Canada
| | - Philippe Le Bris
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - David Roujol
- Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Castanet-Tolosan, France
| | - Herman Hofte
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Elisabeth Jamet
- Laboratoire de Recherche en Sciences Végétales, Université de Toulouse, CNRS, UPS, Castanet-Tolosan, France
| | - Catherine Lapierre
- Institut Jean-Pierre Bourgin, UMR 1318, INRA, AgroParisTech, CNRS, Université Paris-Saclay, RD10, Versailles Cedex, 78026, France
| | - Staffan Persson
- School of Biosciences, University of Melbourne, Parkville, Vic., 3010, Australia
| | - Marek Mutwil
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, Potsdam, 14476, Germany
|
26
|
Ruprecht C, Proost S, Hernandez-Coronado M, Ortiz-Ramirez C, Lang D, Rensing SA, Becker JD, Vandepoele K, Mutwil M. Phylogenomic analysis of gene co-expression networks reveals the evolution of functional modules. THE PLANT JOURNAL: FOR CELL AND MOLECULAR BIOLOGY 2017; 90:447-465. [PMID: 28161902 DOI: 10.1111/tpj.13502] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 10/19/2016] [Revised: 01/05/2017] [Accepted: 01/25/2017] [Indexed: 05/08/2023]
Abstract
Molecular evolutionary studies correlate genomic and phylogenetic information with the emergence of new traits of organisms. These traits are, however, the consequence of dynamic gene networks composed of functional modules, which might not be captured by genomic analyses. Here, we established a method that combines large-scale genomic and phylogenetic data with gene co-expression networks to extensively study the evolutionary make-up of modules in the moss Physcomitrella patens, and in the angiosperms Arabidopsis thaliana and Oryza sativa (rice). We first show that younger genes are less annotated than older genes. By mapping genomic data onto the co-expression networks, we found that genes from the same evolutionary period tend to be connected, whereas old and young genes tend to be disconnected. Consequently, the analysis revealed modules that emerged at a specific time in plant evolution. To uncover the evolutionary relationships of the modules that are conserved across the plant kingdom, we added phylogenetic information that revealed duplication and speciation events on the module level. This combined analysis revealed an independent duplication of cell wall modules in bryophytes and angiosperms, suggesting a parallel evolution of cell wall pathways in land plants. We provide an online tool allowing plant researchers to perform these analyses at http://www.gene2function.de.
Affiliation(s)
- Colin Ruprecht
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, 14476, Potsdam, Germany
| | - Sebastian Proost
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, 14476, Potsdam, Germany
| | | | - Carlos Ortiz-Ramirez
- Instituto Gulbenkian de Ciência, Rua da Quinta Grande 6, 2780-156, Oeiras, Portugal
| | - Daniel Lang
- University of Freiburg, Schänzlestr. 1, D-79104, Freiburg, Germany
| | - Stefan A Rensing
- University of Marburg, Karl-von-Frisch-Str. 8, D-35043, Marburg, Germany
| | - Jörg D Becker
- Instituto Gulbenkian de Ciência, Rua da Quinta Grande 6, 2780-156, Oeiras, Portugal
| | - Klaas Vandepoele
- Department of Plant Systems Biology VIB, Department of Plant Biotechnology and Bioinformatics Ghent University, Technologiepark 927, B-9052, Gent, Belgium
| | - Marek Mutwil
- Max Planck Institute of Molecular Plant Physiology, Am Muehlenberg 1, 14476, Potsdam, Germany
|
27
|
Ruprecht C, Vaid N, Proost S, Persson S, Mutwil M. Beyond Genomics: Studying Evolution with Gene Coexpression Networks. TRENDS IN PLANT SCIENCE 2017; 22:298-307. [PMID: 28126286 DOI: 10.1016/j.tplants.2016.12.011] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.9] [Received: 09/02/2016] [Revised: 12/06/2016] [Accepted: 12/22/2016] [Indexed: 05/08/2023]
Abstract
Understanding how genomes change as organisms become more complex is a central question in evolution. Molecular evolutionary studies typically correlate the appearance of genes and gene families with the emergence of biological pathways and morphological features. While such approaches are of great importance to understand how organisms evolve, they are also limited, as functionally related genes work together in contexts of dynamic gene networks. Since functionally related genes are often transcriptionally coregulated, gene coexpression networks present a resource to study the evolution of biological pathways. In this opinion article, we discuss recent developments in this field and how coexpression analyses can be merged with existing genomic approaches to transfer functional knowledge between species to study the appearance or extension of pathways.
Affiliation(s)
- Colin Ruprecht
- Max-Planck Institute of Colloids and Interfaces, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Neha Vaid
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Sebastian Proost
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany
| | - Staffan Persson
- School of BioSciences, University of Melbourne, Parkville, VIC 3010, Australia; ARC Centre of Excellence in Plant Cell Walls, School of Biosciences, University of Melbourne, Parkville, VIC 3010, Australia
| | - Marek Mutwil
- Max-Planck Institute for Molecular Plant Physiology, Am Muehlenberg 1, 14476 Potsdam, Germany.
|
28
|
Abstract
Functional relations between genes can be represented as networks. These networks have been successfully used to infer gene function and to mediate transfer of functional knowledge between species. Transcriptionally coordinated or co-expressed genes tend to be functionally related, which, combined with the availability of transcriptomic data for multiple plant species, makes co-expression networks a useful resource for the plant community. In this chapter, we describe PlaNet (www.gene2function.de), a database that includes comparative analyses for co-expression networks of 11 plant species. We exemplify how the tools included in PlaNet can be used to predict gene function, transfer knowledge, and discover conserved and multiplied gene modules.
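The idea that co-expressed genes tend to be functionally related can be illustrated with a minimal sketch. This is not PlaNet's method: it assumes plain Pearson correlation with a fixed cutoff, and the function name `coexpression_edges`, the threshold of 0.8, and the toy expression values are all hypothetical choices made for illustration.

```python
import numpy as np

def coexpression_edges(expr, genes, threshold=0.8):
    """Return (gene_a, gene_b, r) for gene pairs whose Pearson r >= threshold.

    expr: 2-D array-like, one row per gene, one column per sample.
    genes: list of gene identifiers, one per row of expr.
    """
    expr = np.asarray(expr, dtype=float)
    if expr.shape[0] != len(genes):
        raise ValueError("expr must have one row per gene")
    corr = np.corrcoef(expr)  # rows are treated as variables (genes)
    edges = []
    for i in range(len(genes)):
        for j in range(i + 1, len(genes)):
            if corr[i, j] >= threshold:
                edges.append((genes[i], genes[j], float(corr[i, j])))
    return edges

# Toy data (hypothetical): geneA and geneB rise together, geneC falls.
toy_expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
toy_genes = ["geneA", "geneB", "geneC"]
toy_edges = coexpression_edges(toy_expr, toy_genes)
```

In this toy case only the geneA/geneB pair passes the cutoff, so a gene of unknown function linked to annotated neighbours in such a network becomes a candidate for sharing their function. Published resources may use rank-based or other similarity measures instead of a raw correlation cutoff.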
Affiliation(s)
- Sebastian Proost
- Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany
| | - Marek Mutwil
- Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476, Potsdam-Golm, Germany.
|
29
|
Krishnan A, Taroni JN, Greene CS. Integrative Networks Illuminate Biological Factors Underlying Gene–Disease Associations. CURRENT GENETIC MEDICINE REPORTS 2016. [DOI: 10.1007/s40142-016-0102-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 12/17/2022]
|
30
|
Abstract
“Big Data” has surpassed “systems biology” and “omics” as the hottest buzzword in the biological sciences, but is there any substance behind the hype? Certainly, we have learned about various aspects of cell and molecular biology from the many individual high-throughput data sets that have been published in the past 15–20 years. These data, although useful as individual data sets, can provide much more knowledge when interrogated with Big Data approaches, such as applying integrative methods that leverage the heterogeneous data compendia in their entirety. Here we discuss the benefits and challenges of such Big Data approaches in biology and how cell and molecular biologists can best take advantage of them.
Affiliation(s)
- Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540
| | - Olga G Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540; Department of Computer Science, Princeton University, Princeton, NJ 08540; Simons Center for Data Analysis, Simons Foundation, New York, NY 10010
|
31
|
Guan Y, Martini S, Mariani LH. Genes Caught In Flagranti: Integrating Renal Transcriptional Profiles With Genotypes and Phenotypes. Semin Nephrol 2016. [PMID: 26215861 DOI: 10.1016/j.semnephrol.2015.04.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/27/2022]
Abstract
In the past decade, population genetics has had tremendous success in identifying genetic variations that are statistically relevant to renal diseases and kidney function. However, it is challenging to interpret the functional relevance of the genetic variations found by population genetics studies. In this review, we discuss studies that integrate multiple levels of data, especially transcriptome profiles and phenotype data, to assign functional roles to genetic variations involved in kidney function. Furthermore, we introduce state-of-the-art machine learning algorithms (Bayesian networks, support vector machines, and Gaussian process regression) that have been applied successfully to integrating genetic, regulatory, and clinical information to predict clinical outcomes. These methods are likely to be deployed successfully in the nephrology field in the near future.
Affiliation(s)
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI; Department of Internal Medicine, University of Michigan, Ann Arbor, MI; Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI
| | - Sebastian Martini
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI; Nephrologisches Zentrum, Medizinische Klinik und Poliklinik IV, Klinikum der Universität München, Ludwig-Maximilians-University Munich, Munich, Germany
| | - Laura H Mariani
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI
|
32
|
Abstract
The laboratory mouse is the primary mammalian species used for studying alternative splicing events. Recent studies have generated computational models to predict functions for splice isoforms in the mouse. However, the functional relationship network, describing the probability of splice isoforms participating in the same biological process or pathway, has not yet been studied in the mouse. Here we describe a rich genome-wide resource of mouse networks at the isoform level, which was generated using a unique framework that was originally developed to infer isoform functions. This network was built through integrating heterogeneous genomic and protein data, including RNA-seq, exon array, protein docking and pseudo-amino acid composition. Through simulation and cross-validation studies, we demonstrated the accuracy of the algorithm in predicting isoform-level functional relationships. We showed that this network enables the users to reveal functional differences of the isoforms of the same gene, as illustrated by literature evidence with Anxa6 (annexin a6) as an example. We expect this work will become a useful resource for the mouse genetics community to understand gene functions. The network is publicly available at: http://guanlab.ccmb.med.umich.edu/isoformnetwork.
|
33
|
Ruprecht C, Mendrinna A, Tohge T, Sampathkumar A, Klie S, Fernie AR, Nikoloski Z, Persson S, Mutwil M. FamNet: A Framework to Identify Multiplied Modules Driving Pathway Expansion in Plants. PLANT PHYSIOLOGY 2016; 170:1878-94. [PMID: 26754669 PMCID: PMC4775111 DOI: 10.1104/pp.15.01281] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 08/14/2015] [Accepted: 01/07/2016] [Indexed: 05/07/2023]
Abstract
Gene duplications generate new genes that can acquire similar but often diversified functions. Recent studies of gene coexpression networks have indicated that, not only genes, but also pathways can be multiplied and diversified to perform related functions in different parts of an organism. Identification of such diversified pathways, or modules, is needed to expand our knowledge of biological processes in plants and to understand how biological functions evolve. However, systematic explorations of modules remain scarce, and no user-friendly platform to identify them exists. We have established a statistical framework to identify modules and show that approximately one-third of the genes of a plant's genome participate in hundreds of multiplied modules. Using this framework as a basis, we implemented a platform that can explore and visualize multiplied modules in coexpression networks of eight plant species. To validate the usefulness of the platform, we identified and functionally characterized pollen- and root-specific cell wall modules that multiplied to confer tip growth in pollen tubes and root hairs, respectively. Furthermore, we identified multiplied modules involved in secondary metabolite synthesis and corroborated them by metabolite profiling of tobacco (Nicotiana tabacum) tissues. The interactive platform, referred to as FamNet, is available at http://www.gene2function.de/famnet.html.
Affiliation(s)
- Colin Ruprecht
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Amelie Mendrinna
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Takayuki Tohge
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Arun Sampathkumar
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Sebastian Klie
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Alisdair R Fernie
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Zoran Nikoloski
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Staffan Persson
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
| | - Marek Mutwil
- Max Planck Institute for Molecular Plant Physiology, 14476 Potsdam, Germany (C.R., T.T., S.K., A.R.F., Z.N., M.M.); School of Biosciences and Australian Research Council Centre of Excellence in Plant Cell Walls, University of Melbourne, Parkville, Victoria 3010, Australia (A.M., S.P.); and Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125 (A.S.)
|
34
|
ADAGE-Based Integration of Publicly Available Pseudomonas aeruginosa Gene Expression Data with Denoising Autoencoders Illuminates Microbe-Host Interactions. mSystems 2016; 1:mSystems00025-15. [PMID: 27822512 PMCID: PMC5069748 DOI: 10.1128/msystems.00025-15] [Citation(s) in RCA: 76] [Impact Index Per Article: 8.4] [Received: 12/07/2015] [Accepted: 12/08/2015] [Indexed: 12/21/2022] Open
Abstract
The increasing number of genome-wide assays of gene expression available from public databases presents opportunities for computational methods that facilitate hypothesis generation and biological interpretation of these data. We present an unsupervised machine learning approach, ADAGE (analysis using denoising autoencoders of gene expression), and apply it to the publicly available gene expression data compendium for Pseudomonas aeruginosa. In this approach, the machine-learned ADAGE model contained 50 nodes which we predicted would correspond to gene expression patterns across the gene expression compendium. While no biological knowledge was used during model construction, cooperonic genes had similar weights across nodes, and genes with similar weights across nodes were significantly more likely to share KEGG pathways. By analyzing newly generated and previously published microarray and transcriptome sequencing data, the ADAGE model identified differences between strains, modeled the cellular response to low oxygen, and predicted the involvement of biological processes based on low-level gene expression differences. ADAGE compared favorably with traditional principal component analysis and independent component analysis approaches in its ability to extract validated patterns, and based on our analyses, we propose that these approaches differ in the types of patterns they preferentially identify. We provide the ADAGE model with analysis of all publicly available P. aeruginosa GeneChip experiments and open source code for use with other species and settings. Extraction of consistent patterns across large-scale collections of genomic data using methods like ADAGE provides the opportunity to identify general principles and biologically important patterns in microbial biology. This approach will be particularly useful in less-well-studied microbial species. 
IMPORTANCE The quantity and breadth of genome-scale data sets that examine RNA expression in diverse bacterial and eukaryotic species are increasing more rapidly than for curated knowledge. Our ADAGE method integrates such data without requiring gene function, gene pathway, or experiment labeling, making practical its application to any large gene expression compendium. We built a Pseudomonas aeruginosa ADAGE model from a diverse set of publicly available experiments without any prespecified biological knowledge, and this model was accurate and predictive. We provide ADAGE results for the complete P. aeruginosa GeneChip compendium for use by researchers studying P. aeruginosa and source code that facilitates ADAGE's application to other species and data types.
|
35
|
Gonzalez GH, Tahsin T, Goodale BC, Greene AC, Greene CS. Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery. Brief Bioinform 2015; 17:33-42. [PMID: 26420781 PMCID: PMC4719073 DOI: 10.1093/bib/bbv087] [Citation(s) in RCA: 73] [Impact Index Per Article: 7.3] [Received: 02/17/2015] [Indexed: 02/06/2023] Open
Abstract
Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.
|
36
|
Gui J, Greene CS, Sullivan C, Taylor W, Moore JH, Kim C. Testing multiple hypotheses through IMP weighted FDR based on a genetic functional network with application to a new zebrafish transcriptome study. BioData Min 2015; 8:17. [PMID: 26097506 PMCID: PMC4474579 DOI: 10.1186/s13040-015-0050-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/03/2014] [Accepted: 06/08/2015] [Indexed: 11/10/2022] Open
Abstract
In genome-wide studies, hundreds of thousands of hypothesis tests are performed simultaneously. Bonferroni correction and False Discovery Rate (FDR) can effectively control type I error but often yield a high false negative rate. We aim to develop a more powerful method to detect differentially expressed genes. We present a Weighted False Discovery Rate (WFDR) method that incorporates biological knowledge from genetic networks. We first identify weights using Integrative Multi-species Prediction (IMP) and then apply the weights in WFDR to identify differentially expressed genes through an IMP-WFDR algorithm. We performed a gene expression experiment to identify zebrafish genes that change expression in the presence of arsenic during a systemic Pseudomonas aeruginosa infection. Zebrafish were exposed to arsenic at 10 parts per billion and/or infected with P. aeruginosa. Appropriate controls were included. We then applied IMP-WFDR during the analysis of differentially expressed genes. We compared the mRNA expression for each group and found over 200 differentially expressed genes and several enriched pathways including defense response pathways, arsenic response pathways, and the Notch signaling pathway.
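To make the idea of a weighted FDR concrete, the sketch below implements the generic weighted Benjamini-Hochberg procedure, in which each p-value is divided by a prior weight before the usual step-up rule is applied. It is an illustration under stated assumptions, not the IMP-WFDR algorithm itself: how that method derives weights from IMP is not given in this abstract, and the function name `weighted_bh`, the rescaling of weights to mean 1, and the toy numbers are assumptions.

```python
import numpy as np

def weighted_bh(pvalues, weights, alpha=0.05):
    """Weighted Benjamini-Hochberg: boolean mask of rejected hypotheses.

    Weights are rescaled to average 1, each p-value is divided by its
    weight, and the standard BH step-up rule is applied to the result.
    """
    p = np.asarray(pvalues, dtype=float)
    w = np.asarray(weights, dtype=float)
    if p.shape != w.shape or p.ndim != 1:
        raise ValueError("pvalues and weights must be 1-D and equal length")
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    m = p.size
    w = w * (m / w.sum())            # rescale so the mean weight is 1
    q = np.full(m, np.inf)           # zero-weight tests can never be rejected
    positive = w > 0
    q[positive] = p[positive] / w[positive]
    order = np.argsort(q)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(q[order] <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passed.size:
        # Step-up rule: reject everything up to the largest passing rank.
        reject[order[: passed[-1] + 1]] = True
    return reject

# Toy example (hypothetical numbers): the third gene gets a high prior weight.
toy_p = [0.001, 0.02, 0.03, 0.5]
unweighted = weighted_bh(toy_p, [1, 1, 1, 1])
weighted = weighted_bh(toy_p, [0.5, 0.5, 2.5, 0.5])
```

With equal weights the procedure reduces to ordinary Benjamini-Hochberg. With unequal weights, genes favoured by prior knowledge are easier to reject and down-weighted genes are harder, which is where the gain or loss in power comes from.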
Affiliation(s)
- Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH USA; Dartmouth-Hitchcock Medical Center, 883 Rubin Bldg, HB7927, One Medical Center Dr., Lebanon, NH USA
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Con Sullivan
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME USA; Graduate School of Biomedical Science and Engineering, University of Maine, Orono, ME USA
| | - Walter Taylor
- Department of Genetics, Geisel School of Medicine, Dartmouth College, Hanover, NH USA
| | - Jason H Moore
- Department of Biostatistics and Epidemiology, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA USA
| | - Carol Kim
- Department of Molecular and Biomedical Sciences, University of Maine, Orono, ME USA; Graduate School of Biomedical Science and Engineering, University of Maine, Orono, ME USA
|
37
|
Wong AK, Krishnan A, Yao V, Tadych A, Troyanskaya OG. IMP 2.0: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res 2015; 43:W128-33. [PMID: 25969450 PMCID: PMC4489318 DOI: 10.1093/nar/gkv486] [Citation(s) in RCA: 65] [Impact Index Per Article: 6.5] [Received: 02/08/2015] [Accepted: 05/02/2015] [Indexed: 01/08/2023] Open
Abstract
IMP (Integrative Multi-species Prediction), originally released in 2012, is an interactive web server that enables molecular biologists to interpret experimental results and to generate hypotheses in the context of a large cross-organism compendium of functional predictions and networks. The system provides biologists with a framework to analyze their candidate gene sets in the context of functional networks, expanding or refining their sets using functional relationships predicted from integrated high-throughput data. IMP 2.0 integrates updated prior knowledge and data collections from the last three years in the seven supported organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Caenorhabditis elegans, and Saccharomyces cerevisiae) and extends function prediction coverage to include human disease. IMP identifies homologs with conserved functional roles for disease knowledge transfer, allowing biologists to analyze disease contexts and predictions across all organisms. Additionally, IMP 2.0 implements a new flexible platform for experts to generate custom hypotheses about biological processes or diseases, making sophisticated data-driven methods easily accessible to researchers. IMP does not require any registration or installation and is freely available for use at http://imp.princeton.edu.
Affiliation(s)
- Aaron K Wong
- Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA; Simons Center for Data Analysis, Simons Foundation, NY 10010, USA
| | - Arjun Krishnan
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
| | - Victoria Yao
- Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
| | - Alicja Tadych
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA
| | - Olga G Troyanskaya
- Department of Computer Science, Princeton University, Princeton, NJ 08540, USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA; Simons Center for Data Analysis, Simons Foundation, NY 10010, USA
|
38
|
Understanding multicellular function and disease with human tissue-specific networks. Nat Genet 2015; 47:569-76. [PMID: 25915600 PMCID: PMC4828725 DOI: 10.1038/ng.3259] [Citation(s) in RCA: 594] [Impact Index Per Article: 59.4] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2014] [Accepted: 03/06/2015] [Indexed: 12/17/2022]
Abstract
Tissue and cell-type identity lie at the core of human physiology and disease. Understanding the genetic underpinnings of complex tissues and individual cell lineages is crucial for developing improved diagnostics and therapeutics. We present genome-wide functional interaction networks for 144 human tissues and cell types developed using a data-driven Bayesian methodology that integrates thousands of diverse experiments spanning tissue and disease states. Tissue-specific networks predict lineage-specific responses to perturbation, reveal genes’ changing functional roles across tissues, and illuminate disease-disease relationships. We introduce NetWAS, which combines genes with nominally significant GWAS p-values and tissue-specific networks to identify disease-gene associations more accurately than GWAS alone. Our webserver, GIANT, provides an interface to human tissue networks through multi-gene queries, network visualization, analysis tools including NetWAS, and downloadable networks. GIANT enables systematic exploration of the landscape of interacting genes that shape specialized cellular functions across more than one hundred human tissues and cell types.
|
39
|
Greene AC, Giffin KA, Greene CS, Moore JH. Adapting bioinformatics curricula for big data. Brief Bioinform 2015; 17:43-50. [PMID: 25829469 PMCID: PMC4719066 DOI: 10.1093/bib/bbv018] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2014] [Indexed: 12/16/2022] Open
Abstract
Modern technologies are capable of generating enormous amounts of data that measure complex biological systems. Computational biologists and bioinformatics scientists are increasingly being asked to use these data to reveal key systems-level properties. We review the extent to which curricula are changing in the era of big data. We identify key competencies that scientists dealing with big data are expected to possess across fields, and we use this information to propose courses to meet these growing needs. While bioinformatics programs have traditionally trained students in data-intensive science, we identify areas of particular biological, computational and statistical emphasis important for this era that can be incorporated into existing curricula. For each area, we propose a course structured around these topics, which can be adapted in whole or in parts into existing curricula. In summary, specific challenges associated with big data provide an important opportunity to update existing curricula, but we do not foresee a wholesale redesign of bioinformatics training programs.
|
40
|
Wangler MF, Yamamoto S, Bellen HJ. Fruit flies in biomedical research. Genetics 2015; 199:639-653. [PMID: 25624315 PMCID: PMC4349060 DOI: 10.1534/genetics.114.171785] [Citation(s) in RCA: 119] [Impact Index Per Article: 11.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2014] [Accepted: 12/09/2014] [Indexed: 12/13/2022] Open
Abstract
Many scientists complain that the current funding situation is dire. Indeed, there has been an overall decline in support in funding for research from the National Institutes of Health and the National Science Foundation. Within the Drosophila field, some of us question how long this funding crunch will last as it demotivates principal investigators and perhaps more importantly affects the long-term career choice of many young scientists. Yet numerous very interesting biological processes and avenues remain to be investigated in Drosophila, and probing questions can be answered fast and efficiently in flies to reveal new biological phenomena. Moreover, Drosophila is an excellent model organism for studies that have translational impact for genetic disease and for other medical implications such as vector-borne illnesses. We would like to promote a better collaboration between Drosophila geneticists/biologists and human geneticists/bioinformaticians/clinicians, as it would benefit both fields and significantly impact the research on human diseases.
Affiliation(s)
- Michael F Wangler
- Department of Molecular and Human Genetics, Baylor College of Medicine (BCM), Houston, Texas 77030; Department of Pediatrics, Baylor College of Medicine (BCM), Houston, Texas 77030; Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, Texas 77030
| | - Shinya Yamamoto
- Department of Molecular and Human Genetics, Baylor College of Medicine (BCM), Houston, Texas 77030; Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, Texas 77030; Program in Developmental Biology, Baylor College of Medicine (BCM), Texas 77030
| | - Hugo J Bellen
- Department of Molecular and Human Genetics, Baylor College of Medicine (BCM), Houston, Texas 77030; Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, Houston, Texas 77030; Program in Developmental Biology, Baylor College of Medicine (BCM), Texas 77030; Department of Neuroscience, Baylor College of Medicine (BCM), Texas 77030; Howard Hughes Medical Institute, Houston, Texas 77030
|
41
|
Park CY, Krishnan A, Zhu Q, Wong AK, Lee YS, Troyanskaya OG. Tissue-aware data integration approach for the inference of pathway interactions in metazoan organisms. Bioinformatics 2014; 31:1093-101. [PMID: 25431329 DOI: 10.1093/bioinformatics/btu786] [Citation(s) in RCA: 69] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2014] [Accepted: 11/20/2014] [Indexed: 11/12/2022]
Abstract
MOTIVATION Leveraging the large compendium of genomic data to predict biomedical pathways and specific mechanisms of protein interactions genome-wide in metazoan organisms has been challenging. In contrast to unicellular organisms, biological and technical variation originating from diverse tissues and cell-lineages is often the largest source of variation in metazoan data compendia. Therefore, a new computational strategy accounting for the tissue heterogeneity in the functional genomic data is needed to accurately translate the vast amount of human genomic data into specific interaction-level hypotheses. RESULTS We developed an integrated, scalable strategy for inferring multiple human gene interaction types that takes advantage of data from diverse tissue and cell-lineage origins. Our approach specifically predicts both the presence of a functional association and also the most likely interaction type among human genes or their protein products on a whole-genome scale. We demonstrate that directly incorporating tissue contextual information improves the accuracy of our predictions, and further, that such genome-wide results can be used to significantly refine regulatory interactions from primary experimental datasets (e.g. ChIP-Seq, mass spectrometry). AVAILABILITY AND IMPLEMENTATION An interactive website hosting all of our interaction predictions is publicly available at http://pathwaynet.princeton.edu. Software was implemented using the open-source Sleipnir library, which is available for download at https://bitbucket.org/libsleipnir/libsleipnir.bitbucket.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Christopher Y Park
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
| | - Arjun Krishnan
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
| | - Qian Zhu
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
| | - Aaron K Wong
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
| | - Young-Suk Lee
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
| | - Olga G Troyanskaya
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA, Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08540, USA and Simons Center for Data Analysis, Simons Foundation, New York, NY, 10010, USA
|
42
|
Li HD, Menon R, Omenn GS, Guan Y. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence. Proteomics 2014; 14:2709-18. [PMID: 25265570 DOI: 10.1002/pmic.201400170] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2014] [Revised: 08/11/2014] [Accepted: 09/23/2014] [Indexed: 01/08/2023]
Abstract
Canonical isoforms in different databases have been defined as the most prevalent, most conserved, most expressed, longest, or the one with the clearest description of domains or posttranslational modifications. In this article, we revisit these definitions of canonical isoforms based on functional genomics and proteomics evidence, focusing on mouse data. We report a novel functional relationship network-based approach for identifying the highest connected isoforms (HCIs). We show that 46% of these HCIs are not the longest transcripts. In addition, this approach revealed many genes that have more than one highly connected isoform. Averaged across 175 RNA-seq datasets covering diverse tissues and conditions, 65% of the HCIs show higher expression levels than nonhighest connected isoforms at the transcript level. At the protein level, these HCIs highly overlap with the expressed splice variants, based on proteomic data from eight different normal tissues. These results suggest that a more confident definition of canonical isoforms can be made through integration of multiple lines of evidence, including HCIs defined by biological processes and pathways, expression prevalence at the transcript level, and relative or absolute abundance at the protein level. This integrative proteogenomics approach can successfully identify principal isoforms that are responsible for the canonical functions of genes.
Affiliation(s)
- Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
|
43
|
Joice R, Yasuda K, Shafquat A, Morgan XC, Huttenhower C. Determining microbial products and identifying molecular targets in the human microbiome. Cell Metab 2014; 20:731-741. [PMID: 25440055 PMCID: PMC4254638 DOI: 10.1016/j.cmet.2014.10.003] [Citation(s) in RCA: 71] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 02/08/2023]
Abstract
Human-associated microbes are the source of many bioactive microbial products (proteins and metabolites) that play key functions both in human host pathways and in microbe-microbe interactions. Culture-independent studies now provide an accelerated means of exploring novel bioactives in the human microbiome; however, intriguingly, a substantial fraction of the microbial metagenome cannot be mapped to annotated genes or isolate genomes and is thus of unknown function. Meta'omic approaches, including metagenomic sequencing, metatranscriptomics, metabolomics, and integration of multiple assay types, represent an opportunity to efficiently explore this large pool of potential therapeutics. In combination with appropriate follow-up validation, high-throughput culture-independent assays can be combined with computational approaches to identify and characterize novel and biologically interesting microbial products. Here we briefly review the state of microbial product identification and characterization and discuss possible next steps to catalog and leverage the large uncharted fraction of the microbial metagenome.
Affiliation(s)
- Regina Joice
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Koji Yasuda
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Afrah Shafquat
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Xochitl C Morgan
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Curtis Huttenhower
- Department of Biostatistics, Harvard School of Public Health, Boston, MA 02115, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
|
44
|
Yu J, Wu H, Wen Y, Liu Y, Zhou T, Ni B, Lin Y, Dong J, Zhou Z, Hu Z, Guo X, Sha J, Tong C. Identification of seven genes essential for male fertility through a genome-wide association study of non-obstructive azoospermia and RNA interference-mediated large-scale functional screening in Drosophila. Hum Mol Genet 2014; 24:1493-503. [DOI: 10.1093/hmg/ddu557] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022] Open
|
45
|
Zhu F, Shi L, Li H, Eksi R, Engel JD, Guan Y. Modeling dynamic functional relationship networks and application to ex vivo human erythroid differentiation. Bioinformatics 2014; 30:3325-33. [PMID: 25115705 DOI: 10.1093/bioinformatics/btu542] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/05/2023]
Abstract
MOTIVATION Functional relationship networks, which summarize the probability of co-functionality between any two genes in the genome, could complement the reductionist focus of modern biology for understanding diverse biological processes in an organism. One major limitation of the current networks is that they are static, while one might expect functional relationships to consistently reprogram during the differentiation of a cell lineage. To address this potential limitation, we developed a novel algorithm that leverages both differentiation stage-specific expression data and large-scale heterogeneous functional genomic data to model such dynamic changes. We then applied this algorithm to the time-course RNA-Seq data we collected for ex vivo human erythroid cell differentiation. RESULTS Through computational cross-validation and literature validation, we show that the resulting networks correctly predict the (de)-activated functional connections between genes during erythropoiesis. We identified known critical genes, such as HBD and GATA1, and functional connections during erythropoiesis using these dynamic networks, while the traditional static network was not able to provide such information. Furthermore, by comparing the static and the dynamic networks, we identified novel genes (such as OSBP2 and PDZK1IP1) that are potential drivers of erythroid cell differentiation. This novel method of modeling dynamic networks is applicable to other differentiation processes where time-course genome-scale expression data are available, and should assist in generating greater understanding of the functional dynamics at play across the genome during development. AVAILABILITY AND IMPLEMENTATION The network described in this article is available at http://guanlab.ccmb.med.umich.edu/stageSpecificNetwork.
Affiliation(s)
- Fan Zhu
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
| | - Lihong Shi
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
| | - Hongdong Li
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
| | - Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
| | - James Douglas Engel
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, Department of Cell and Developmental Biology, Department of Internal Medicine and Department of Computer Science and Engineering, University of Michigan, MI48109, USA
|
46
|
Selecting biologically informative genes in co-expression networks with a centrality score. Biol Direct 2014; 9:12. [PMID: 24947308 PMCID: PMC4079186 DOI: 10.1186/1745-6150-9-12] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2014] [Accepted: 06/11/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Measures of node centrality in biological networks are useful to detect genes with critical functional roles. In gene co-expression networks, highly connected genes (i.e., candidate hubs) have been associated with key disease-related pathways. Although different approaches to estimating gene centrality are available, their potential biological relevance in gene co-expression networks deserves further investigation. Moreover, standard measures of gene centrality focus on binary interaction networks, which may not always be suitable in the context of co-expression networks. Here, I also investigate a method that identifies potential biologically meaningful genes based on a weighted connectivity score and indicators of statistical relevance. RESULTS The method enables a characterization of the strength and diversity of co-expression associations in the network. It outperformed standard centrality measures by highlighting more biologically informative genes in different gene co-expression networks and biological research domains. As part of the illustration of the gene selection potential of this approach, I present an application case in zebrafish heart regeneration. The proposed technique predicted genes that are significantly implicated in cellular processes required for tissue regeneration after injury. CONCLUSIONS A method for selecting biologically informative genes from gene co-expression networks is provided, together with free open software.
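The idea of a weighted connectivity score can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a gene's score is the sum of the absolute co-expression weights on its edges; the abstract does not give the method's actual scoring formula or its indicators of statistical relevance, so neither is reproduced here. The function name `weighted_connectivity` and the gene labels are hypothetical.

```python
from collections import defaultdict


def weighted_connectivity(edges):
    """Return a gene -> score map, where the score is the sum of |weight|
    over all edges touching the gene (an assumed, simplified definition)."""
    score = defaultdict(float)
    for gene_a, gene_b, weight in edges:
        # Each undirected edge contributes its strength to both endpoints.
        score[gene_a] += abs(weight)
        score[gene_b] += abs(weight)
    return dict(score)


# Hypothetical co-expression edges: (gene, gene, correlation).
edges = [("g1", "g2", 0.9), ("g1", "g3", -0.8), ("g2", "g3", 0.1)]
scores = weighted_connectivity(edges)

# Rank genes from most to least strongly connected.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Unlike a binary degree count, in which all three genes here would tie with two neighbours each, the weighted score separates them by the strength of their associations.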
|
47
|
Penrod NM, Greene CS, Moore JH. Predicting targeted drug combinations based on Pareto optimal patterns of coexpression network connectivity. Genome Med 2014; 6:33. [PMID: 24944582 PMCID: PMC4062052 DOI: 10.1186/gm550] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2013] [Accepted: 04/22/2014] [Indexed: 01/05/2023] Open
Abstract
Background Molecularly targeted drugs promise a safer and more effective treatment modality than conventional chemotherapy for cancer patients. However, tumors are dynamic systems that readily adapt to these agents, activating alternative survival pathways as they evolve resistant phenotypes. Combination therapies can overcome resistance but finding the optimal combinations efficiently presents a formidable challenge. Here we introduce a new paradigm for the design of combination therapy treatment strategies that exploits the tumor adaptive process to identify context-dependent essential genes as druggable targets. Methods We have developed a framework to mine high-throughput transcriptomic data, based on differential coexpression and Pareto optimization, to investigate drug-induced tumor adaptation. We use this approach to identify tumor-essential genes as druggable candidates. We apply our method to a set of ER+ breast tumor samples, collected before (n = 58) and after (n = 60) neoadjuvant treatment with the aromatase inhibitor letrozole, to prioritize genes as targets for combination therapy with letrozole treatment. We validate letrozole-induced tumor adaptation through coexpression and pathway analyses in an independent data set (n = 18). Results We find pervasive differential coexpression between the untreated and letrozole-treated tumor samples as evidence of letrozole-induced tumor adaptation. Based on patterns of coexpression, we identify ten genes as potential candidates for combination therapy with letrozole including EPCAM, a letrozole-induced essential gene and a target to which drugs have already been developed as cancer therapeutics. Through replication, we validate six letrozole-induced coexpression relationships and confirm the epithelial-to-mesenchymal transition as a process that is upregulated in the residual tumor samples following letrozole treatment.
Conclusions To derive the greatest benefit from molecularly targeted drugs it is critical to design combination treatment strategies rationally. Incorporating knowledge of the tumor adaptation process into the design provides an opportunity to match targeted drugs to the evolving tumor phenotype and surmount resistance.
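As a rough illustration of Pareto optimization in general, the Python sketch below picks the non-dominated candidates from a set scored on two objectives. It is a generic textbook construction, not the authors' pipeline. The gene labels, the two objective values, and the "higher is better" convention are all assumptions made for the example.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (higher values are assumed to be better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    )


# Hypothetical genes scored on two hypothetical connectivity objectives.
candidates = {
    "geneA": (5, 1),
    "geneB": (3, 3),
    "geneC": (1, 5),
    "geneD": (2, 2),  # dominated by geneB
    "geneE": (3, 1),  # dominated by geneA
}
front = pareto_front(candidates)
print(front)
```

The front keeps every candidate that represents a distinct trade-off between the objectives, so no single weighting of the objectives has to be chosen in advance.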
Affiliation(s)
- Nadia M Penrod
- Department of Pharmacology and Toxicology, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
| | - Casey S Greene
- Department of Genetics, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA ; Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
| | - Jason H Moore
- Department of Genetics, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA ; Institute for Quantitative Biomedical Sciences, Geisel School of Medicine at Dartmouth College, HB7937 One Medical Center Dr, Lebanon NH 03766, USA
|
48
|
Li L, Cui X, Yu S, Zhang Y, Luo Z, Yang H, Zhou Y, Zheng X. PSSP-RFE: accurate prediction of protein structural class by recursive feature extraction from PSI-BLAST profile, physical-chemical property and functional annotations. PLoS One 2014; 9:e92863. [PMID: 24675610 PMCID: PMC3968047 DOI: 10.1371/journal.pone.0092863] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2013] [Accepted: 02/27/2014] [Indexed: 02/05/2023] Open
Abstract
Protein structure prediction is critical to functional annotation of the massively accumulated biological sequences, which prompts an imperative need for the development of high-throughput technologies. As a first and key step in protein structure prediction, protein structural class prediction becomes an increasingly challenging task. For most homology-based approaches, the accuracy of protein structural class prediction is sufficiently high for high-similarity datasets but still far from satisfactory for low-similarity datasets, i.e., those below 40% pairwise sequence similarity. Therefore, we present a novel method for accurate and reliable protein structural class prediction for both high and low similarity datasets. This method is based on Support Vector Machine (SVM) in conjunction with integrated features from position-specific score matrix (PSSM), PROFEAT and Gene Ontology (GO). A feature selection approach, SVM-RFE, is also used to rank the integrated feature vectors through recursively removing the feature with the lowest ranking score. The definitive top features selected by SVM-RFE are input into the SVM engines to predict the structural class of a query protein. To validate our method, jackknife tests were applied to seven widely used benchmark datasets, reaching overall accuracies between 84.61% and 99.79%, which are significantly higher than those achieved by state-of-the-art tools. These results suggest that our method could serve as an accurate and cost-effective alternative to existing methods in protein structural classification, especially for low similarity datasets.
Affiliation(s)
- Liqi Li
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Xiang Cui
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Sanjiu Yu
- Institute of Cardiovascular Diseases of PLA, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Yuan Zhang
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Zhong Luo
- Key Laboratory of Biorheological Science and Technology, Ministry of Education, College of Bioengineering, Chongqing University, Chongqing, China
| | - Hua Yang
- Department of General Surgery, Xinqiao Hospital, Third Military Medical University, Chongqing, China
- Department of Surgery, The University of Michigan Medical School, Ann Arbor, Michigan, United States of America
| | - Yue Zhou
- Department of Orthopedics, Xinqiao Hospital, Third Military Medical University, Chongqing, China
| | - Xiaoqi Zheng
- Department of Mathematics, Shanghai Normal University, Shanghai, China
- Department of Biostatistics and Computational Biology, Harvard School of Public Health, Boston, United States of America
|