1
Kernfeld E, Keener R, Cahan P, Battle A. Transcriptome data are insufficient to control false discoveries in regulatory network inference. Cell Syst 2024; 15:709-724.e13. [PMID: 39173585 PMCID: PMC11642480 DOI: 10.1016/j.cels.2024.07.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2023] [Revised: 05/31/2024] [Accepted: 07/22/2024] [Indexed: 08/24/2024]
Abstract
Inference of causal transcriptional regulatory networks (TRNs) from transcriptomic data suffers notoriously from false positives. Approaches to control the false discovery rate (FDR), for example, via permutation, bootstrapping, or multivariate Gaussian distributions, suffer from several complications: difficulty in distinguishing direct from indirect regulation, nonlinear effects, and causal structure inference requiring "causal sufficiency," meaning experiments that are free of any unmeasured, confounding variables. Here, we use a recently developed statistical framework, model-X knockoffs, to control the FDR while accounting for indirect effects, nonlinear dose-response, and user-provided covariates. We adjust the procedure to estimate the FDR correctly even when measured against incomplete gold standards. However, benchmarking against chromatin immunoprecipitation (ChIP) and other gold standards reveals higher observed than reported FDR. This indicates that unmeasured confounding is a major driver of FDR in TRN inference. A record of this paper's transparent peer review process is included in the supplemental information.
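The selection step that gives knockoff-based procedures their FDR guarantee can be illustrated with a minimal sketch. It assumes the knockoff statistics W (one per candidate regulator, positive when the real feature beats its knockoff) have already been computed; constructing the knockoffs and the statistics is not shown, and the function name and input values are hypothetical. The threshold rule is the standard knockoff+ rule, not necessarily the adjusted procedure used in the paper.

```python
import numpy as np

def knockoff_plus_select(w, q):
    """Return indices selected by the knockoff+ threshold at target FDR q.

    w: knockoff statistics; large positive values favor the real feature,
       negative values mean the knockoff looked more important.
    """
    w = np.asarray(w, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes of W.
    candidates = np.sort(np.unique(np.abs(w[w != 0])))
    for t in candidates:
        # Negative statistics beyond -t estimate the number of false positives.
        false_proxy = 1 + np.sum(w <= -t)
        selected = max(1, np.sum(w >= t))
        if false_proxy / selected <= q:
            return np.flatnonzero(w >= t).tolist()
    return []  # no threshold meets the target FDR

# Hypothetical statistics for ten candidate regulators of one target gene.
w_example = [5, 4, 3, 2.5, 2, 1.5, -1, 0.5, 0.3, -0.2]
print(knockoff_plus_select(w_example, q=0.2))  # [0, 1, 2, 3, 4, 5]
```

The guarantee holds only if the knockoffs are valid for the true joint distribution of the measured features, which is where unmeasured confounding can break it.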
Affiliation(s)
- Eric Kernfeld
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Rebecca Keener
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA
- Patrick Cahan
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Institute for Cell Engineering, Johns Hopkins Medicine, Baltimore, MD, USA; Department of Molecular Biology and Genetics, Johns Hopkins University, Baltimore, MD, USA.
- Alexis Battle
  - Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles Street, Wyman Park Building, Suite 400 West, Baltimore, MD 21218, USA; Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA; Department of Genetic Medicine, Johns Hopkins Medicine, Baltimore, MD, USA; Malone Center for Engineering and Healthcare, Johns Hopkins University, Baltimore, MD, USA; Data Science and AI Institute, Johns Hopkins University, Baltimore, MD, USA.
2
Chen Y, Mao R, Xu J, Huang Y, Xu J, Cui S, Zhu Z, Ji X, Huang S, Huang Y, Huang HY, Yen SC, Lin YCD, Huang HD. A Causal Regulation Modeling Algorithm for Temporal Events with Application to Escherichia coli's Aerobic to Anaerobic Transition. Int J Mol Sci 2024; 25:5654. [PMID: 38891842 PMCID: PMC11171773 DOI: 10.3390/ijms25115654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2024] [Revised: 05/10/2024] [Accepted: 05/21/2024] [Indexed: 06/21/2024] Open
Abstract
Time-series experiments are crucial for understanding the transient and dynamic nature of biological phenomena. These experiments, leveraging advanced classification and clustering algorithms, allow for a deep dive into the cellular processes. However, while these approaches effectively identify patterns and trends within data, they often need to improve in elucidating the causal mechanisms behind these changes. Building on this foundation, our study introduces a novel algorithm for temporal causal signaling modeling, integrating established knowledge networks with sequential gene expression data to elucidate signal transduction pathways over time. Focusing on Escherichia coli's (E. coli) aerobic to anaerobic transition (AAT), this research marks a significant leap in understanding the organism's metabolic shifts. By applying our algorithm to a comprehensive E. coli regulatory network and a time-series microarray dataset, we constructed the cross-time point core signaling and regulatory processes of E. coli's AAT. Through gene expression analysis, we validated the primary regulatory interactions governing this process. We identified a novel regulatory scheme wherein environmentally responsive genes, soxR and oxyR, activate fur, modulating the nitrogen metabolism regulators fnr and nac. This regulatory cascade controls the stress regulators ompR and lrhA, ultimately affecting the cell motility gene flhD, unveiling a novel regulatory axis that elucidates the complex regulatory dynamics during the AAT process. Our approach, merging empirical data with prior knowledge, represents a significant advance in modeling cellular signaling processes, offering a deeper understanding of microbial physiology and its applications in biotechnology.
Affiliation(s)
- Yigang Chen
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Runbo Mao
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Jiatong Xu
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Yixian Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Jingyi Xu
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Shidong Cui
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Zihao Zhu
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Xiang Ji
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Shenghan Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Yanzhe Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Hsi-Yuan Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Shih-Chung Yen
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Yang-Chi-Duang Lin
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
- Hsien-Da Huang
  - School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
  - Warshel Institute for Computational Biology, School of Medicine, The Chinese University of Hong Kong, Shenzhen, Longgang District, Shenzhen 518172, China
3
Velten B, Stegle O. Principles and challenges of modeling temporal and spatial omics data. Nat Methods 2023; 20:1462-1474. [PMID: 37710019 DOI: 10.1038/s41592-023-01992-y] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 05/06/2022] [Accepted: 07/31/2023] [Indexed: 09/16/2023]
Abstract
Studies with temporal or spatial resolution are crucial to understand the molecular dynamics and spatial dependencies underlying a biological process or system. With advances in high-throughput omic technologies, time- and space-resolved molecular measurements at scale are increasingly accessible, providing new opportunities to study the role of timing or structure in a wide range of biological questions. At the same time, analyses of the data being generated in the context of spatiotemporal studies entail new challenges that need to be considered, including the need to account for temporal and spatial dependencies and compare them across different scales, biological samples or conditions. In this Review, we provide an overview of common principles and challenges in the analysis of temporal and spatial omics data. We discuss statistical concepts to model temporal and spatial dependencies and highlight opportunities for adapting existing analysis methods to data with temporal and spatial dimensions.
Affiliation(s)
- Britta Velten
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
  - Cellular Genetics Programme, Wellcome Sanger Institute, Hinxton, Cambridge, UK.
  - Centre for Organismal Studies (COS) and Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University, Heidelberg, Germany.
- Oliver Stegle
  - Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany.
  - Cellular Genetics Programme, Wellcome Sanger Institute, Hinxton, Cambridge, UK.
  - Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany.
4
Deshpande A, Chu LF, Stewart R, Gitter A. Network inference with Granger causality ensembles on single-cell transcriptomics. Cell Rep 2022; 38:110333. [PMID: 35139376 PMCID: PMC9093087 DOI: 10.1016/j.celrep.2022.110333] [Citation(s) in RCA: 50] [Impact Index Per Article: 16.7] [Received: 02/05/2019] [Revised: 02/19/2021] [Accepted: 01/12/2022] [Indexed: 12/20/2022] Open
Abstract
Cellular gene expression changes throughout a dynamic biological process, such as differentiation. Pseudotimes estimate cells' progress along a dynamic process based on their individual gene expression states. Ordering the expression data by pseudotime provides information about the underlying regulator-gene interactions. Because the pseudotime distribution is not uniform, many standard mathematical methods are inapplicable for analyzing the ordered gene expression states. Here we present single-cell inference of networks using Granger ensembles (SINGE), an algorithm for gene regulatory network inference from ordered single-cell gene expression data. SINGE uses kernel-based Granger causality regression to smooth irregular pseudotimes and missing expression values. It aggregates predictions from an ensemble of regression analyses to compile a ranked list of candidate interactions between transcriptional regulators and target genes. In two mouse embryonic stem cell differentiation datasets, SINGE outperforms other contemporary algorithms. However, a more detailed examination reveals caveats about poor performance for individual regulators and uninformative pseudotimes.
Affiliation(s)
- Atul Deshpande
  - Department of Electrical and Computer Engineering, University of Wisconsin - Madison, Madison, WI 53706, USA; Morgridge Institute for Research, Madison, WI 53715, USA
- Li-Fang Chu
  - Morgridge Institute for Research, Madison, WI 53715, USA
- Ron Stewart
  - Morgridge Institute for Research, Madison, WI 53715, USA
- Anthony Gitter
  - Morgridge Institute for Research, Madison, WI 53715, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin - Madison, Madison, WI 53792, USA.
5
Schlamp F, Delbare SYN, Early AM, Wells MT, Basu S, Clark AG. Dense time-course gene expression profiling of the Drosophila melanogaster innate immune response. BMC Genomics 2021; 22:304. [PMID: 33902461 PMCID: PMC8074482 DOI: 10.1186/s12864-021-07593-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 09/28/2020] [Accepted: 04/09/2021] [Indexed: 12/19/2022] Open
Abstract
BACKGROUND: Immune responses need to be initiated rapidly, and maintained as needed, to prevent establishment and growth of infections. At the same time, resources need to be balanced with other physiological processes. On the level of transcription, studies have shown that this balancing act is reflected in tight control of the initiation kinetics and shutdown dynamics of specific immune genes.
RESULTS: To investigate genome-wide expression dynamics and trade-offs after infection at a high temporal resolution, we performed an RNA-seq time course on D. melanogaster with 20 time points post Imd stimulation. A combination of methods, including spline fitting, cluster analysis, and Granger causality inference, allowed detailed dissection of expression profiles, lead-lag interactions, and functional annotation of genes through guilt-by-association. We identified Imd-responsive genes and co-expressed, less well characterized genes, with an immediate-early response and sustained up-regulation up to 5 days after stimulation. In contrast, stress response and Toll-responsive genes, among which were Bomanins, demonstrated early and transient responses. We further observed a strong trade-off with metabolic genes, which strikingly recovered to pre-infection levels before the immune response was fully resolved.
CONCLUSIONS: This high-dimensional dataset enabled the comprehensive study of immune response dynamics through the parallel application of multiple temporal data analysis methods. The well annotated data set should also serve as a useful resource for further investigation of the D. melanogaster innate immune response, and for the development of methods for analysis of a post-stress transcriptional response time-series at whole-genome scale.
Affiliation(s)
- Florencia Schlamp
  - Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA.
- Angela M Early
  - Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
- Martin T Wells
  - Statistics and Data Science, Cornell University, Ithaca, NY, USA
- Sumanta Basu
  - Statistics and Data Science, Cornell University, Ithaca, NY, USA.
- Andrew G Clark
  - Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA.
  - Statistics and Data Science, Cornell University, Ithaca, NY, USA.
6
Barraquand F, Picoche C, Detto M, Hartig F. Inferring species interactions using Granger causality and convergent cross mapping. THEOR ECOL-NETH 2020. [DOI: 10.1007/s12080-020-00482-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 11/30/2022]
7
Li M, Zheng R, Li Y, Wu FX, Wang J. MGT-SM: A Method for Constructing Cellular Signal Transduction Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:417-424. [PMID: 28541220 DOI: 10.1109/tcbb.2017.2705143] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Indexed: 06/07/2023]
Abstract
A cellular signal transduction network is an important means to describe biological responses to environmental stimuli and exchange of biological signals. Constructing the cellular signal transduction network provides an important basis for the study of the biological activities, the mechanism of the diseases, drug targets and so on. The statistical approaches to network inference are popular in literature. Granger test has been used as an effective method for causality inference. Compared with bivariate granger tests, multivariate granger tests reduce the indirect causality and were used widely for the construction of cellular signal transduction networks. A multivariate Granger test requires that the number of time points in the time-series data is more than the number of nodes involved in the network. However, there are many real datasets with a few time points which are much less than the number of nodes in the network. In this study, we propose a new multivariate Granger test-based framework to construct cellular signal transduction network, called MGT-SM. Our MGT-SM uses SVD to compute the coefficient matrix from gene expression data and adopts Monte Carlo simulation to estimate the significance of directed edges in the constructed networks. We apply the proposed MGT-SM to Yeast Synthetic Network and MDA-MB-468, and evaluate its performance in terms of the recall and the AUC. The results show that MGT-SM achieves better results, compared with other popular methods (CGC2SPR, PGC, and DBN).
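For orientation, the basic bivariate Granger test that the abstract contrasts with multivariate variants can be sketched as follows. This is a minimal illustration, not the MGT-SM method: it compares an autoregression of the target on its own lags against one that also includes lags of the candidate cause, and it omits the SVD-based coefficient estimation and Monte Carlo significance step described above. The function names are hypothetical.

```python
import numpy as np

def lagged_design(series, lag):
    """Columns series[t-1], ..., series[t-lag] for t = lag .. n-1."""
    n = len(series)
    return np.column_stack([series[lag - k : n - k] for k in range(1, lag + 1)])

def granger_f_stat(cause, effect, lag=1):
    """F-statistic for 'lags of cause improve the prediction of effect'."""
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    target = effect[lag:]
    intercept = np.ones((len(target), 1))
    # Restricted model: the target's own past only.
    restricted = np.hstack([intercept, lagged_design(effect, lag)])
    # Full model: own past plus the candidate cause's past.
    full = np.hstack([restricted, lagged_design(cause, lag)])

    def rss(design):
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_restricted, rss_full = rss(restricted), rss(full)
    dof = len(target) - full.shape[1]
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)
```

The full model needs more observations than parameters (`dof > 0`), which is the time-point constraint the abstract says becomes limiting when all network nodes are included as regressors.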
8
Maeda K. Gene grouping strategy for network modeling from a small time-series dataset: An illustrative analysis of human organogenesis. Biosystems 2019; 179:24-29. [PMID: 30797967 DOI: 10.1016/j.biosystems.2019.02.008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2019] [Revised: 02/19/2019] [Accepted: 02/20/2019] [Indexed: 10/27/2022]
Abstract
Several algorithms have been proposed for modeling a gene regulatory network from a time-series expression dataset, but these have been used in relatively few studies because experimental cost often restricts the number of sampling time points to less than that of genes by more than one order of magnitude. In order to reduce the number of parameters for network modeling, we propose a method for grouping genes by both temporal expression pattern and biological function, modeling interactions between the gene groups by a dynamic Bayesian network approach. Results from applying the method to a gene expression dataset on human organogenesis demonstrate that more biologically plausible results can be obtained by modeling an interaction network for groups of genes than by modeling that for single genes.
Affiliation(s)
- Kiyohiro Maeda
  - Imaging Technology Center, Fujifilm Corporation, 798 Miyanodai, Kaisei-machi, Ashigarakami-gun, Kanagawa, 258-8538, Japan.
9
Causal discovery from sequential data in ALS disease based on entropy criteria. J Biomed Inform 2019; 89:41-55. [DOI: 10.1016/j.jbi.2018.10.004] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 05/30/2018] [Revised: 10/14/2018] [Accepted: 10/15/2018] [Indexed: 11/21/2022]
10
Temporal genetic association and temporal genetic causality methods for dissecting complex networks. Nat Commun 2018; 9:3980. [PMID: 30266904 PMCID: PMC6162292 DOI: 10.1038/s41467-018-06203-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/17/2017] [Accepted: 08/23/2018] [Indexed: 12/27/2022] Open
Abstract
A large amount of panomic data has been generated in populations for understanding causal relationships in complex biological systems. Both genetic and temporal models can be used to establish causal relationships among molecular, cellular, or phenotypical traits, but with limitations. To fully utilize high-dimension temporal and genetic data, we develop a multivariate polynomial temporal genetic association (MPTGA) approach for detecting temporal genetic loci (teQTLs) of quantitative traits monitored over time in a population and a temporal genetic causality test (TGCT) for inferring causal relationships between traits linked to the locus. We apply MPTGA and TGCT to simulated data sets and a yeast F2 population in response to rapamycin, and demonstrate increased power to detect teQTLs. We identify a teQTL hotspot locus interacting with rapamycin treatment, infer putative causal regulators of the teQTL hotspot, and experimentally validate RRD1 as the causal regulator for this teQTL hotspot.
11
Li S, Xiao Y, Zhou D, Cai D. Causal inference in nonlinear systems: Granger causality versus time-delayed mutual information. Phys Rev E 2018; 97:052216. [PMID: 29906860 DOI: 10.1103/physreve.97.052216] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Received: 01/16/2018] [Indexed: 01/17/2023]
Abstract
The Granger causality (GC) analysis has been extensively applied to infer causal interactions in dynamical systems arising from economy and finance, physics, bioinformatics, neuroscience, social science, and many other fields. In the presence of potential nonlinearity in these systems, the validity of the GC analysis in general is questionable. To illustrate this, here we first construct minimal nonlinear systems and show that the GC analysis fails to infer causal relations in these systems: it gives rise to all types of incorrect causal directions. In contrast, we show that the time-delayed mutual information (TDMI) analysis is able to successfully identify the direction of interactions underlying these nonlinear systems. We then apply both methods to neuroscience data collected from experiments and demonstrate that the TDMI analysis but not the GC analysis can identify the direction of interactions among neuronal signals. Our work exemplifies inference hazards in the GC analysis in nonlinear systems and suggests that the TDMI analysis can be an appropriate tool in such a case.
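A rough sketch of time-delayed mutual information, assuming a simple equal-width histogram estimator (the paper's estimator and test systems may differ; the function names and the default bin count are illustrative choices):

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Plug-in mutual information (in nats) from a 2D histogram."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal of b, shape (1, bins)
    mask = p_ab > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[mask] * np.log(p_ab[mask] / (p_a @ p_b)[mask])))

def tdmi(source, target, delay=1, bins=8):
    """Mutual information between source(t) and target(t + delay), delay >= 1."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    return mutual_information(source[:-delay], target[delay:], bins=bins)
```

Comparing `tdmi(x, y, d)` with `tdmi(y, x, d)` over a range of delays gives a direction estimate that does not rely on a linear model. The plug-in estimator is biased upward for small samples, so values near zero should be judged against a shuffled baseline.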
Affiliation(s)
- Songting Li
  - Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA
- Yanyang Xiao
  - Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA and NYUAD Institute, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Douglas Zhou
  - School of Mathematical Sciences, MOE-LSC, and Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
- David Cai
  - Courant Institute of Mathematical Sciences, New York University, New York, New York 10012, USA; NYUAD Institute, New York University Abu Dhabi, Abu Dhabi, United Arab Emirates; and School of Mathematical Sciences, MOE-LSC, and Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
12
Carlin DE, Paull EO, Graim K, Wong CK, Bivol A, Ryabinin P, Ellrott K, Sokolov A, Stuart JM. Prophetic Granger Causality to infer gene regulatory networks. PLoS One 2017; 12:e0170340. [PMID: 29211761 PMCID: PMC5718405 DOI: 10.1371/journal.pone.0170340] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 07/21/2016] [Accepted: 10/26/2017] [Indexed: 01/09/2023] Open
Abstract
We introduce a novel method called Prophetic Granger Causality (PGC) for inferring gene regulatory networks (GRNs) from protein-level time series data. The method uses an L1-penalized regression adaptation of Granger Causality to model protein levels as a function of time, stimuli, and other perturbations. When combined with a data-independent network prior, the framework outperformed all other methods submitted to the HPN-DREAM 8 breast cancer network inference challenge. Our investigations reveal that PGC provides complementary information to other approaches, raising the performance of ensemble learners, while on its own achieves moderate performance. Thus, PGC serves as a valuable new tool in the bioinformatics toolkit for analyzing temporal datasets. We investigate the general and cell-specific interactions predicted by our method and find several novel interactions, demonstrating the utility of the approach in charting new tumor wiring.
Affiliation(s)
- Daniel E. Carlin
  - University of California San Diego, Department of Medicine, La Jolla, CA, United States of America
- Evan O. Paull
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Kiley Graim
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Christopher K. Wong
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Adrian Bivol
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Peter Ryabinin
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Kyle Ellrott
  - Oregon Health Sciences University, Department of Biomedical Engineering, Portland, OR, United States of America
- Artem Sokolov
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
- Joshua M. Stuart
  - University of California Santa Cruz, Department of Biomolecular Engineering, Santa Cruz, CA, United States of America
13
Stable Gene Regulatory Network Modeling From Steady-State Data. Bioengineering (Basel) 2016; 3:bioengineering3020012. [PMID: 28952574 PMCID: PMC5597136 DOI: 10.3390/bioengineering3020012] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 11/19/2015] [Revised: 03/09/2016] [Accepted: 04/06/2016] [Indexed: 12/19/2022] Open
Abstract
Gene regulatory networks represent an abstract mapping of gene regulations in living cells. They aim to capture dependencies among molecular entities such as transcription factors, proteins and metabolites. In most applications, the regulatory network structure is unknown, and has to be reverse engineered from experimental data consisting of expression levels of the genes usually measured as messenger RNA concentrations in microarray experiments. Steady-state gene expression data are obtained from measurements of the variations in expression activity following the application of small perturbations to equilibrium states in genetic perturbation experiments. In this paper, the least absolute shrinkage and selection operator-vector autoregressive (LASSO-VAR) originally proposed for the analysis of economic time series data is adapted to include a stability constraint for the recovery of a sparse and stable regulatory network that describes data obtained from noisy perturbation experiments. The approach is applied to real experimental data obtained for the SOS pathway in Escherichia coli and the cell cycle pathway for yeast Saccharomyces cerevisiae. Significant features of this method are the ability to recover networks without inputting prior knowledge of the network topology, and the ability to be efficiently applied to large scale networks due to the convex nature of the method.
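To make the LASSO-VAR idea concrete, here is a minimal sketch of a first-order vector autoregression fitted with an L1 penalty by proximal gradient descent (ISTA). It is an illustration under simplifying assumptions, not the paper's algorithm: the stability constraint is not enforced during fitting and is only checked afterwards through the spectral radius. The function names, the penalty value, and the iteration count are hypothetical.

```python
import numpy as np

def soft_threshold(m, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(m) * np.maximum(np.abs(m) - t, 0.0)

def lasso_var1(series, penalty=0.05, n_iter=500):
    """Fit x[t] ~ A @ x[t-1] with an L1 penalty on the entries of A.

    series: array of shape (T, p), one row per time point.
    Minimizes (1/2n) * ||future - past @ A.T||_F^2 + penalty * ||A||_1.
    """
    past, future = series[:-1], series[1:]
    n, p = past.shape
    gram = past.T @ past / n
    cross = past.T @ future / n
    # Step size 1/L, where L is the largest eigenvalue of the Gram matrix.
    step = 1.0 / np.linalg.eigvalsh(gram).max()
    coef = np.zeros((p, p))  # holds A transposed
    for _ in range(n_iter):
        grad = gram @ coef - cross
        coef = soft_threshold(coef - step * grad, step * penalty)
    return coef.T

def spectral_radius(a):
    """Largest eigenvalue magnitude; a VAR(1) model is stable if this is < 1."""
    return float(np.max(np.abs(np.linalg.eigvals(a))))
```

Entry `A[i, j]` of the returned matrix is the estimated effect of gene `j` at time `t-1` on gene `i` at time `t`; entries shrunk exactly to zero correspond to absent edges in the inferred network.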
14
Lo LY, Wong ML, Lee KH, Leung KS. Time Delayed Causal Gene Regulatory Network Inference with Hidden Common Causes. PLoS One 2015; 10:e0138596. [PMID: 26394325 PMCID: PMC4578777 DOI: 10.1371/journal.pone.0138596] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Received: 03/03/2015] [Accepted: 09/01/2015] [Indexed: 01/07/2023] Open
Abstract
Inferring the gene regulatory network (GRN) is crucial to understanding the working of the cell. Many computational methods attempt to infer the GRN from time series expression data, instead of through expensive and time-consuming experiments. However, existing methods make the convenient but unrealistic assumption of causal sufficiency, i.e. all the relevant factors in the causal network have been observed and there are no unobserved common causes. In principle, in the real world, it is impossible to be certain that all relevant factors or common causes have been observed, because some factors may not have been conceived of, and therefore are impossible to measure. In view of this, we have developed a novel algorithm named HCC-CLINDE to infer a GRN from time series data allowing the presence of hidden common cause(s). We assume there is a sparse causal graph (possibly with cycles) of interest, where the variables are continuous and each causal link has a delay (possibly more than one time step). A small but unknown number of variables are not observed. Each unobserved variable has only observed variables as children and parents, with at least two children, and the children are not linked to each other. Since it is difficult to obtain very long time series, our algorithm is also capable of utilizing multiple short time series, which is more realistic. To our knowledge, our algorithm is far less restrictive than previous works. We have performed extensive experiments using synthetic data on GRNs of size up to 100, with up to 10 hidden nodes. The results show that our algorithm can adequately recover the true causal GRN and is robust to slight deviation from Gaussian distribution in the error terms. We have also demonstrated the potential of our algorithm on small YEASTRACT subnetworks using limited real data.
Affiliation(s)
- Leung-Yau Lo
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, Hong Kong
| | - Man-Leung Wong
- Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, Hong Kong
| | - Kin-Hong Lee
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, Hong Kong
| | - Kwong-Sak Leung
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, Hong Kong
|
15
|
Lo LY, Leung KS, Lee KH. Inferring Time-Delayed Causal Gene Network Using Time-Series Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1169-1182. [PMID: 26451828 DOI: 10.1109/tcbb.2015.2394442] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/05/2023]
Abstract
Inferring gene regulatory network (GRN) from the microarray expression data is an important problem in Bioinformatics, because knowing the GRN is an essential first step in understanding the inner workings of the cell and the related diseases. Time delays exist in the regulatory effects from one gene to another due to the time needed for transcription, translation, and to accumulate a sufficient number of needed proteins. Also, it is known that the delays are important for oscillatory phenomenon. Therefore, it is crucial to develop a causal gene network model, preferably as a function of time. In this paper, we propose an algorithm CLINDE to infer causal directed links in GRN with time delays and regulatory effects in the links from time-series microarray gene expression data. It is one of the most comprehensive in terms of features compared to the state-of-the-art discrete gene network models. We have tested CLINDE on synthetic data, the in vivo IRMA (On and Off) datasets and the [1] yeast expression data validated using KEGG pathways. Results show that CLINDE can effectively recover the links, the time delays and the regulatory effects in the synthetic data, and outperforms other algorithms in the IRMA in vivo datasets.
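As a rough illustration of why time delays matter in such models, the sketch below estimates a single delay between two expression profiles by scanning lagged Pearson correlations. It is not the CLINDE algorithm; the function names `pearson` and `best_delay` and the toy series are assumptions made for this example.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def best_delay(x, y, max_delay):
    """Return (delay, correlation) maximising |corr(x[t], y[t + delay])|."""
    scores = {d: pearson(x[:-d], y[d:]) for d in range(1, max_delay + 1)}
    d = max(scores, key=lambda k: abs(scores[k]))
    return d, scores[d]

# Toy data (hypothetical): y is a repressed copy of x, delayed by 3 time steps.
x = [math.sin(0.7 * t) + 0.3 * math.sin(2.1 * t) for t in range(60)]
y = [0.0] * 3 + [-0.8 * v for v in x[:-3]]

delay, corr = best_delay(x, y, max_delay=6)
print(delay, round(corr, 3))
```

The sign of the correlation at the selected delay hints at activation versus repression, but a pairwise scan like this cannot separate direct from indirect links.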
|
16
|
Yao S, Yoo S, Yu D. Prior knowledge driven Granger causality analysis on gene regulatory network discovery. BMC Bioinformatics 2015; 16:273. [PMID: 26316173 PMCID: PMC4551367 DOI: 10.1186/s12859-015-0710-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 03/20/2015] [Accepted: 08/17/2015] [Indexed: 12/20/2022]
Abstract
Background Our study focuses on discovering gene regulatory networks from time series gene expression data using the Granger causality (GC) model. However, the number of available time points (T) usually is much smaller than the number of target genes (n) in biological datasets. The widely applied pairwise GC model (PGC) and other regularization strategies can lead to a significant number of false identifications when n>>T. Results In this study, we proposed a new method, viz., CGC-2SPR (CGC using two-step prior Ridge regularization) to resolve the problem by incorporating prior biological knowledge about a target gene data set. In our simulation experiments, the proposed new methodology CGC-2SPR showed significant performance improvement in terms of accuracy over other widely used GC modeling (PGC, Ridge and Lasso) and MI-based (MRNET and ARACNE) methods. In addition, we applied CGC-2SPR to a real biological dataset, i.e., the yeast metabolic cycle, and discovered more true positive edges with CGC-2SPR than with the other existing methods. Conclusions In our research, we noticed a “1+1>2” effect when we combined prior knowledge and gene expression data to discover regulatory networks. Based on causality networks, we made a functional prediction that the Abm1 gene (its functions previously were unknown) might be related to the yeast’s responses to different levels of glucose. Our research improves causality modeling by combining heterogeneous knowledge, which is well aligned with the future direction in system biology. Furthermore, we proposed a method of Monte Carlo significance estimation (MCSE) to calculate the edge significances which provide statistical meanings to the discovered causality networks. All of our data and source codes will be available under the link https://bitbucket.org/dtyu/granger-causality/wiki/Home.
Affiliation(s)
- Shun Yao
- Department of Biochemistry and Cell Biology, Stony Brook University, Stony Brook, 11790, NY, USA; Computational Science Center, Brookhaven National Laboratory, Upton, 11793, NY, USA.
| | - Shinjae Yoo
- Computational Science Center, Brookhaven National Laboratory, Upton, 11793, NY, USA.
| | - Dantong Yu
- Computational Science Center, Brookhaven National Laboratory, Upton, 11793, NY, USA.
|
17
|
A tutorial to identify nonlinear associations in gene expression time series data. Methods Mol Biol 2014; 1164:87-95. [PMID: 24927837 DOI: 10.1007/978-1-4939-0805-9_8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2023]
Abstract
The study of gene regulatory networks is the basis to understand the biological complexity of several diseases and/or cell states. It has become the core of research in the field of systems biology. Several mathematical methods have been developed in the last decade, especially in the analysis of time series gene expression data derived from microarrays and sequencing-based methods. Most of the models available in the literature assume linear associations among genes and either do not infer directionality in these connections or use a priori biological knowledge to set the directionality. However, in several cases, a priori biological information is not available. In this context, we describe a statistical method, namely a nonlinear vector autoregressive model, to estimate nonlinear relationships and also to infer directionality at the edges of the network by using the temporal information of the time series gene expression data without a priori biological information.
|
18
|
CaSPIAN: a causal compressive sensing algorithm for discovering directed interactions in gene networks. PLoS One 2014; 9:e90781. [PMID: 24622336 PMCID: PMC3951243 DOI: 10.1371/journal.pone.0090781] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 12/04/2013] [Accepted: 02/05/2014] [Indexed: 11/21/2022]
Abstract
We introduce a novel algorithm for inference of causal gene interactions, termed CaSPIAN (Causal Subspace Pursuit for Inference and Analysis of Networks), which is based on coupling compressive sensing and Granger causality techniques. The core of the approach is to discover sparse linear dependencies between shifted time series of gene expressions using a sequential list-version of the subspace pursuit reconstruction algorithm and to estimate the direction of gene interactions via Granger-type elimination. The method is conceptually simple and computationally efficient, and it allows for dealing with noisy measurements. Its performance as a stand-alone platform without biological side-information was tested on simulated networks, on the synthetic IRMA network in Saccharomyces cerevisiae, and on data pertaining to the human HeLa cell network and the SOS network in E. coli. The results produced by CaSPIAN are compared to the results of several related algorithms, demonstrating significant improvements in inference accuracy of documented interactions. These findings highlight the importance of Granger causality techniques for reducing the number of false-positives, as well as the influence of noise and sampling period on the accuracy of the estimates. In addition, the performance of the method was tested in conjunction with biological side information of the form of sparse “scaffold networks”, to which new edges were added using available RNA-seq or microarray data. These biological priors aid in increasing the sensitivity and precision of the algorithm in the small sample regime.
|
19
|
Inferring regulatory networks by combining perturbation screens and steady state gene expression profiles. PLoS One 2014; 9:e82393. [PMID: 24586224 PMCID: PMC3938831 DOI: 10.1371/journal.pone.0082393] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 05/26/2012] [Accepted: 11/01/2013] [Indexed: 11/19/2022]
Abstract
Reconstructing transcriptional regulatory networks is an important task in functional genomics. Data obtained from experiments that perturb genes by knockouts or RNA interference contain useful information for addressing this reconstruction problem. However, such data can be limited in size and/or are expensive to acquire. On the other hand, observational data of the organism in steady state (e.g., wild-type) are more readily available, but their informational content is inadequate for the task at hand. We develop a computational approach to appropriately utilize both data sources for estimating a regulatory network. The proposed approach is based on a three-step algorithm to estimate the underlying directed but cyclic network, that uses as input both perturbation screens and steady state gene expression data. In the first step, the algorithm determines causal orderings of the genes that are consistent with the perturbation data, by combining an exhaustive search method with a fast heuristic that in turn couples a Monte Carlo technique with a fast search algorithm. In the second step, for each obtained causal ordering, a regulatory network is estimated using a penalized likelihood based method, while in the third step a consensus network is constructed from the highest scored ones. Extensive computational experiments show that the algorithm performs well in reconstructing the underlying network and clearly outperforms competing approaches that rely only on a single data source. Further, it is established that the algorithm produces a consistent estimate of the regulatory network.
|
20
|
Time-series RNA-seq analysis package (TRAP) and its application to the analysis of rice, Oryza sativa L. ssp. Japonica, upon drought stress. Methods 2014; 67:364-72. [PMID: 24518221 DOI: 10.1016/j.ymeth.2014.02.001] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 11/18/2013] [Revised: 01/17/2014] [Accepted: 02/01/2014] [Indexed: 12/19/2022]
Abstract
Measuring expression levels of genes at the whole genome level can be useful for many purposes, especially for revealing biological pathways underlying specific phenotype conditions. When gene expression is measured over a time period, we have opportunities to understand how organisms react to stress conditions over time. Thus many biologists routinely measure whole genome level gene expressions at multiple time points. However, there are several technical difficulties for analyzing such whole genome expression data. In addition, these days gene expression data is often measured by using RNA-sequencing rather than microarray technologies and then analysis of expression data is much more complicated since the analysis process should start with mapping short reads and produce differentially activated pathways and also possibly interactions among pathways. In addition, many useful tools for analyzing microarray gene expression data are not applicable for the RNA-seq data. Thus a comprehensive package for analyzing time series transcriptome data is much needed. In this article, we present a comprehensive package, Time-series RNA-seq Analysis Package (TRAP), integrating all necessary tasks such as mapping short reads, measuring gene expression levels, finding differentially expressed genes (DEGs), clustering and pathway analysis for time-series data in a single environment. In addition to implementing useful algorithms that are not available for RNA-seq data, we extended existing pathway analysis methods, ORA and SPIA, for time series analysis and estimates statistical values for combined dataset by an advanced metric. TRAP also produces visual summary of pathway interactions. Gene expression change labeling, a practical clustering method used in TRAP, enables more accurate interpretation of the data when combined with pathway analysis. We applied our methods on a real dataset for the analysis of rice (Oryza sativa L. Japonica nipponbare) upon drought stress. The result showed that TRAP was able to detect pathways more accurately than several existing methods. TRAP is available at http://biohealth.snu.ac.kr/software/TRAP/.
|
21
|
Liu M, Cai R, Hu Y, Matheny ME, Sun J, Hu J, Xu H. Determining molecular predictors of adverse drug reactions with causality analysis based on structure learning. J Am Med Inform Assoc 2013; 21:245-51. [PMID: 24334612 DOI: 10.1136/amiajnl-2013-002051] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 11/04/2022]
Abstract
OBJECTIVE Adverse drug reaction (ADR) can have dire consequences. However, our current understanding of the causes of drug-induced toxicity is still limited. Hence it is of paramount importance to determine molecular factors of adverse drug responses so that safer therapies can be designed. METHODS We propose a causality analysis model based on structure learning (CASTLE) for identifying factors that contribute significantly to ADRs from an integration of chemical and biological properties of drugs. This study aims to address two major limitations of the existing ADR prediction studies. First, ADR prediction is mostly performed by assessing the correlations between the input features and ADRs, and the identified associations may not indicate causal relations. Second, most predictive models lack biological interpretability. RESULTS CASTLE was evaluated in terms of prediction accuracy on 12 organ-specific ADRs using 830 approved drugs. The prediction was carried out by first extracting causal features with structure learning and then applying them to a support vector machine (SVM) for classification. Through rigorous experimental analyses, we observed significant increases in both macro and micro F1 scores compared with the traditional SVM classifier, from 0.88 to 0.89 and 0.74 to 0.81, respectively. Most importantly, identified links between the biological factors and organ-specific drug toxicities were partially supported by evidence in Online Mendelian Inheritance in Man. CONCLUSIONS The proposed CASTLE model not only performed better in prediction than the baseline SVM but also produced more interpretable results (ie, biological factors responsible for ADRs), which is critical to discovering molecular activators of ADRs.
Affiliation(s)
- Mei Liu
- Department of Computer Science, New Jersey Institute of Technology, Newark, New Jersey, USA
|
22
|
Simcha DM, Younes L, Aryee MJ, Geman D. Identification of direction in gene networks from expression and methylation. BMC SYSTEMS BIOLOGY 2013; 7:118. [PMID: 24182195 PMCID: PMC4228359 DOI: 10.1186/1752-0509-7-118] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 09/17/2012] [Accepted: 10/17/2013] [Indexed: 01/27/2023]
Abstract
BACKGROUND Reverse-engineering gene regulatory networks from expression data is difficult, especially without temporal measurements or interventional experiments. In particular, the causal direction of an edge is generally not statistically identifiable, i.e., cannot be inferred as a statistical parameter, even from an unlimited amount of non-time series observational mRNA expression data. Some additional evidence is required and high-throughput methylation data can be viewed as a natural multifactorial gene perturbation experiment. RESULTS We introduce IDEM (Identifying Direction from Expression and Methylation), a method for identifying the causal direction of edges by combining DNA methylation and mRNA transcription data. We describe the circumstances under which edge directions become identifiable and experiments with both real and synthetic data demonstrate that the accuracy of IDEM for inferring both edge placement and edge direction in gene regulatory networks is significantly improved relative to other methods. CONCLUSION Reverse-engineering directed gene regulatory networks from static observational data becomes feasible by exploiting the context provided by high-throughput DNA methylation data. An implementation of the algorithm described is available at http://code.google.com/p/idem/.
Affiliation(s)
- David M Simcha
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.
|
23
|
Michailidis G, d'Alché-Buc F. Autoregressive models for gene regulatory network inference: sparsity, stability and causality issues. Math Biosci 2013; 246:326-34. [PMID: 24176667 DOI: 10.1016/j.mbs.2013.10.003] [Citation(s) in RCA: 49] [Impact Index Per Article: 4.1] [Received: 01/25/2013] [Revised: 10/09/2013] [Accepted: 10/14/2013] [Indexed: 10/26/2022]
Abstract
Reconstructing gene regulatory networks from high-throughput measurements represents a key problem in functional genomics. It also represents a canonical learning problem and thus has attracted a lot of attention in both the informatics and the statistical learning literature. Numerous approaches have been proposed, ranging from simple clustering to rather involved dynamic Bayesian network modeling, as well as hybrid ones that combine a number of modeling steps, such as employing ordinary differential equations coupled with genome annotation. These approaches are tailored to the type of data being employed. Available data sources include static steady state data and time course data obtained either for wild type phenotypes or from perturbation experiments. This review focuses on the class of autoregressive models using time course data for inferring gene regulatory networks. The central themes of sparsity, stability and causality are discussed as well as the ability to integrate prior knowledge for successful use of these models for the learning task at hand.
Affiliation(s)
- George Michailidis
- Department of Statistics, University of Michigan, Ann Arbor, MI 48109-1107, USA
|
24
|
Tam GHF, Chang C, Hung YS. Gene regulatory network discovery using pairwise Granger causality. IET Syst Biol 2013; 7:195-204. [PMID: 24067420 PMCID: PMC8687252 DOI: 10.1049/iet-syb.2012.0063] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 12/07/2012] [Revised: 05/16/2013] [Accepted: 06/19/2013] [Indexed: 08/12/2023]
Abstract
Discovery of gene regulatory networks from gene expression data can yield useful insight for drug development. Among the methods applied to time‐series data, Granger causality (GC) has emerged as a powerful tool with several merits. Since gene expression data usually have a much larger number of genes than time points, a full model cannot be applied in a straightforward manner, so GC is often applied to genes pairwise. In this study, the authors first investigate with synthetic data how spurious causalities (false discoveries) may arise because of the use of pairwise rather than full‐model GC detection. Furthermore, spurious causalities may also arise if the order of the vector autoregressive model is not high enough. As a remedy, the authors demonstrate that model validation techniques can effectively reduce the number of false discoveries. Then, they apply pairwise GC with model validation to the real human HeLa cell‐cycle dataset. They find that the Akaike information criterion is generally most suitable for determining model order, but precaution should be taken for extremely short time series. With the authors' proposed implementation, degree distributions and network hubs are obtained and compared with existing results, giving a new observation that the hubs tend to act as sources rather than receivers of interactions.
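The pairwise Granger test described above can be sketched as a comparison of two least-squares fits: one predicting a gene from its own lags, and one that also uses the lags of a candidate regulator. The code below is a minimal pure-Python sketch under that assumption, not the authors' implementation; `solve`, `rss`, `granger_f` and the simulated series are hypothetical names and data introduced for illustration.

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for c in reversed(range(n)):
        z[c] = (M[c][n] - sum(M[c][k] * z[k] for k in range(c + 1, n))) / M[c][c]
    return z

def rss(rows, target):
    """Residual sum of squares of an ordinary least-squares fit."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, t in zip(rows, target))

def granger_f(x, y, order):
    """F statistic for 'x Granger-causes y' using `order` lags of each series."""
    times = range(order, len(y))
    target = [y[t] for t in times]
    # Restricted model: intercept plus y's own lags.
    own = [[1.0] + [y[t - lag] for lag in range(1, order + 1)] for t in times]
    # Unrestricted model: the same regressors plus the lags of x.
    full = [r + [x[t - lag] for lag in range(1, order + 1)]
            for r, t in zip(own, times)]
    rss_r, rss_u = rss(own, target), rss(full, target)
    dof = len(target) - (2 * order + 1)
    return ((rss_r - rss_u) / order) / (rss_u / dof)

# Simulated pair (hypothetical): x drives y with a one-step lag, not vice versa.
random.seed(0)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 0.1))

f_xy = granger_f(x, y, order=1)
f_yx = granger_f(y, x, order=1)
print(f_xy > f_yx)
```

Turning the statistic into a p-value requires the F distribution with (order, dof) degrees of freedom, for example through `scipy.stats.f.sf` if SciPy is available. Choosing `order` by an information criterion such as AIC is left out of this sketch.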
Affiliation(s)
- Gary Hak Fui Tam
- Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Chunqi Chang
- School of Electronic and Information Engineering, Soochow University, Suzhou, Jiangsu Province, China
| | - Yeung Sam Hung
- Department of Electrical and Electronic Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
|
25
|
Cai R, Zhang Z, Hao Z. Causal gene identification using combinatorial V-structure search. Neural Netw 2013; 43:63-71. [PMID: 23500501 DOI: 10.1016/j.neunet.2013.01.025] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 05/15/2012] [Revised: 01/23/2013] [Accepted: 01/31/2013] [Indexed: 10/27/2022]
Abstract
With the advances of biomedical techniques in the last decade, the costs of human genomic sequencing and genomic activity monitoring are coming down rapidly. To support the huge genome-based business in the near future, researchers are eager to find killer applications based on human genome information. Causal gene identification is one of the most promising applications, which may help potential patients to estimate the risk of certain genetic diseases and locate the target gene for further genetic therapy. Unfortunately, existing pattern recognition techniques, such as Bayesian networks, cannot be directly applied to find the accurate causal relationship between genes and diseases. This is mainly due to the insufficient number of samples and the extremely high dimensionality of the gene space. In this paper, we present the first practical solution to causal gene identification, utilizing a new combinatorial formulation over V-Structures commonly used in conventional Bayesian networks, by exploring the combinations of significant V-Structures. We prove the NP-hardness of the combinatorial search problem under general settings on the significance measure on the V-Structures, and present a greedy algorithm to find sub-optimal results. Extensive experiments show that our proposal is both scalable and effective, particularly with interesting findings on the causal genes over real human genome data.
Affiliation(s)
- Ruichu Cai
- Faculty of Computer Science, Guangdong University of Technology, Guangzhou, PR China.
|
26
|
Yang B, Chen Y, Jiang M. Reverse engineering of gene regulatory networks using flexible neural tree models. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2012.07.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 10/28/2022]
|
27
|
|
28
|
Functional clustering of time series gene expression data by Granger causality. BMC SYSTEMS BIOLOGY 2012; 6:137. [PMID: 23107425 PMCID: PMC3573927 DOI: 10.1186/1752-0509-6-137] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Received: 10/14/2011] [Accepted: 10/17/2012] [Indexed: 12/04/2022]
Abstract
Background A common approach for time series gene expression data analysis includes the clustering of genes with similar expression patterns throughout time. Clustered gene expression profiles point to the joint contribution of groups of genes to a particular cellular process. However, since genes belong to intricate networks, other features, besides comparable expression patterns, should provide additional information for the identification of functionally similar genes. Results In this study we perform gene clustering through the identification of Granger causality between and within sets of time series gene expression data. Granger causality is based on the idea that the cause of an event cannot come after its consequence. Conclusions This kind of analysis can be used as a complementary approach for functional clustering, wherein genes would be clustered not solely based on their expression similarity but on their topological proximity built according to the intensity of Granger causality among them.
|
29
|
Bar-Joseph Z, Gitter A, Simon I. Studying and modelling dynamic biological processes using time-series gene expression data. Nat Rev Genet 2012; 13:552-64. [PMID: 22805708 DOI: 10.1038/nrg3244] [Citation(s) in RCA: 318] [Impact Index Per Article: 24.5] [Indexed: 12/19/2022]
Abstract
Biological processes are often dynamic, thus researchers must monitor their activity at multiple time points. The most abundant source of information regarding such dynamic activity is time-series gene expression data. These data are used to identify the complete set of activated genes in a biological process, to infer their rates of change, their order and their causal effects and to model dynamic systems in the cell. In this Review we discuss the basic patterns that have been observed in time-series experiments, how these patterns are combined to form expression programs, and the computational analysis, visualization and integration of these data to infer models of dynamic biological systems.
Affiliation(s)
- Ziv Bar-Joseph
- Lane Center for Computational Biology and Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA.
|
30
|
De la Fuente IM, Cortes JM. Quantitative analysis of the effective functional structure in yeast glycolysis. PLoS One 2012; 7:e30162. [PMID: 22393350 PMCID: PMC3290614 DOI: 10.1371/journal.pone.0030162] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 05/06/2011] [Accepted: 12/11/2011] [Indexed: 11/28/2022]
Abstract
The understanding of the effective functionality that governs the enzymatic self-organized processes in cellular conditions is a crucial topic in the post-genomic era. In recent studies, Transfer Entropy has been proposed as a rigorous, robust and self-consistent method for the causal quantification of the functional information flow among nonlinear processes. Here, in order to quantify the functional connectivity for the glycolytic enzymes in dissipative conditions we have analyzed different catalytic patterns using the technique of Transfer Entropy. The data were obtained by means of a yeast glycolytic model formed by three delay differential equations where the enzymatic rate equations of the irreversible stages have been explicitly considered. These enzymatic activity functions were previously modeled and tested experimentally by other different groups. The results show the emergence of a new kind of dynamical functional structure, characterized by changing connectivity flows and a metabolic invariant that constrains the activity of the irreversible enzymes. In addition to the classical topological structure characterized by the specific location of enzymes, substrates, products and feedback-regulatory metabolites, an effective functional structure emerges in the modeled glycolytic system, which is dynamical and characterized by notable variations of the functional interactions. The dynamical structure also exhibits a metabolic invariant which constrains the functional attributes of the enzymes. Finally, in accordance with the classical biochemical studies, our numerical analysis reveals in a quantitative manner that the enzyme phosphofructokinase is the key-core of the metabolic system, behaving for all conditions as the main source of the effective causal flows in yeast glycolysis.
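For readers unfamiliar with Transfer Entropy, the sketch below computes a plug-in estimate for two discrete series with a history length of one step. It is an illustration only and not the analysis pipeline of the study above; the function name `transfer_entropy` and the binary toy series are assumptions, and continuous measurements such as enzymatic activities would first need to be discretised or handled with a different estimator.

```python
import math
import random
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in transfer entropy from x to y in bits, history length 1.

    TE = sum p(y1, y0, x0) * log2( p(y1 | y0, x0) / p(y1 | y0) ),
    where y1 is the next value of y and y0, x0 are the current values.
    """
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))  # (y1, y0, x0)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))        # (y0, x0)
    pairs_yy = Counter(zip(y[1:], y[:-1]))         # (y1, y0)
    singles = Counter(y[:-1])                      # y0
    n = len(y) - 1
    te = 0.0
    for (y1, y0, x0), count in triples.items():
        p_given_both = count / pairs_yx[(y0, x0)]
        p_given_own = pairs_yy[(y1, y0)] / singles[y0]
        te += (count / n) * math.log2(p_given_both / p_given_own)
    return te

# Toy data (hypothetical): y simply repeats x one step later.
random.seed(1)
x = [random.randint(0, 1) for _ in range(2000)]
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)  # close to 1 bit: x fully determines the next y
te_yx = transfer_entropy(y, x)  # close to 0 bits: y adds nothing about the next x
print(round(te_xy, 2), round(te_yx, 2))
```

The asymmetry between the two directions is what makes the measure usable as an indicator of directed information flow; the plug-in estimator is biased upwards for short series.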
|
31
|
Shojaie A, Basu S, Michailidis G. Adaptive Thresholding for Reconstructing Regulatory Networks from Time-Course Gene Expression Data. STATISTICS IN BIOSCIENCES 2011. [DOI: 10.1007/s12561-011-9050-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 10/14/2022]
|
32
|
Kleinberg S, Hripcsak G. A review of causal inference for biomedical informatics. J Biomed Inform 2011; 44:1102-12. [PMID: 21782035 PMCID: PMC3219814 DOI: 10.1016/j.jbi.2011.07.001] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Received: 01/25/2011] [Revised: 05/16/2011] [Accepted: 07/04/2011] [Indexed: 11/26/2022]
Abstract
Causality is an important concept throughout the health sciences and is particularly vital for informatics work such as finding adverse drug events or risk factors for disease using electronic health records. While philosophers and scientists working for centuries on formalizing what makes something a cause have not reached a consensus, new methods for inference show that we can make progress in this area in many practical cases. This article reviews core concepts in understanding and identifying causality and then reviews current computational methods for inference and explanation, focusing on inference from large-scale observational data. While the problem is not fully solved, we show that graphical models and Granger causality provide useful frameworks for inference and that a more recent approach based on temporal logic addresses some of the limitations of these methods.
Affiliation(s)
- Samantha Kleinberg
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States.
|
33
|
FUJITA ANDRÉ, SATO JOÃO RICARDO, KOJIMA KANAME, GOMES LUCIANA RODRIGUES, NAGASAKI MASAO, SOGAYAR MARI CLEIDE, MIYANO SATORU. Identification of Granger causality between gene sets. J Bioinform Comput Biol 2011. [DOI: 10.1142/s0219720010004860] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
Wiener and Granger have introduced an intuitive concept of causality (Granger causality) between two variables which is based on the idea that an effect never occurs before its cause. Later, Geweke generalized this concept to a multivariate Granger causality, i.e. n variables Granger-cause another variable. Although Granger causality is not "effective causality" in the Aristotelian sense, this concept is useful to infer directionality and information flow in observational data. Granger causality is usually identified by using VAR (Vector Autoregressive) models due to their simplicity. In the last few years, several VAR-based models were presented in order to model gene regulatory networks. Here, we generalize the multivariate Granger causality concept in order to identify Granger causalities between sets of gene expressions, i.e. whether a set of n genes Granger-causes another set of m genes, aiming at identifying the flow of information between gene networks (or pathways). The concept of Granger causality for sets of variables is presented. Moreover, a method for its identification with a bootstrap test is proposed. This method is applied to simulated and to actual biological gene expression data in order to model regulatory networks. This concept may be useful for the understanding of the complete information flow from one network or pathway to the other, mainly in regulatory networks. Linking this concept to graph theory, sink and source can be generalized to node sets. Moreover, hub and centrality for sets of genes can be defined based on total information flow. Another application is in annotation, when the functionality of a set of genes is unknown, but this set is Granger-caused by another set of genes which is well studied. Therefore, this information may be useful to infer or construct some hypothesis about the unknown set of genes.
Affiliation(s)
- André Fujita
- Computational Science Research Program, RIKEN, 2-1, Hirosawa, Wako, Saitama, 351-0198, Japan
| | - João Ricardo Sato
- Center of Mathematics, Computation and Cognition, Universidade Federal do ABC, Rua Santa Adélia, 166 – Santo André, Brazil
| | - Kaname Kojima
- Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
| | - Luciana Rodrigues Gomes
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05508-900, Brazil
| | - Masao Nagasaki
- Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
| | - Mari Cleide Sogayar
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05508-900, Brazil
| | - Satoru Miyano
- Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
|
34
|
Fuente IMDL, Cortes JM, Perez-Pinilla MB, Ruiz-Rodriguez V, Veguillas J. The metabolic core and catalytic switches are fundamental elements in the self-regulation of the systemic metabolic structure of cells. PLoS One 2011; 6:e27224. [PMID: 22125607 PMCID: PMC3220688 DOI: 10.1371/journal.pone.0027224] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 10/03/2011] [Accepted: 10/12/2011] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Experimental observations and numerical studies with dissipative metabolic networks have shown that cellular enzymatic activity self-organizes spontaneously, leading to the emergence of a metabolic core formed by a set of enzymatic reactions that are always active under all environmental conditions, while the rest of the catalytic processes are only intermittently active. The reactions of the metabolic core are essential for biomass formation and for ensuring optimal metabolic performance. The on-off catalytic reactions and the metabolic core are essential elements of a Systemic Metabolic Structure, which seems to be a key feature common to all cellular organisms. METHODOLOGY/PRINCIPAL FINDINGS In order to investigate the functional importance of the metabolic core, we have studied different catalytic patterns of a dissipative metabolic network under different external conditions. The emerging biochemical data have been analysed using information-based dynamic tools, such as Pearson's correlation and Transfer Entropy (which measures effective functionality). Our results show that a functional structure of effective connectivity emerges which is dynamical and characterized by significant variations of bio-molecular information flows. CONCLUSIONS/SIGNIFICANCE We have quantified essential aspects of the metabolic core functionality. The always-active enzymatic reactions form a hub, with a high degree of effective connectivity, exhibiting a wide range of functional information values and able to act either as a source or as a sink of bio-molecular causal interactions. Likewise, we have found that the metabolic core is an essential part of an emergent functional structure characterized by catalytic modules and metabolic switches which allow critical transitions in enzymatic activity. Both the metabolic core and the catalytic switches, in which intermittently active enzymes are also involved, seem to be fundamental elements in the self-regulation of the Systemic Metabolic Structure.
|
35
|
Li Z, Li P, Krishnan A, Liu J. Large-scale dynamic gene regulatory network inference combining differential equation models with local dynamic Bayesian network analysis. Bioinformatics 2011; 27:2686-91. [PMID: 21816876 DOI: 10.1093/bioinformatics/btr454] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.9] [Indexed: 11/14/2022]
Abstract
MOTIVATION Reverse engineering gene regulatory networks, especially large networks, from time series gene expression data remains a challenge to the systems biology community. In this article, a new hybrid algorithm integrating ordinary differential equation models with dynamic Bayesian network analysis, called Differential Equation-based Local Dynamic Bayesian Network (DELDBN), was proposed and implemented for gene regulatory network inference. RESULTS The performance of DELDBN was benchmarked with an in vivo dataset from yeast. DELDBN significantly improved the accuracy and sensitivity of network inference compared with other approaches. The local causal discovery algorithm implemented in DELDBN also reduced the complexity of the network inference algorithm and improved its scalability to infer larger networks. We have demonstrated the applicability of the approach to a network containing thousands of genes with a dataset from human HeLa cell time series experiments. The local network around BRCA1 was investigated in particular and validated with independent published studies. The BRCA1 network was significantly enriched with known BRCA1-relevant interactions, indicating that DELDBN can effectively infer large gene regulatory networks from time series data. AVAILABILITY The R scripts are provided in File 3 in Supplementary Material. CONTACT zheng.li@monsanto.com; jingdong.liu@monsanto.com SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zheng Li
- Monsanto Company, Mail zone CC1A, Chesterfield, MO 63017, USA.
|
36
|
Hempel S, Koseska A, Nikoloski Z, Kurths J. Unraveling gene regulatory networks from time-resolved gene expression data - a measures comparison study. BMC Bioinformatics 2011; 12:292. [PMID: 21771321 PMCID: PMC3161045 DOI: 10.1186/1471-2105-12-292] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 03/11/2011] [Accepted: 07/19/2011] [Indexed: 11/25/2022] Open
Abstract
BACKGROUND Inferring regulatory interactions between genes from time-resolved transcriptomics data, yielding reverse-engineered gene regulatory networks, is of paramount importance to systems biology and bioinformatics studies. Accurate methods to address this problem can ultimately provide deeper insight into the complexity, behavior, and functions of the underlying biological systems. However, the large number of interacting genes, coupled with short and often noisy time-resolved read-outs of the system, renders the reverse engineering a challenging task. Therefore, the development and assessment of methods which are computationally efficient, robust against noise, applicable to short time series data, and preferably capable of reconstructing the directionality of the regulatory interactions remains a pressing research problem with valuable applications. RESULTS Here we perform the largest systematic analysis of a set of similarity measures and scoring schemes within the scope of the relevance network approach which are commonly used for gene regulatory network reconstruction from time series data. In addition, we define and analyze several novel measures and schemes which are particularly suitable for short transcriptomics time series. We also compare the considered 21 measures and 6 scoring schemes according to their ability to correctly reconstruct such networks from short time series data by calculating summary statistics based on the corresponding specificity and sensitivity. Our results demonstrate that rank and symbol based measures have the highest performance in inferring regulatory interactions. In addition, the proposed scoring scheme by asymmetric weighting has been shown to be valuable in reducing the number of false positive interactions. On the other hand, Granger causality as well as information-theoretic measures, frequently used in inference of regulatory networks, show low performance on the short time series analyzed in this study.
CONCLUSIONS Our study is intended to serve as a guide for choosing a particular combination of similarity measures and scoring schemes suitable for reconstruction of gene regulatory networks from short time series data. We show that further improvement of algorithms for reverse engineering can be obtained if one considers measures that are rooted in the study of symbolic dynamics or ranks, in contrast to the application of common similarity measures which do not consider the temporal character of the employed data. Moreover, we establish that the asymmetric weighting scoring scheme together with symbol based measures (for low noise level) and rank based measures (for high noise level) are the most suitable choices.
Affiliation(s)
- Sabrina Hempel
- Interdisciplinary Center for Dynamics of Complex Systems, University of Potsdam, Campus Golm, Karl-Liebknecht-Str. 24, D-14476 Potsdam, Germany
- Potsdam Institute for Climate Impact Research (PIK), Telegraphenberg A 31, D-14473 Potsdam, Germany
- Department of Physics, Humboldt University of Berlin, Campus Adlershof, Newtonstr. 15, D-12489 Berlin, Germany
| | - Aneta Koseska
- Interdisciplinary Center for Dynamics of Complex Systems, University of Potsdam, Campus Golm, Karl-Liebknecht-Str. 24, D-14476 Potsdam, Germany
| | - Zoran Nikoloski
- Systems Biology and Mathematical Modeling Group, Max Planck Institute for Molecular Plant Physiology, Am Mühlenberg 1, D-14476 Potsdam, Germany
- Institute of Biochemistry and Biology, University of Potsdam, Karl-Liebknecht-Str. 25, D-14476 Potsdam, Germany
| | - Jürgen Kurths
- Potsdam Institute for Climate Impact Research (PIK), Telegraphenberg A 31, D-14473 Potsdam, Germany
- Department of Physics, Humboldt University of Berlin, Campus Adlershof, Newtonstr. 15, D-12489 Berlin, Germany
- Institute for Complex Systems and Mathematical Biology, University of Aberdeen, Aberdeen AB243UE, UK
|
37
|
High dimensional data analysis using multivariate generalized spatial quantiles. J Multivariate Anal 2011; 102:768-780. [DOI: 10.1016/j.jmva.2010.12.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 11/22/2022]
|
38
|
Ao S, Palade V. Ensemble of Elman neural networks and support vector machines for reverse engineering of gene regulatory networks. Appl Soft Comput 2011. [DOI: 10.1016/j.asoc.2010.05.014] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Indexed: 12/01/2022]
|
39
|
Fujita A, Sato JR, Angelo M, Demasi A, Yamaguchi R, Shimamura T, Ferreira CE, Sogayar MC, Miyano S. Inferring contagion in regulatory networks. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:570-576. [PMID: 20479499 DOI: 10.1109/tcbb.2010.40] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2023]
Abstract
Several gene regulatory network models containing concepts of directionality at the edges have been proposed. However, only a few reports offer an interpretable definition of directionality. Here, in contrast to the standard causality concept defined by Pearl, we introduce the concept of contagion in order to infer directionality at the edges, i.e., asymmetries in the gene expression dependences of regulatory networks. Moreover, we present a bootstrap algorithm to test the contagion concept. This technique was applied to simulated data and to an actual large sample of biological data. A literature review confirmed that some genes identified by contagion actually belong to the TP53 pathway.
Affiliation(s)
- André Fujita
- A. Fujita is with the Computational Science Research Program, RIKEN, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan.
|
40
|
May P, Christian N, Ebenhöh O, Weckwerth W, Walther D. Integration of proteomic and metabolomic profiling as well as metabolic modeling for the functional analysis of metabolic networks. Methods Mol Biol 2011; 694:341-363. [PMID: 21082444 DOI: 10.1007/978-1-60761-977-2_21] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/30/2023]
Abstract
The integrated analysis of different omics-level data sets is most naturally performed in the context of common process or pathway association. In this chapter, the two basic approaches for a metabolic pathway-centric integration of proteomics and metabolomics data are described: the knowledge-based approach relying on existing metabolic pathway information, and a data-driven approach that aims to deduce functional (pathway) associations directly from the data. Relevant algorithmic approaches for the generation of metabolic networks of model organisms, their functional analysis, database resources, visualization and analysis tools will be described. The use of proteomics data in the process of metabolic network reconstruction will be discussed.
Affiliation(s)
- Patrick May
- Max-Planck-Institute for Molecular Plant Physiology, Potsdam-Golm, Germany
|
41
|
Nagarajan R, Upreti M. Inferring functional relationships and causal network structure from gene expression profiles. Methods Enzymol 2011; 487:133-46. [PMID: 21187224 DOI: 10.1016/b978-0-12-381270-4.00005-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/15/2022]
Abstract
Inferring functional relationships and network structure from observed gene expression profiles can provide novel insight into the working of genes as a system or network, as opposed to independent entities. Such networks may also represent possible causal relationships between a given set of genes and can hence prove to be a convenient abstraction of the underlying signaling mechanism. The discovery of functional relationships from observed gene expression profiles does not rely on prior literature and is hence useful in identifying undocumented relationships between a given set of genes. Several techniques have been proposed in the literature. The present study investigates the choice of Granger causality (GC) and its extensions in modeling the network structure between a given pair of genes from their expression profiles. The impact of noise variance on GC relationships is investigated. VAR parameter estimation is proposed to obtain finer insight into the functional relationships inferred using GC tests. Results are presented on synthetic networks generated from known vector-autoregressive (VAR) models and on networks from cell-cycle gene expression profiles that can be modeled as a first-order bivariate VAR.
Affiliation(s)
- Radhakrishnan Nagarajan
- Division of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
|
42
|
Abstract
Over the past 20 years, Omics technologies emerged as the consensual denomination of holistic molecular profiling. These techniques enable parallel measurements of biological -omes, or "all constituents considered collectively", and utilize the latest advancements in transcriptomics, proteomics, metabolomics, imaging, and bioinformatics. The technological accomplishments in increasing the sensitivity and throughput of the analytical devices, the standardization of the protocols and the widespread availability of reagents made the capturing of static molecular portraits of biological systems a routine task. The next generation of time course molecular profiling already allows for extensive molecular snapshots to be taken along the trajectory of time evolution of the investigated biological systems. Such datasets provide the basis for application of the inverse scientific approach. It consists in the inference of scientific hypotheses and theories about the structure and dynamics of the investigated biological system without any a priori knowledge, solely relying on data analysis to unveil the underlying patterns. However, most temporal Omics data still contain a limited number of time points, taken over arbitrary time intervals, through measurements on biological processes shifted in time. The analysis of the resulting short and noisy time series data sets is a challenge. Traditional statistical methods for the study of static Omics datasets are of limited relevance and new methods are required. This chapter discusses such algorithms which enable the application of the inverse analysis approach to short Omics time series.
|
43
|
Gao S, Hartman JL, Carter JL, Hessner MJ, Wang X. Global analysis of phase locking in gene expression during cell cycle: the potential in network modeling. BMC Syst Biol 2010; 4:167. [PMID: 21129191 PMCID: PMC3017040 DOI: 10.1186/1752-0509-4-167] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 07/18/2010] [Accepted: 12/03/2010] [Indexed: 11/10/2022]
Abstract
Background In nonlinear dynamic systems, synchrony through oscillation and frequency modulation is a general control strategy to coordinate multiple modules in response to external signals. Conversely, the synchrony information can be utilized to infer interaction. Increasing evidence suggests that frequency modulation is also common in transcription regulation. Results In this study, we investigate the potential of phase locking analysis, a technique to study the synchrony patterns, in the transcription network modeling of time course gene expression data. Using the yeast cell cycle data, we show that significant phase locking exists between transcription factors and their targets, between gene pairs with prior evidence of physical or genetic interactions, and among cell cycle genes. When compared with simple correlation we found that the phase locking metric can identify gene pairs that interact with each other more efficiently. In addition, it can automatically address issues of arbitrary time lags or different dynamic time scales in different genes, without the need for alignment. Interestingly, many of the phase locked gene pairs exhibit higher order than 1:1 locking, and significant phase lags with respect to each other. Based on these findings we propose a new phase locking metric for network reconstruction using time course gene expression data. We show that it is efficient at identifying network modules of focused biological themes that are important to cell cycle regulation. Conclusions Our result demonstrates the potential of phase locking analysis in transcription network modeling. It also suggests the importance of understanding the dynamics underlying the gene expression patterns.
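The abstract does not spell out the phase locking metric the authors propose, so the following is only a generic illustration of n:m phase locking, assuming the common "mean phase coherence" index computed from instantaneous phases. The helper names `analytic_signal` and `phase_locking_index` and the test signals are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (discrete Hilbert-transform construction)."""
    x = np.asarray(x, dtype=float)
    size = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(size)
    # Keep the DC term, double positive frequencies, drop negative ones.
    if size % 2 == 0:
        weights[0] = weights[size // 2] = 1.0
        weights[1:size // 2] = 2.0
    else:
        weights[0] = 1.0
        weights[1:(size + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

def phase_locking_index(x, y, n=1, m=1):
    """Magnitude of the mean of exp(i(n*phi_x - m*phi_y)); 1 means perfect n:m locking."""
    phase_x = np.angle(analytic_signal(x - np.mean(x)))
    phase_y = np.angle(analytic_signal(y - np.mean(y)))
    return float(np.abs(np.mean(np.exp(1j * (n * phase_x - m * phase_y)))))

# Hypothetical signals: two oscillations with a fixed phase lag, and white noise.
t = np.linspace(0.0, 10.0, 1000, endpoint=False)
gene_a = np.sin(2 * np.pi * t)
gene_b = np.sin(2 * np.pi * t - 1.0)          # same frequency, constant lag
noise = np.random.default_rng(1).normal(size=t.size)

pli_locked = phase_locking_index(gene_a, gene_b)
pli_noise = phase_locking_index(gene_a, noise)
```

Because the index depends only on the constancy of the phase difference, a fixed lag between the two series does not reduce it, which matches the abstract's remark that lags need no prior alignment.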
Affiliation(s)
- Shouguo Gao
- Department of Physics, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA
|
44
|
Shojaie A, Michailidis G. Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics 2010; 26:i517-23. [PMID: 20823316 PMCID: PMC2935442 DOI: 10.1093/bioinformatics/btq377] [Citation(s) in RCA: 74] [Impact Index Per Article: 4.9] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Components of biological systems interact with each other in order to carry out vital cell functions. Such information can be used to improve estimation and inference, and to obtain better insights into the underlying cellular mechanisms. Discovering regulatory interactions among genes is therefore an important problem in systems biology. Whole-genome expression data over time provides an opportunity to determine how the expression levels of genes are affected by changes in transcription levels of other genes, and can therefore be used to discover regulatory interactions among genes. RESULTS In this article, we propose a novel penalization method, called truncating lasso, for estimation of causal relationships from time-course gene expression data. The proposed penalty can correctly determine the order of the underlying time series, and improves the performance of the lasso-type estimators. Moreover, the resulting estimate provides information on the time lag between activation of transcription factors and their effects on regulated genes. We provide an efficient algorithm for estimation of model parameters, and show that the proposed method can consistently discover causal relationships in the large p, small n setting. The performance of the proposed model is evaluated favorably in simulated, as well as real, data examples. AVAILABILITY The proposed truncating lasso method is implemented in the R-package 'grangerTlasso' and is freely available at http://www.stat.lsa.umich.edu/~shojaie/.
Affiliation(s)
- Ali Shojaie
- Department of Statistics, University of Michigan, Ann Arbor, Michigan 48109, USA.
|
45
|
Walther D, Strassburg K, Durek P, Kopka J. Metabolic pathway relationships revealed by an integrative analysis of the transcriptional and metabolic temperature stress-response dynamics in yeast. OMICS: A Journal of Integrative Biology 2010; 14:261-74. [PMID: 20455750 DOI: 10.1089/omi.2010.0010] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 12/15/2022]
Abstract
The integrated analysis of omics datasets covering different levels of molecular organization has become a central task of systems biology. We investigated the transcriptional and metabolic response of yeast exposed to increased (37 °C) and lowered (10 °C) temperatures relative to optimal reference conditions (28 °C) in the context of known metabolic pathways. Pairwise metabolite correlation levels were found to carry more pathway-related information and to extend to farther distances within the metabolic pathway network than the associated transcript level correlations. Metabolites were found to correlate more strongly with their cognate transcripts (the metabolite is a reactant of the enzyme encoded by the transcript) than with more remote or randomly chosen transcripts, reflecting their close metabolic relationship. We observed a pronounced temporal hierarchy between metabolic and transcriptional molecular responses under heat and cold stress. Changes of metabolites were most significantly correlated with transcripts encoding metabolic enzymes when metabolites were considered as leading in time-lagged correlation analyses. By applying the concept of Granger causality, we detected directed relationships between metabolites and their cognate transcripts. When interpreted as substrate-to-product directions, most of these directed Granger causality pairs agreed with the KEGG-annotated preferred reaction direction. Thus, the introduced Granger causality approach may prove useful for determining the preferred direction of metabolic reactions in cellular systems. The metabolites glutamic acid and serine were identified as central causative metabolites influencing transcript levels at later time points. Selected examples are presented illustrating the intertwined relationships between metabolites and transcripts in the yeast temperature stress adaptation process.
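The time-lagged correlation analysis mentioned in this abstract can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `lagged_correlation` and `best_lag`, the lag range, and the simulated metabolite and transcript series are assumptions.

```python
import numpy as np

def lagged_correlation(leader, follower, lag):
    """Pearson correlation between leader[t] and follower[t + lag]."""
    leader = np.asarray(leader, dtype=float)
    follower = np.asarray(follower, dtype=float)
    if lag < 0:
        # A negative lag means the nominal follower actually leads.
        return lagged_correlation(follower, leader, -lag)
    if lag == 0:
        a, b = leader, follower
    else:
        a, b = leader[:-lag], follower[lag:]
    return float(np.corrcoef(a, b)[0, 1])

def best_lag(leader, follower, max_lag):
    """Return the lag with the largest absolute correlation, plus all scores."""
    scores = {lag: lagged_correlation(leader, follower, lag)
              for lag in range(-max_lag, max_lag + 1)}
    winner = max(scores, key=lambda lag: abs(scores[lag]))
    return winner, scores

# Hypothetical series: the transcript repeats the metabolite two samples later.
rng = np.random.default_rng(2)
metabolite = rng.normal(size=60)
transcript = rng.normal(size=60)
transcript[2:] = metabolite[:-2] + 0.1 * rng.normal(size=58)

lag, scores = best_lag(metabolite, transcript, max_lag=5)
```

A positive winning lag indicates that the first series leads the second, which is the sense in which the abstract describes metabolites as "leading".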
Affiliation(s)
- Dirk Walther
- Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476 Potsdam-Golm, Germany
|
46
|
Fujita A, Kojima K, Patriota AG, Sato JR, Severino P, Miyano S. A fast and robust statistical test based on likelihood ratio with Bartlett correction to identify Granger causality between gene sets. Bioinformatics 2010; 26:2349-51. [PMID: 20660295 DOI: 10.1093/bioinformatics/btq427] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/12/2022]
Abstract
We propose a likelihood ratio test (LRT) with Bartlett correction in order to identify Granger causality between sets of time series gene expression data. The performance of the proposed test is compared to a previously published bootstrap-based approach. The LRT is shown to be significantly faster and statistically powerful even under non-normal distributions. An R package named gGranger, containing an implementation of both Granger causality identification tests, is also provided. AVAILABILITY http://dnagarden.ims.u-tokyo.ac.jp/afujita/en/doku.php?id=ggranger.
Affiliation(s)
- André Fujita
- Computational Science Research Program, RIKEN, Wako, Saitama, Japan.
|
47
|
Zoppoli P, Morganella S, Ceccarelli M. TimeDelay-ARACNE: Reverse engineering of gene networks from time-course data by an information theoretic approach. BMC Bioinformatics 2010; 11:154. [PMID: 20338053 PMCID: PMC2862045 DOI: 10.1186/1471-2105-11-154] [Citation(s) in RCA: 139] [Impact Index Per Article: 9.3] [Received: 07/27/2009] [Accepted: 03/25/2010] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND One of the main aims of molecular biology is to gain knowledge about how molecular components interact with each other and to understand the regulation of gene function. Using microarray technology, it is possible to measure thousands of genes in a single analysis step, giving a picture of the cell's gene expression. Several methods have been developed to infer gene networks from steady-state data, but much less literature has been produced about time-course data, so the development of algorithms to infer gene networks from time-series measurements is a current challenge in bioinformatics research. In order to detect dependencies between genes at different time delays, we propose an approach to infer gene regulatory networks from time-series measurements starting from a well-known algorithm based on information theory. RESULTS In this paper we show how the ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) algorithm can be used for gene regulatory network inference in the case of time-course expression profiles. The resulting method is called TimeDelay-ARACNE. It tries to extract dependencies between two genes at different time delays, providing a measure of these dependencies in terms of mutual information. The basic idea of the proposed algorithm is to detect time-delayed dependencies between the expression profiles by assuming a stationary Markov random field as the underlying probabilistic model. Less informative dependencies are filtered out using an automatically calculated threshold, retaining the most reliable connections. TimeDelay-ARACNE can infer small local networks of time-regulated gene-gene interactions, detecting their direction and also discovering cyclic interactions, even when only a medium-small number of measurements are available. We test the algorithm both on synthetic networks and on microarray expression profiles. The microarray measurements concern the S. cerevisiae cell cycle, E. coli SOS pathways and a recently developed network for in vivo assessment of reverse engineering algorithms. Our results are compared with ARACNE itself and with those of two previously published algorithms, dynamic Bayesian networks and systems of ODEs, showing that TimeDelay-ARACNE has good accuracy, recall and F-score for the network reconstruction task. CONCLUSIONS Here we report the adaptation of the ARACNE algorithm to infer gene regulatory networks from time-course data, so that the resulting network is represented as a directed graph. The proposed algorithm is expected to be useful in the reconstruction of small biological directed networks from time-course data.
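The core quantity described here, mutual information between one profile and a time-delayed copy of another, can be sketched with a simple histogram estimator. This is an assumption-laden illustration rather than TimeDelay-ARACNE itself: the published method's estimator, thresholding, and pruning steps are not reproduced, and the names `mutual_information` and `delayed_mutual_information`, the bin count, and the simulated profiles are hypothetical.

```python
import numpy as np

def mutual_information(a, b, bins=8):
    """Histogram (plug-in) estimate of mutual information in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)     # marginal of a, shape (bins, 1)
    p_b = p_ab.sum(axis=0, keepdims=True)     # marginal of b, shape (1, bins)
    nonzero = p_ab > 0                        # skip empty cells to avoid log(0)
    ratio = p_ab[nonzero] / (p_a @ p_b)[nonzero]
    return float(np.sum(p_ab[nonzero] * np.log(ratio)))

def delayed_mutual_information(source, target, max_delay, bins=8):
    """Mutual information between source[t] and target[t + d] for d = 1..max_delay."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    return {d: mutual_information(source[:len(source) - d], target[d:], bins)
            for d in range(1, max_delay + 1)}

# Hypothetical profiles: the target follows the source three samples later.
rng = np.random.default_rng(3)
regulator = rng.normal(size=500)
target_gene = rng.normal(size=500)
target_gene[3:] = regulator[:-3] + 0.1 * rng.normal(size=497)

mi_by_delay = delayed_mutual_information(regulator, target_gene, max_delay=6)
best_delay = max(mi_by_delay, key=mi_by_delay.get)
```

Scanning delays in only one direction (source before target) is what gives the dependence a direction; real microarray time courses are far shorter than this simulated series, so a plug-in estimator like this one would be heavily biased there.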
Affiliation(s)
- Pietro Zoppoli
- Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
- Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
| | - Sandro Morganella
- Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
- Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
| | - Michele Ceccarelli
- Department of Biological and Environmental Studies, University of Sannio, Benevento, I-82100, Italy
- Biogem s c a r l, Institute for Genetic Research "Gaetano Salvatore", Ariano Irpino (Avellino), I-83031, Italy
|
48
|
Hickman GJ, Hodgman TC. Inference of gene regulatory networks using boolean-network inference methods. J Bioinform Comput Biol 2010; 7:1013-29. [PMID: 20014476 DOI: 10.1142/s0219720009004448] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 06/08/2009] [Revised: 08/14/2009] [Accepted: 08/15/2009] [Indexed: 02/03/2023]
Abstract
The modeling of genetic networks especially from microarray and related data has become an important aspect of the biosciences. This review takes a fresh look at a specific family of models used for constructing genetic networks, the so-called Boolean networks. The review outlines the various different types of Boolean network developed to date, from the original Random Boolean Network to the current Probabilistic Boolean Network. In addition, some of the different inference methods available to infer these genetic networks are also examined. Where possible, particular attention is paid to input requirements as well as the efficiency, advantages and drawbacks of each method. Though the Boolean network model is one of many models available for network inference today, it is well established and remains a topic of considerable interest in the field of genetic network inference. Hybrids of Boolean networks with other approaches may well be the way forward in inferring the most informative networks.
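For readers unfamiliar with the model family this review covers, a deterministic Boolean network with synchronous updates can be sketched in a few lines. The three-gene network, its rules, and the function names `step` and `trajectory` are invented for illustration and are not taken from the review.

```python
# Hypothetical three-gene network: each rule maps the current state to the
# next value of one gene.
RULES = {
    "A": lambda s: not s["C"],          # C represses A
    "B": lambda s: s["A"],              # A activates B
    "C": lambda s: s["A"] and s["B"],   # A and B jointly activate C
}

def step(state, rules=RULES):
    """Synchronous update: every gene is recomputed from the same old state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def trajectory(state, rules=RULES, max_steps=20):
    """Iterate until a state repeats; return visited states and the repeated state."""
    seen = []
    while state not in seen and len(seen) < max_steps:
        seen.append(state)
        state = step(state, rules)
    return seen, state

start = {"A": True, "B": False, "C": False}
visited, repeated = trajectory(start)
```

Because the state space is finite and the update is deterministic, every trajectory eventually revisits a state and enters an attractor. Inference methods of the kind the review discusses work in the opposite direction, searching for rules consistent with observed state transitions.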
|
49
|
Zhu J, Chen Y, Leonardson AS, Wang K, Lamb JR, Emilsson V, Schadt EE. Characterizing dynamic changes in the human blood transcriptional network. PLoS Comput Biol 2010; 6:e1000671. [PMID: 20168994 PMCID: PMC2820517 DOI: 10.1371/journal.pcbi.1000671] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Received: 09/18/2009] [Accepted: 01/07/2010] [Indexed: 11/29/2022] Open
Abstract
Gene expression data generated systematically in a given system over multiple time points provides a source of perturbation that can be leveraged to infer causal relationships among genes explaining network changes. Previously, we showed that food intake has a large impact on blood gene expression patterns and that these responses, either in terms of gene expression level or gene-gene connectivity, are strongly associated with metabolic diseases. In this study, we explored which genes drive the changes of gene expression patterns in response to time and food intake. We applied the Granger causality test and the dynamic Bayesian network to gene expression data generated from blood samples collected at multiple time points during the course of a day. The simulation result shows that combining many short time series together is as powerful to infer Granger causality as using a single long time series. Using the Granger causality test, we identified genes that were supported as the most likely causal candidates for the coordinated temporal changes in the network. These results show that PER1 is a key regulator of the blood transcriptional network, in which multiple biological processes are under circadian rhythm regulation. The fasted and fed dynamic Bayesian networks showed that over 72% of dynamic connections are self links. Finally, we show that different processes such as inflammation and lipid metabolism, which are disconnected in the static network, become dynamically linked in response to food intake, which would suggest that increasing nutritional load leads to coordinate regulation of these biological processes. In conclusion, our results suggest that food intake has a profound impact on the dynamic co-regulation of multiple biological processes, such as metabolism, immune response, apoptosis and circadian rhythm. The results could have broader implications for the design of studies of disease association and drug response in clinical trials. 
Peripheral blood is the most readily accessible human tissue for clinical studies and experimental research more generally. Large-scale molecular profiling technologies have enabled measurements of mRNA expression on the scale of whole genomes. Understanding the relationships between human blood gene expression profiles and clinical traits is extremely useful for inferring causal factors for human disease and for studying drug response. Biological pathways and the complex behaviors they induce are not static, but change dynamically in response to external factors such as intake/uptake of nutrients and administration of drugs. We employed a randomized, two-arm cross-over design to assess the effects of fasting and feeding on the dynamic changes of the blood transcriptional network. Our work has convincingly shown that feeding or increasing nutritional load affects the human circadian rhythm system, which connects to other biological processes including metabolic and immune responses. We believe this is a first step towards a more comprehensive population-based study that seeks to connect changes in the blood transcriptome to drug response, and to disease and biology more generally.
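The abstract's simulation claim, that many short time series can be combined for Granger causality testing, can be illustrated with a minimal sketch. This is not the authors' implementation: the single lag, the shared intercept across series, the simulated data, and every function name below are assumptions made for illustration. Lagged rows are built within each short series only and then pooled into one regression, and the F statistic compares a model using only the target's own past against one that also uses the candidate driver's past.

```python
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def center(v):
    mean = sum(v) / len(v)
    return [u - mean for u in v]


def residualize(v, on):
    """Remove from v its least-squares projection onto `on` (both centered)."""
    slope = dot(v, on) / dot(on, on)
    return [u - slope * w for u, w in zip(v, on)]


def pooled_granger_f(series):
    """Lag-1 F statistic for 'x Granger-causes y', pooled over several series.

    `series` is a list of (x, y) pairs. Lagged rows are built inside each
    series only, so no row spans two different subjects.
    """
    target, y_lag, x_lag = [], [], []
    for x, y in series:
        for t in range(1, len(y)):
            target.append(y[t])
            y_lag.append(y[t - 1])
            x_lag.append(x[t - 1])
    # Centering absorbs the (assumed shared) intercept.
    target, y_lag, x_lag = center(target), center(y_lag), center(x_lag)
    # Restricted model: y_t ~ y_{t-1}. Full model adds x_{t-1}; its residual
    # sum of squares follows from partialling y_{t-1} out of both sides.
    ry = residualize(target, y_lag)
    rx = residualize(x_lag, y_lag)
    rss_restricted = dot(ry, ry)
    rss_full = rss_restricted - dot(rx, ry) ** 2 / dot(rx, rx)
    dof = len(target) - 3  # intercept + two slopes in the full model
    return (rss_restricted - rss_full) / (rss_full / dof)


def simulate(length):
    """Hypothetical toy data in which x drives y with a one-step delay."""
    x = [random.gauss(0, 1) for _ in range(length)]
    y = [random.gauss(0, 1)]
    y += [0.8 * x[t - 1] + random.gauss(0, 0.1) for t in range(1, length)]
    return x, y


random.seed(0)
short_runs = [simulate(6) for _ in range(40)]  # 40 series, 5 lagged rows each
long_run = [simulate(201)]                     # 1 series, 200 lagged rows

f_short = pooled_granger_f(short_runs)
f_long = pooled_granger_f(long_run)
f_reverse = pooled_granger_f([(y, x) for x, y in short_runs])
print(f_short, f_long, f_reverse)
```

In this toy setting both designs contribute the same number of lagged rows, so both F statistics should be large for the true direction, while the reversed direction should stay near its null range. Real blood expression data would need more lags, subject-level effects, and multiple-testing control, none of which is attempted here.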
Affiliation(s)
- Jun Zhu
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
- * E-mail: (JZ); (EES)
| | - Yanqing Chen
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
| | - Amy S. Leonardson
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
| | - Kai Wang
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
| | - John R. Lamb
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
| | - Valur Emilsson
- Molecular Profiling and Research Informatics Department, Merck Research Laboratories, Rahway, New Jersey, United States of America
| | - Eric E. Schadt
- Department of Genetics, Rosetta Inpharmatics, LLC, a wholly owned subsidiary of Merck & Co., Inc., Seattle, Washington, United States of America
- * E-mail: (JZ); (EES)
|
50
|
Krishna R, Li CT, Buchanan-Wollaston V. A temporal precedence based clustering method for gene expression microarray data. BMC Bioinformatics 2010; 11:68. [PMID: 20113513 PMCID: PMC2841598 DOI: 10.1186/1471-2105-11-68] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/12/2009] [Accepted: 01/30/2010] [Indexed: 11/23/2022]
Abstract
BACKGROUND: Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures which may not be suitable for clustering of temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time-series by statistically testing if one time-series can be used for forecasting the other time-series or not.
RESULTS: A gene-association matrix is constructed by testing temporal relationships between pairs of genes using the Granger causality test. The association matrix is further analyzed using a graph-theoretic technique to detect highly connected components representing interesting biological modules. We test our approach on synthesized datasets and real biological datasets obtained for Arabidopsis thaliana. We show the effectiveness of our approach by analyzing the results using the existing biological literature. We also report interesting structural properties of the association network commonly desired in any biological system.
CONCLUSIONS: Our experiments on synthesized and real microarray datasets show that our approach produces encouraging results. The method is simple in implementation and is statistically traceable at each step. The method can produce sets of functionally related genes which can be further used for reverse-engineering of gene circuits.
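The pipeline summarized in this abstract (pairwise Granger tests, an association matrix, then graph-based grouping) can be sketched roughly as follows. This is an illustrative toy, not the paper's method: it assumes a single lag, thresholds the F statistic directly instead of computing a p-value, and swaps in plain connected components for the "highly connected components" technique the authors describe. All function names, the threshold, and the simulated genes are hypothetical.

```python
import random


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def center(v):
    mean = sum(v) / len(v)
    return [u - mean for u in v]


def granger_f(x, y):
    """Lag-1 F statistic: does x's past help forecast y beyond y's own past?"""
    target, y_lag, x_lag = center(y[1:]), center(y[:-1]), center(x[:-1])
    # Partial y's own lag out of both the target and the candidate driver.
    b = dot(target, y_lag) / dot(y_lag, y_lag)
    ry = [u - b * w for u, w in zip(target, y_lag)]
    c = dot(x_lag, y_lag) / dot(y_lag, y_lag)
    rx = [u - c * w for u, w in zip(x_lag, y_lag)]
    rss_restricted = dot(ry, ry)
    rss_full = rss_restricted - dot(rx, ry) ** 2 / dot(rx, rx)
    return (rss_restricted - rss_full) / (rss_full / (len(target) - 3))


def association_matrix(genes, threshold):
    """assoc[i][j] is True when gene i passes the test as a predictor of gene j."""
    n = len(genes)
    return [[i != j and granger_f(genes[i], genes[j]) > threshold
             for j in range(n)] for i in range(n)]


def connected_components(assoc):
    """Group genes linked in either direction (simplified grouping step)."""
    n = len(assoc)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            v = stack.pop()
            group.append(v)
            for w in range(n):
                if w not in seen and (assoc[v][w] or assoc[w][v]):
                    seen.add(w)
                    stack.append(w)
        groups.append(sorted(group))
    return groups


def driven_by(source, noise_sd=0.1):
    """Hypothetical follower gene that tracks `source` with a one-step delay."""
    follower = [random.gauss(0, 1)]
    follower += [0.8 * source[t - 1] + random.gauss(0, noise_sd)
                 for t in range(1, len(source))]
    return follower


random.seed(1)
g0 = [random.gauss(0, 1) for _ in range(200)]
g2 = [random.gauss(0, 1) for _ in range(200)]
genes = [g0, driven_by(g0), g2, driven_by(g2)]

assoc = association_matrix(genes, threshold=20.0)
groups = connected_components(assoc)
print(groups)
```

The F threshold of 20 is deliberately strict for this toy so that unrelated gene pairs are very unlikely to be linked by chance; a real analysis would convert the statistic to a p-value and correct for the number of gene pairs tested.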
Affiliation(s)
- Ritesh Krishna
- Department of Computer Science, Warwick University, Coventry CV4 7AL, UK
| | - Chang-Tsun Li
- Department of Computer Science, Warwick University, Coventry CV4 7AL, UK
|