51
|
Granger Causality in Systems Biology: Modeling Gene Networks in Time Series Microarray Data Using Vector Autoregressive Models. Advances in Bioinformatics and Computational Biology 2010. [DOI: 10.1007/978-3-642-15060-9_2] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/04/2022]
|
52
|
Fujita A, Patriota AG, Sato JR, Miyano S. The impact of measurement errors in the identification of regulatory networks. BMC Bioinformatics 2009; 10:412. [PMID: 20003382 PMCID: PMC2811120 DOI: 10.1186/1471-2105-10-412] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Received: 07/16/2009] [Accepted: 12/13/2009] [Indexed: 11/21/2022]
Abstract
BACKGROUND There are several studies in the literature depicting measurement error in gene expression data, and several others about regulatory network models. However, only a small fraction combine measurement error with mathematical regulatory networks and show how to identify these networks under different rates of noise. RESULTS This article investigates the effects of measurement error on the estimation of the parameters in regulatory networks. Simulation studies indicate that, in both time series (dependent) and non-time series (independent) data, the measurement error strongly affects the estimated parameters of the regulatory network models, biasing them as predicted by the theory. Moreover, when testing the parameters of the regulatory network models, p-values computed by ignoring the measurement error are not reliable, since the rate of false positives is not controlled under the null hypothesis. In order to overcome these problems, we present an improved version of the Ordinary Least Squares estimator for independent (regression models) and dependent (autoregressive models) data when the variables are subject to noise. Moreover, measurement error estimation procedures for microarrays are also described. Simulation results also show that both corrected methods perform better than the standard ones (i.e., those ignoring measurement error). The proposed methodologies are illustrated using microarray data from lung cancer patients and mouse liver time series data. CONCLUSIONS Measurement error seriously affects the identification of regulatory network models; it must therefore be reduced or taken into account in order to avoid erroneous conclusions. This could be one of the reasons for the high biological false positive rates identified in actual regulatory network models.
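The bias described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' estimator: it uses the textbook method-of-moments correction for a single noisy regressor, and it assumes that the measurement error variance (`noise_var`, a hypothetical name) is known in advance. All function names and parameter values are illustrative.

```python
import numpy as np

def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with intercept)."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float(x_c @ y_c / (x_c @ x_c))

def corrected_slope(x_obs, y, noise_var):
    """Method-of-moments slope that subtracts the (assumed known)
    measurement error variance from the observed regressor variance."""
    n = len(x_obs)
    x_c = x_obs - x_obs.mean()
    y_c = y - y.mean()
    s_xx = x_c @ x_c / (n - 1)
    s_xy = x_c @ y_c / (n - 1)
    return float(s_xy / (s_xx - noise_var))

rng = np.random.default_rng(0)
n = 20000
beta = 2.0        # true regulatory coefficient
noise_var = 0.5   # variance of the measurement error on the regressor

x_true = rng.normal(0.0, 1.0, n)                         # unobserved expression
y = beta * x_true + rng.normal(0.0, 0.1, n)              # target gene
x_obs = x_true + rng.normal(0.0, np.sqrt(noise_var), n)  # noisy measurement

naive = ols_slope(x_obs, y)                    # shrunk towards zero
corrected = corrected_slope(x_obs, y, noise_var)
print(f"naive={naive:.3f} corrected={corrected:.3f}")
```

With a unit-variance regressor, the naive slope is expected to shrink by roughly the factor 1 / (1 + noise_var), while the corrected slope should land near the true coefficient.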
Affiliation(s)
- André Fujita
  - Computational Science Research Program, RIKEN, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
- Alexandre G Patriota
  - Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 - São Paulo, 05508-090, Brazil
- João R Sato
  - Center of Mathematics, Computation and Cognition, Universidade Federal do ABC, Rua Santa Adélia, 166 - Santo André, 09210-170, Brazil
- Satoru Miyano
  - Computational Science Research Program, RIKEN, 2-1 Hirosawa, Wako, Saitama, 351-0198, Japan
  - Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
|
53
|
Campiteli MG, Soriani FM, Malavazi I, Kinouchi O, Pereira CAB, Goldman GH. A reliable measure of similarity based on dependency for short time series: an application to gene expression networks. BMC Bioinformatics 2009; 10:270. [PMID: 19712487 PMCID: PMC2757031 DOI: 10.1186/1471-2105-10-270] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/04/2008] [Accepted: 08/28/2009] [Indexed: 02/01/2023]
Abstract
Background Microarray techniques have become an important tool for the investigation of genetic relationships and the assignment of different phenotypes. Since microarrays are still very expensive, most of the experiments are performed with small samples. This paper introduces a method to quantify dependency between data series composed of few sample points. The method is used to construct gene co-expression subnetworks of highly significant edges. Results The results shown here are for an adapted subset of a Saccharomyces cerevisiae gene expression data set with low temporal resolution and poor statistics. The method reveals common transcription factors with a high confidence level and allows the construction of subnetworks with high biological relevance that reveal characteristic features of the processes driving the organism's adaptations to specific environmental conditions. Conclusion Our method allows a reliable and sophisticated analysis of microarray data even under severe constraints. The utilization of systems biology improves the biologist's ability to elucidate the mechanisms underlying cellular processes and to formulate new hypotheses.
|
54
|
Lozano AC, Abe N, Liu Y, Rosset S. Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics 2009; 25:i110-8. [PMID: 19477976 PMCID: PMC2687953 DOI: 10.1093/bioinformatics/btp199] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Indexed: 11/18/2022]
Abstract
We consider the problem of discovering gene regulatory networks from time-series microarray data. Recently, graphical Granger modeling has gained considerable attention as a promising direction for addressing this problem. These methods apply graphical modeling methods to time-series data and invoke the notion of ‘Granger causality’ to make assertions on causality through inference on time-lagged effects. Existing algorithms, however, have neglected an important aspect of the problem: the group structure among the lagged temporal variables naturally imposed by the time series they belong to. Specifically, existing methods in computational biology share this shortcoming, as well as additional computational limitations, which prevent their effective application to large datasets containing a large number of genes and many data points. In the present article, we propose a novel methodology, which we term the ‘grouped graphical Granger modeling method’, that overcomes the limitations mentioned above by applying a regression method suited for high-dimensional and large data, and by leveraging the group structure among the lagged temporal variables according to the time series they belong to. We demonstrate the effectiveness of the proposed methodology on both simulated and actual gene expression data, specifically the human cancer cell (HeLa S3) cycle data. The simulation results show that the proposed methodology generally exhibits higher accuracy in recovering the underlying causal structure. The results on the gene expression data demonstrate that it leads to improved accuracy with respect to prediction of known links, and also uncovers additional causal relationships not captured by earlier works. Contact: aclozano@us.ibm.com
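The group structure mentioned in this abstract can be made concrete with a short sketch. It builds a lagged design matrix in which all lags of one gene share a group label, then applies block soft-thresholding, the shrinkage step used by group-lasso-type regressions. Treating the group lasso as the regression method is an assumption here, since the abstract does not name one, and the helper names `lagged_design` and `group_soft_threshold` are hypothetical.

```python
import numpy as np

def lagged_design(series, max_lag):
    """Stack lags 1..max_lag of every gene as columns.

    series: array of shape (time points, genes).
    Returns (X, groups), where groups[k] is the gene that column k
    of X belongs to, so all lags of one gene form one group.
    """
    n, p = series.shape
    cols, groups = [], []
    for gene in range(p):
        for lag in range(1, max_lag + 1):
            cols.append(series[max_lag - lag : n - lag, gene])
            groups.append(gene)
    return np.column_stack(cols), np.array(groups)

def group_soft_threshold(coef, groups, lam):
    """Shrink each group's coefficient vector towards zero as a block.

    A group whose Euclidean norm is at most lam is zeroed entirely,
    which removes every lag of that gene at once.
    """
    out = np.zeros_like(coef, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(coef[idx])
        if norm > lam:
            out[idx] = coef[idx] * (1.0 - lam / norm)
    return out

series = np.arange(12, dtype=float).reshape(6, 2)  # 6 time points, 2 genes
X, groups = lagged_design(series, max_lag=2)
shrunk = group_soft_threshold(np.array([3.0, 4.0, 0.1, 0.1]), groups, lam=1.0)
print(X.shape, groups, shrunk)
```

Because a gene's lags are kept or dropped together, a nonzero group can be read as a single candidate edge from that gene to the target, rather than as several unrelated lag-level effects.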
Affiliation(s)
- Aurélie C Lozano
- Mathematical Sciences Department, IBM TJ Watson Research Center, Yorktown Heights, NY 10598, USA.
|
55
|
Hu J, Hu F. Estimating equation-based causality analysis with application to microarray time series data. Biostatistics 2009; 10:468-80. [PMID: 19329818 DOI: 10.1093/biostatistics/kxp005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/15/2022]
Abstract
Microarray time-course data can be used to explore interactions among genes and infer gene networks. The crucial step in constructing a gene network is to develop an appropriate causality test. In this regard, the expression profile of each gene can be treated as a time series. A typical existing method establishes Granger causality based on a Wald-type test, which relies on the homoscedastic normality assumption of the data distribution. However, this assumption can be seriously violated in real microarray experiments and thus may lead to inconsistent test results and false scientific conclusions. To overcome this drawback, we propose an estimating equation-based method which is robust to both heteroscedasticity and nonnormality of the gene expression data. In fact, it only requires the residuals to be uncorrelated. We use simulation studies and a real-data example to demonstrate the applicability of the proposed method.
Affiliation(s)
- Jianhua Hu
- Department of Biostatistics, Division of Quantitative Sciences, University of Texas M. D. Anderson Cancer Center, Houston, TX, USA.
|
56
|
Abstract
Granger causality (GC) and its extensions have been used widely to infer causal relationships from multivariate time series generated from biological systems. GC is ideally suited for causal inference in bivariate vector autoregressive (VAR) processes. A zero magnitude of the upper or lower off-diagonal element(s) in a bivariate VAR is indicative of a lack of causal relationship in that direction, resulting in true acyclic structures. However, in experimental settings, statistical tests, such as the F-test, that rely on the ratio of the mean-squared forecast errors are used to infer significant GC relationships. The present study investigates acyclic approximations within the context of bi-directional two-gene network motifs modeled as bivariate VAR. The fine interplay between the model parameters in the bivariate VAR, namely (i) transcriptional noise variance, (ii) autoregulatory feedback, and (iii) transcriptional coupling strength, that can give rise to discrepancies in the ratio of the mean-squared forecast errors is investigated. Subsequently, their impact on statistical power is investigated using Monte Carlo simulations. More importantly, it is shown that one can arrive at acyclic approximations even for bi-directional networks for a suitable choice of process parameters, significance level and sample size. While the results are discussed within the framework of transcriptional networks, the analytical treatment provided is generic and likely to have significant impact across distinct paradigms.
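The F-test on forecast errors mentioned in this abstract can be sketched for a two-gene VAR(1). The example simulates a process whose upper off-diagonal coefficient is zero, so gene 1 does not drive gene 0, and compares the restricted and full regressions in each direction. The coefficient matrix, sample size and function names are illustrative assumptions, not values taken from the study.

```python
import numpy as np

def simulate_var1(A, n, noise_sd, seed=0):
    """Simulate x[t] = A @ x[t-1] + noise for a two-gene system."""
    rng = np.random.default_rng(seed)
    x = np.zeros((n, 2))
    for t in range(1, n):
        x[t] = A @ x[t - 1] + rng.normal(0.0, noise_sd, 2)
    return x

def residual_ss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(np.sum((y - X @ coef) ** 2))

def granger_f(target, source):
    """Lag-1 F statistic for 'source Granger-causes target'.

    Compares a restricted model (target's own past only) with a
    full model that also includes the past of source.
    """
    y = target[1:]
    ones = np.ones_like(y)
    X_restricted = np.column_stack([ones, target[:-1]])
    X_full = np.column_stack([ones, target[:-1], source[:-1]])
    rss_r = residual_ss(X_restricted, y)
    rss_f = residual_ss(X_full, y)
    dof = len(y) - X_full.shape[1]
    return (rss_r - rss_f) / (rss_f / dof)  # one restriction tested

# Zero upper off-diagonal element: gene 1 has no effect on gene 0.
A = np.array([[0.5, 0.0],
              [0.4, 0.5]])
x = simulate_var1(A, n=500, noise_sd=1.0)

f_01 = granger_f(x[:, 1], x[:, 0])  # gene 0 -> gene 1 (coupled)
f_10 = granger_f(x[:, 0], x[:, 1])  # gene 1 -> gene 0 (uncoupled)
print(f"F(0->1)={f_01:.1f}  F(1->0)={f_10:.1f}")
```

In practice each statistic would be compared with an F distribution with 1 and `dof` degrees of freedom at the chosen significance level; a weak coupling, a small sample or a large noise variance can leave a true edge undetected, which is the kind of acyclic approximation the abstract discusses.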
|
57
|
Fujita A, Sato JR, Garay-Malpartida HM, Sogayar MC, Ferreira CE, Miyano S. Modeling nonlinear gene regulatory networks from time series gene expression data. J Bioinform Comput Biol 2009; 6:961-79. [PMID: 18942161 DOI: 10.1142/s0219720008003746] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.1] [Received: 11/20/2007] [Revised: 02/27/2008] [Accepted: 02/27/2008] [Indexed: 01/13/2023]
Abstract
In cells, molecular networks such as gene regulatory networks are the basis of biological complexity. Therefore, gene regulatory networks have become the core of research in systems biology. Understanding the processes underlying the several extracellular regulators, signal transduction, protein-protein interactions, and differential gene expression processes requires a detailed molecular description of the protein and gene networks involved. To better understand these complex molecular networks and to infer new regulatory associations, we propose a statistical method based on vector autoregressive models and Granger causality to estimate nonlinear gene regulatory networks from time series microarray data. Most of the models available in the literature assume linearity in the inference of gene connections; moreover, these models do not infer directionality in these connections. Thus, a priori biological knowledge is required. However, in pathological cases, no a priori biological information is available. To overcome these problems, we present the nonlinear vector autoregressive (NVAR) model. We have applied the NVAR model to estimate nonlinear gene regulatory networks based entirely on gene expression profiles obtained from DNA microarray experiments. We show the results obtained by NVAR through several simulations and by the construction of three actual gene regulatory networks (p53, NF-kappaB, and c-Myc) for HeLa cells.
Affiliation(s)
- André Fujita
- Human Genome Center, University of Tokyo, 4-6-1 Shirokanedai, Tokyo 108-8639, Japan.
|
58
|
Wren JD, Wilkins D, Fuscoe JC, Bridges S, Winters-Hilt S, Gusev Y. Proceedings of the 2008 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference. BMC Bioinformatics 2008; 9 Suppl 9:S1. [PMID: 18793454 PMCID: PMC2537572 DOI: 10.1186/1471-2105-9-s9-s1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/21/2022]
Affiliation(s)
- Jonathan D Wren
- Arthritis and Immunology Research Program, Oklahoma Medical Research Foundation; 825 N.E. 13th Street, Oklahoma City, OK 73104-5005, USA.
|
59
|
Nagarajan R, Upreti M. Comment on causality and pathway search in microarray time series experiment. Bioinformatics 2008; 24:1029-32; author reply 1033. [PMID: 18304931 DOI: 10.1093/bioinformatics/btm586] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
|
60
|
Dhamala M, Rangarajan G, Ding M. Estimating Granger causality from Fourier and wavelet transforms of time series data. Physical Review Letters 2008; 100:018701. [PMID: 18232831 DOI: 10.1103/physrevlett.100.018701] [Citation(s) in RCA: 190] [Impact Index Per Article: 11.2] [Received: 06/05/2007] [Indexed: 05/12/2023]
Abstract
Experiments in many fields of science and engineering yield data in the form of time series. The Fourier and wavelet transform-based nonparametric methods are used widely to study the spectral characteristics of these time series data. Here, we extend the framework of nonparametric spectral methods to include the estimation of Granger causality spectra for assessing directional influences. We illustrate the utility of the proposed methods using synthetic data from network models consisting of interacting dynamical systems.
Affiliation(s)
- Mukeshwar Dhamala
- Department of Physics and Astronomy, Brains and Behavior Program, Center for Behavioral Neuroscience, Georgia State University, Atlanta, GA 30303, USA
|
61
|
Abstract
Common human diseases like obesity and diabetes are driven by complex networks of genes and any number of environmental factors. To understand this complexity in hopes of identifying targets and developing drugs against disease, a systematic approach is required to elucidate the genetic and environmental factors and the interactions among and between these factors, and to establish how these factors induce changes in gene networks that in turn lead to disease. The explosion of large-scale, high-throughput technologies in the biological sciences has enabled researchers to take a more systems-oriented approach to studying complex traits like disease. Genotyping hundreds of thousands of DNA markers and profiling tens of thousands of molecular phenotypes simultaneously in thousands of individuals is now possible, and this scale of data is making it possible for the first time to reconstruct whole gene networks associated with disease. In the following sections, we review different approaches for integrating genetic, expression and clinical data to infer causal relationships among gene expression traits and between expression and disease traits. We further review methods to integrate these data in a more comprehensive manner to identify common pathways shared by the causal factors driving disease, including the reconstruction of association and probabilistic causal networks. Particular attention is paid to integrating diverse information to refine these types of networks so that they are more predictive. To highlight these different approaches in practice, we step through an example of how Insig2 was identified as a causal factor for plasma cholesterol levels in mice.
|
62
|
Fujita A, Sato JR, Ferreira CE, Sogayar MC. GEDI: a user-friendly toolbox for analysis of large-scale gene expression data. BMC Bioinformatics 2007; 8:457. [PMID: 18021455 PMCID: PMC2194737 DOI: 10.1186/1471-2105-8-457] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 05/21/2007] [Accepted: 11/19/2007] [Indexed: 11/10/2022]
Abstract
BACKGROUND Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when attempting to use those methods in their research. Here we present a user-friendly toolbox which allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. RESULTS Here, we introduce a user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely available tool that we believe will be useful to a wide range of laboratories, and to researchers with no background in Mathematics and Computer Science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. CONCLUSION GEDI is an integrated user-friendly viewer that combines the state-of-the-art SVR, DVAR and SVAR algorithms previously developed by us. It facilitates the application of SVR, DVAR and SVAR beyond the mathematical formulas presented in the corresponding publications, and allows one to better understand the results by means of the available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in Molecular Biology.
Affiliation(s)
- André Fujita
- Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 - São Paulo, 05508-900, SP, Brazil.
|
63
|
Fujita A, Sato JR, Garay-Malpartida HM, Yamaguchi R, Miyano S, Sogayar MC, Ferreira CE. Modeling gene expression regulatory networks with the sparse vector autoregressive model. BMC Systems Biology 2007; 1:39. [PMID: 17761000 PMCID: PMC2048982 DOI: 10.1186/1752-0509-1-39] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.6] [Received: 04/25/2007] [Accepted: 08/30/2007] [Indexed: 12/25/2022]
Abstract
BACKGROUND To understand the molecular mechanisms underlying important biological processes, a detailed description of the gene product networks involved is required. In order to define and understand such molecular networks, some statistical methods have been proposed in the literature to estimate gene regulatory networks from time-series microarray data. However, several problems still need to be overcome. Firstly, information flow needs to be inferred, in addition to the correlation between genes. Secondly, we usually try to identify large networks from a large number of genes (parameters) originating from a smaller number of microarray experiments (samples). Due to this situation, which is rather frequent in Bioinformatics, it is difficult to perform statistical tests using methods that model large gene-gene networks. In addition, most of the models are based on dimension reduction using clustering techniques; therefore, the resulting network is not a gene-gene network but a module-module network. Here, we present the Sparse Vector Autoregressive model as a solution to these problems. RESULTS We have applied the Sparse Vector Autoregressive model to estimate gene regulatory networks based on gene expression profiles obtained from time-series microarray experiments. Through extensive simulations, by applying the SVAR method to artificial regulatory networks, we show that SVAR can infer true positive edges even under conditions in which the number of samples is smaller than the number of genes. Moreover, it is possible to control for false positives, a significant advantage when compared to other methods described in the literature, which are based on ranks or score functions. By applying SVAR to actual HeLa cell cycle gene expression data, we were able to identify well-known transcription factor targets.
CONCLUSION The proposed SVAR method is able to model gene regulatory networks in frequent situations in which the number of samples is lower than the number of genes, making it possible to naturally infer partial Granger causalities without any a priori information. In addition, we present a statistical test to control the false discovery rate, which was not previously possible using other gene regulatory network models.
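Controlling the false discovery rate over many candidate edges can be illustrated with the Benjamini-Hochberg step-up procedure. The abstract does not say which procedure the authors use, so this is a generic sketch under that assumption, with hypothetical edge p-values.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level q.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is <= q * k / m, then reject the k smallest p-values.
    """
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))
        reject[order[: k + 1]] = True
    return reject

# Hypothetical p-values for four candidate regulatory edges.
edge_pvalues = [0.01, 0.02, 0.03, 0.5]
kept_edges = benjamini_hochberg(edge_pvalues, q=0.05)
print(kept_edges)
```

Each edge test contributes one p-value, and only the edges flagged in the returned mask would be drawn in the inferred network.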
Affiliation(s)
- André Fujita
  - Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090, SP, Brazil
  - Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05513-970, SP, Brazil
- João R Sato
  - Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090, SP, Brazil
- Humberto M Garay-Malpartida
  - Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05513-970, SP, Brazil
  - School of Arts, Science and Humanities, University of São Paulo, Av. Arlindo Bettio, 1000 – São Paulo, 03828-000, SP, Brazil
- Rui Yamaguchi
  - Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
- Satoru Miyano
  - Human Genome Center, Institute of Medical Science, University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
- Mari C Sogayar
  - Chemistry Institute, University of São Paulo, Av. Lineu Prestes, 748 – São Paulo, 05513-970, SP, Brazil
- Carlos E Ferreira
  - Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 – São Paulo, 05508-090, SP, Brazil
|
64
|
Fujita A, Sato JR, Garay-Malpartida HM, Morettin PA, Sogayar MC, Ferreira CE. Time-varying modeling of gene expression regulatory networks using the wavelet dynamic vector autoregressive method. Bioinformatics 2007; 23:1623-30. [PMID: 17463021 DOI: 10.1093/bioinformatics/btm151] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION A variety of biological cellular processes are achieved through a variety of extracellular regulators, signal transduction, protein-protein interactions and differential gene expression. Understanding the mechanisms underlying these processes requires a detailed molecular description of the protein and gene networks involved. To better understand these molecular networks, we propose a statistical method to estimate time-varying gene regulatory networks from time series microarray data. One well-known problem when inferring connectivity in gene regulatory networks is that the relationships found are correlations, which do not allow causation to be inferred without a priori biological knowledge. Moreover, it is also necessary to know the time period at which this causation occurs. Here, we present the Dynamic Vector Autoregressive model as a solution to these problems. RESULTS We have applied the Dynamic Vector Autoregressive model to estimate time-varying gene regulatory networks based on gene expression profiles obtained from microarray experiments. The network is determined entirely from gene expression profile data, without any prior biological knowledge. Through construction of three gene regulatory networks (of p53, NF-kappaB and c-myc) for HeLa cells, we were able to predict the connectivity, Granger causality and dynamics of the information flow in these networks. SUPPLEMENTARY INFORMATION Additional figures may be found at http://mariwork.iq.usp.br/dvar/.
Affiliation(s)
- A Fujita
- Institute of Mathematics and Statistics, University of São Paulo, Rua do Matão, 1010 - São Paulo, 05508-090, SP, Brazil
|