1
Causal Queries from Observational Data in Biological Systems via Bayesian Networks: An Empirical Study in Small Networks. Methods Mol Biol 2018. [PMID: 30547398 DOI: 10.1007/978-1-4939-8882-2_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2023]
Abstract
Biological networks are a very convenient modeling and visualization tool to discover knowledge from modern high-throughput genomics and post-genomics data sets. Indeed, biological entities are not isolated but are components of complex multilevel systems. We go one step further and advocate for the consideration of causal representations of the interactions in living systems. We present the causal formalism and bring it out in the context of biological networks, when the data is observational. We also discuss its ability to decipher the causal information flow as observed in gene expression. We also illustrate our exploration by experiments on small simulated networks as well as on a real biological data set.
2
Bayesian variable selection with graphical structure learning: Applications in integrative genomics. PLoS One 2018; 13:e0195070. [PMID: 30059495 PMCID: PMC6066211 DOI: 10.1371/journal.pone.0195070] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 10/20/2016] [Accepted: 03/12/2018] [Indexed: 11/19/2022] Open
Abstract
Significant advances in biotechnology have allowed for simultaneous measurement of molecular data across multiple genomic, epigenomic and transcriptomic levels from a single tumor/patient sample. This has motivated systematic data-driven approaches to integrate multi-dimensional structured datasets, since cancer development and progression is driven by numerous co-ordinated molecular alterations and the interactions between them. We propose a novel multi-scale Bayesian approach that combines integrative graphical structure learning from multiple sources of data with a variable selection framework—to determine the key genomic drivers of cancer progression. The integrative structure learning is first accomplished through novel joint graphical models for heterogeneous (mixed scale) data, allowing for flexible and interpretable incorporation of prior existing knowledge. This subsequently informs a variable selection step to identify groups of co-ordinated molecular features within and across platforms associated with clinical outcomes of cancer progression, while according appropriate adjustments for multicollinearity and multiplicities. We evaluate our methods through rigorous simulations to establish superiority over existing methods that do not take the network and/or prior information into account. Our methods are motivated by and applied to a glioblastoma multiforme (GBM) dataset from The Cancer Genome Atlas to predict patient survival times integrating gene expression, copy number and methylation data. We find a high concordance between our selected prognostic gene network modules with known associations with GBM. In addition, our model discovers several novel cross-platform network interactions (both cis and trans acting) between gene expression, copy number variation associated gene dosing and epigenetic regulation through promoter methylation, some with known implications in the etiology of GBM. 
Our framework provides a useful tool for biomedical researchers, since clinical prediction using multi-platform genomic information is an important step towards personalized treatment of many cancers.
3
Becker J, Pérot P, Cheynet V, Oriol G, Mugnier N, Mommert M, Tabone O, Textoris J, Veyrieras JB, Mallet F. A comprehensive hybridization model allows whole HERV transcriptome profiling using high density microarray. BMC Genomics 2017; 18:286. [PMID: 28390408 PMCID: PMC5385096 DOI: 10.1186/s12864-017-3669-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 09/23/2016] [Accepted: 03/28/2017] [Indexed: 02/07/2023] Open
Abstract
Background: Human endogenous retroviruses (HERVs) have received much attention for their implications in the etiology of many human diseases and their profound effect on evolution. Notably, recent studies have highlighted associations between HERV expression and cancers (Yu et al., Int J Mol Med 32, 2013), autoimmunity (Balada et al., Int Rev Immunol 29:351–370, 2010) and neurological (Christensen, J Neuroimmune Pharmacol 5:326–335, 2010) conditions. Their repetitive nature makes their study particularly challenging, and expression studies have largely focused on individual loci (De Parseval et al., J Virol 77:10414–10422, 2003) or general trends within families (Forsman et al., J Virol Methods 129:16–30, 2005; Seifarth et al., J Virol 79:341–352, 2005; Pichon et al., Nucleic Acids Res 34:e46, 2006). Methods: To refine our understanding of HERV activity, we introduce here a new microarray, HERV-V3. This work was made possible by the careful detection and annotation of genomic HERV/MaLR sequences as well as the development of a new hybridization model, allowing the optimization of probe performances and the control of cross-reactions. Results: HERV-V3 offers an almost complete coverage of HERVs and their ancestors (mammalian apparent LTR-retrotransposons, MaLRs) at the locus level along with four other repertoires (active LINE-1 elements, lncRNA, a selection of 1559 human genes and common infectious viruses). We demonstrate that HERV-V3 analytical performances are comparable with commercial Affymetrix arrays, and that for a selection of tissue/pathological specific loci, the patterns of expression measured on HERV-V3 are consistent with those reported in the literature.
Conclusions: Given its large HERVs/MaLRs coverage and additional repertoires, HERV-V3 opens the door to multiple applications such as enhancers and alternative promoters identification, biomarkers identification as well as the characterization of genes and HERVs/MaLRs modulation caused by viral infection. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-3669-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jérémie Becker
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
- Philippe Pérot
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
- Valérie Cheynet
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
- Guy Oriol
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
- Nathalie Mugnier
  - Bioinformatics Research Department, bioMerieux, 376 Chemin de l'Orme, 69280, Marcy l'Etoile, France
- Marine Mommert
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
  - EA 7426 Pathophysiology of Injury-induced Immunosuppression, University of Lyon1-Hospices Civils de Lyon-bioMérieux, Hôpital Edouard Herriot, 5 Place d'Arsonval, 69437, Lyon Cedex 3, France
- Olivier Tabone
  - EA 7426 Pathophysiology of Injury-induced Immunosuppression, University of Lyon1-Hospices Civils de Lyon-bioMérieux, Hôpital Edouard Herriot, 5 Place d'Arsonval, 69437, Lyon Cedex 3, France
- Julien Textoris
  - EA 7426 Pathophysiology of Injury-induced Immunosuppression, University of Lyon1-Hospices Civils de Lyon-bioMérieux, Hôpital Edouard Herriot, 5 Place d'Arsonval, 69437, Lyon Cedex 3, France
- Jean-Baptiste Veyrieras
  - Bioinformatics Research Department, bioMerieux, 376 Chemin de l'Orme, 69280, Marcy l'Etoile, France
- François Mallet
  - Joint research unit, Hospice Civils de Lyon, bioMerieux, Centre Hospitalier Lyon Sud, 165 Chemin du Grand Revoyet, 69310, Pierre-Benite, France
  - EA 7426 Pathophysiology of Injury-induced Immunosuppression, University of Lyon1-Hospices Civils de Lyon-bioMérieux, Hôpital Edouard Herriot, 5 Place d'Arsonval, 69437, Lyon Cedex 3, France
4
Abstract
Motivation: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. Results: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies. Availability and implementation: Source code is at https://github.com/marinkaz/fusenet. Contact: blaz.zupan@fri.uni-lj.si Supplementary information: Supplementary information is available at Bioinformatics online.
Affiliation(s)
- Marinka Žitnik
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
- Blaž Zupan
  - Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia and Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
5
Abstract
With the development of high-throughput genomic technologies, large, genome-wide datasets have been collected, and the integration of these datasets should provide large-scale, multidimensional, and insightful views of biological systems. We developed a method for gene association network construction based on gene expression data that integrate a variety of biological resources. Assuming gene expression data are from a multivariate Gaussian distribution, a graphical lasso (glasso) algorithm is able to estimate the sparse inverse covariance matrix by a lasso (L1) penalty. The inverse covariance matrix can be seen as direct correlation between gene pairs in the gene association network. In our work, instead of using a single penalty, different penalty values were applied for gene pairs based on a priori knowledge as to whether the two genes should be connected. The a priori information can be calculated or retrieved from other biological data, e.g., Gene Ontology similarity, protein-protein interaction, gene regulatory network. By incorporating prior knowledge, the weighted graphical lasso (wglasso) outperforms the original glasso both on simulations and on data from Arabidopsis. Simulation studies show that even when some prior knowledge is not correct, the overall quality of the wglasso network was still greater than when not incorporating that information, e.g., glasso.
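For reference, the penalized likelihood behind this description can be written out as follows. This is the standard textbook form of the graphical lasso, not notation taken from the paper: the symbols $S$ (sample covariance), $\Theta = (\theta_{ij})$ (precision matrix), $\lambda$ and $\lambda_{ij}$ (penalties) are assumptions introduced here, and the rule that maps prior knowledge to each $\lambda_{ij}$ is not given in the abstract.

```latex
% Graphical lasso: one penalty \lambda shared by all off-diagonal entries
\hat{\Theta}_{\mathrm{glasso}}
  = \arg\max_{\Theta \succ 0}\;
    \log\det\Theta - \operatorname{tr}(S\Theta)
    - \lambda \sum_{i \neq j} \lvert \theta_{ij} \rvert

% Weighted variant: a separate penalty \lambda_{ij} for each gene pair
\hat{\Theta}_{\mathrm{wglasso}}
  = \arg\max_{\Theta \succ 0}\;
    \log\det\Theta - \operatorname{tr}(S\Theta)
    - \sum_{i \neq j} \lambda_{ij} \lvert \theta_{ij} \rvert

% Partial correlation of genes i and j given all other genes
r_{ij \mid \mathrm{rest}}
  = -\frac{\theta_{ij}}{\sqrt{\theta_{ii}\,\theta_{jj}}}
```

Some formulations also penalize the diagonal entries. Under the weighted form, a smaller $\lambda_{ij}$ for a pair with prior support would make the corresponding edge easier to retain, which is one plausible reading of how the prior knowledge enters.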
6
Champion M, Cierco-Ayrolles C, Gadat S, Vignes M. Sparse regression and support recovery with L2-Boosting algorithms. J Stat Plan Inference 2014. [DOI: 10.1016/j.jspi.2014.07.006] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/01/2022]
7
Talluri R, Baladandayuthapani V, Mallick BK. Bayesian sparse graphical models and their mixtures. Stat (Int Stat Inst) 2014; 3:109-125. [PMID: 24948842 DOI: 10.1002/sta4.49] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 01/27/2023]
Abstract
We propose Bayesian methods for Gaussian graphical models that lead to sparse and adaptively shrunk estimators of the precision (inverse covariance) matrix. Our methods are based on lasso-type regularization priors leading to parsimonious parameterization of the precision matrix, which is essential in several applications involving learning relationships among the variables. In this context, we introduce a novel type of selection prior that develops a sparse structure on the precision matrix by making most of the elements exactly zero, in addition to ensuring positive definiteness - thus conducting model selection and estimation simultaneously. More importantly, we extend these methods to analyze clustered data using finite mixtures of Gaussian graphical models and infinite mixtures of Gaussian graphical models. We discuss appropriate posterior simulation schemes to implement posterior inference in the proposed models, including the evaluation of normalizing constants that are functions of parameters of interest, which result from the restriction of positive definiteness on the correlation matrix. We evaluate the operating characteristics of our method via several simulations and demonstrate the application to real data examples in genomics.
Affiliation(s)
- Rajesh Talluri
  - The University of Texas MD Anderson Cancer Center, Houston, TX, USA
8
Gao F, Kou P, Gao L, Guan X. Boosting regression methods based on a geometric conversion approach: Using SVMs base learners. Neurocomputing 2013. [DOI: 10.1016/j.neucom.2013.01.031] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
9
Lim N, Senbabaoglu Y, Michailidis G, d'Alché-Buc F. OKVAR-Boost: a novel boosting algorithm to infer nonlinear dynamics and interactions in gene regulatory networks. Bioinformatics 2013; 29:1416-23. [PMID: 23574736 PMCID: PMC3661057 DOI: 10.1093/bioinformatics/btt167] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/12/2023]
Abstract
MOTIVATION Reverse engineering of gene regulatory networks remains a central challenge in computational systems biology, despite recent advances facilitated by benchmark in silico challenges that have aided in calibrating their performance. A number of approaches using either perturbation (knock-out) or wild-type time-series data have appeared in the literature addressing this problem, with the latter using linear temporal models. Nonlinear dynamical models are particularly appropriate for this inference task, given the generation mechanism of the time-series data. In this study, we introduce a novel nonlinear autoregressive model based on operator-valued kernels that simultaneously learns the model parameters, as well as the network structure. RESULTS A flexible boosting algorithm (OKVAR-Boost) that shares features from L2-boosting and randomization-based algorithms is developed to perform the tasks of parameter learning and network inference for the proposed model. Specifically, at each boosting iteration, a regularized Operator-valued Kernel-based Vector AutoRegressive model (OKVAR) is trained on a random subnetwork. The final model consists of an ensemble of such models. The empirical estimation of the ensemble model's Jacobian matrix provides an estimation of the network structure. The performance of the proposed algorithm is first evaluated on a number of benchmark datasets from the DREAM3 challenge and then on real datasets related to the In vivo Reverse-Engineering and Modeling Assessment (IRMA) and T-cell networks. The high-quality results obtained strongly indicate that it outperforms existing approaches. AVAILABILITY The OKVAR-Boost Matlab code is available as the archive: http://amis-group.fr/sourcecode-okvar-boost/OKVARBoost-v1.0.zip. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
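The L2-boosting idea that OKVAR-Boost is said to borrow from can be illustrated with a much simpler base learner. The sketch below is not the authors' algorithm: it swaps the operator-valued kernel model and random subnetworks for plain componentwise linear regression, and the function name `l2_boost`, its parameters, and the toy data are all hypothetical. It only shows the core loop of repeatedly fitting the current residuals and adding a shrunken update.

```python
def l2_boost(X, y, n_iter=50, step=0.1):
    """Componentwise L2-boosting for a linear model without intercept.

    X is a list of rows, y a list of responses.
    Returns one coefficient per column of X.
    """
    p = len(X[0])
    coef = [0.0] * p
    residual = list(y)
    for _ in range(n_iter):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            col = [row[j] for row in X]
            norm = sum(v * v for v in col)
            if norm == 0.0:
                continue
            # Least-squares fit of the residuals on column j alone
            beta = sum(v * r for v, r in zip(col, residual)) / norm
            gain = beta * beta * norm  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        if best_j is None:
            break  # no column reduces the residuals any further
        # Shrunken update of the selected coefficient, then refresh residuals
        coef[best_j] += step * best_beta
        residual = [r - step * best_beta * row[best_j]
                    for r, row in zip(residual, X)]
    return coef


# Hypothetical toy data: the response depends on the first column only.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
coef = l2_boost(X, y, n_iter=200)
```

In this toy case only the first column is ever selected, so the second coefficient stays at zero. That selection behaviour is the loose analogy to reading a network structure off a boosted model.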
Affiliation(s)
- Néhémy Lim
  - IBISC EA 4526, Université d'Évry-Val d'Essonne, Évry, France
10
Abstract
High-throughput gene profiling studies have been extensively conducted, searching for markers associated with cancer development and progression. In this study, we analyse cancer prognosis studies with right censored survival responses. With gene expression data, we adopt the weighted gene co-expression network analysis (WGCNA) to describe the interplay among genes. In network analysis, nodes represent genes. There are subsets of nodes, called modules, which are tightly connected to each other. Genes within the same modules tend to have co-regulated biological functions. For cancer prognosis data with gene expression measurements, our goal is to identify cancer markers, while properly accounting for the network module structure. A two-step sparse boosting approach, called Network Sparse Boosting (NSBoost), is proposed for marker selection. In the first step, for each module separately, we use a sparse boosting approach for within-module marker selection and construct module-level 'super markers'. In the second step, we use the super markers to represent the effects of all genes within the same modules and conduct module-level selection using a sparse boosting approach. Simulation study shows that NSBoost can more accurately identify cancer-associated genes and modules than alternatives. In the analysis of breast cancer and lymphoma prognosis studies, NSBoost identifies genes with important biological implications. It outperforms alternatives including the boosting and penalization approaches by identifying a smaller number of genes/modules and/or having better prediction performance.
11
Nan X, Fu G, Zhao Z, Liu S, Patel RY, Liu H, Daga PR, Doerksen RJ, Dang X, Chen Y, Wilkins D. Leveraging domain information to restructure biological prediction. BMC Bioinformatics 2011; 12 Suppl 10:S22. [PMID: 22166097 PMCID: PMC3236845 DOI: 10.1186/1471-2105-12-s10-s22] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
Background It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task. Results We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all the possible restructured problems. It is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem always beats that of the original problem. 
Conclusions The proposed conditional entropy based metric is effective in identifying good partitions of a classification problem, hence enhancing the prediction performance.
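A rough feel for ranking attributes by conditional entropy can be had from the simplified sketch below. It is an assumption-laden stand-in, not the paper's metric: it uses the plain empirical entropy of the class label within each partition instead of the entropy achieved by optimal classifiers or the random-projection approximation, and all function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def conditional_entropy(attribute, labels):
    """Empirical H(label | attribute): per-partition entropy weighted by size."""
    groups = defaultdict(list)
    for a, y in zip(attribute, labels):
        groups[a].append(y)
    total = len(labels)
    return sum(len(g) / total * entropy(g) for g in groups.values())


def rank_attributes(attributes, labels):
    """Attribute names sorted by ascending conditional entropy."""
    return sorted(attributes,
                  key=lambda name: conditional_entropy(attributes[name], labels))


# Hypothetical toy data: "tissue" separates the classes, "batch" does not.
labels = ["pos", "pos", "neg", "neg"]
attributes = {
    "batch": ["a", "b", "a", "b"],
    "tissue": ["t1", "t1", "t2", "t2"],
}
ranking = rank_attributes(attributes, labels)
```

An attribute whose partitions are nearly pure in the class label gets a low score and is ranked first, which matches the abstract's notion of a partition that simplifies the learning task.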
Affiliation(s)
- Xiaofei Nan
  - Department of Computer and Information Science, University of Mississippi, USA
12
Logsdon BA, Mezey J. Gene expression network reconstruction by convex feature selection when incorporating genetic perturbations. PLoS Comput Biol 2010; 6:e1001014. [PMID: 21152011 PMCID: PMC2996324 DOI: 10.1371/journal.pcbi.1001014] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.5] [Received: 06/27/2010] [Accepted: 10/27/2010] [Indexed: 01/18/2023] Open
Abstract
Cellular gene expression measurements contain regulatory information that can be used to discover novel network relationships. Here, we present a new algorithm for network reconstruction powered by the adaptive lasso, a theoretically and empirically well-behaved method for selecting the regulatory features of a network. Any algorithms designed for network discovery that make use of directed probabilistic graphs require perturbations, produced by either experiments or naturally occurring genetic variation, to successfully infer unique regulatory relationships from gene expression data. Our approach makes use of appropriately selected cis-expression Quantitative Trait Loci (cis-eQTL), which provide a sufficient set of independent perturbations for maximum network resolution. We compare the performance of our network reconstruction algorithm to four other approaches: the PC-algorithm, QTLnet, the QDG algorithm, and the NEO algorithm, all of which have been used to reconstruct directed networks among phenotypes leveraging QTL. We show that the adaptive lasso can outperform these algorithms for networks of ten genes and ten cis-eQTL, and is competitive with the QDG algorithm for networks with thirty genes and thirty cis-eQTL, with rich topologies and hundreds of samples. Using this novel approach, we identify unique sets of directed relationships in Saccharomyces cerevisiae when analyzing genome-wide gene expression data for an intercross between a wild strain and a lab strain. We recover novel putative network relationships between a tyrosine biosynthesis gene (TYR1), and genes involved in endocytosis (RCY1), the spindle checkpoint (BUB2), sulfonate catabolism (JLP1), and cell-cell communication (PRM7). Our algorithm provides a synthesis of feature selection methods and graphical model theory that has the potential to reveal new directed regulatory relationships from the analysis of population level genetic and gene expression data. 
Determining a unique set of regulatory relationships underlying the observed expression of genes is a challenging problem, not only because of the many possible regulatory relationships, but also because highly distinct regulatory relationships can fit data equally well. In addition, most expression data-sets have relatively small sample sizes compared to the number of genes measured, causing high sampling variability that leads to a significant reduction in power and inflation of the false positive rate for any network reconstruction method. We propose a novel algorithm for network reconstruction that uses a theoretically and empirically well-behaved method for selecting regulatory features, while leveraging genetic perturbations arising from cis-expression Quantitative Trait Loci (cis-eQTL) to maximally resolve a network. Our algorithm has good performance for realistic sample sizes and can be used to identify a unique set of acyclic or cyclic regulatory relationships that explain observed gene expression.
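For readers unfamiliar with the adaptive lasso named in this abstract, its generic regression form is sketched below. This is the standard definition, not the paper's network-specific formulation with cis-eQTL: the symbols $y$, $X$, $\beta$, $\lambda$, $\gamma$ and the initial estimate $\tilde{\beta}$ are assumptions introduced here for illustration.

```latex
% Adaptive lasso: an L1 penalty with data-dependent weights w_j
\hat{\beta}
  = \arg\min_{\beta}\;
    \lVert y - X\beta \rVert_2^2
    + \lambda \sum_{j} w_j \lvert \beta_j \rvert,
\qquad
w_j = \frac{1}{\lvert \tilde{\beta}_j \rvert^{\gamma}},
\quad \gamma > 0

% \tilde{\beta} is an initial estimate, for example ordinary least squares
```

Coefficients with a large initial estimate receive a small weight and are penalized lightly, while those with a small initial estimate are pushed harder toward zero.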
Affiliation(s)
- Benjamin A. Logsdon
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
- Jason Mezey
  - Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York, United States of America
  - Department of Genetic Medicine, Weill Cornell Medical College, New York, New York, United States of America