1
Pačínková A, Popovici V. Using empirical biological knowledge to infer regulatory networks from multi-omics data. BMC Bioinformatics 2022; 23:351. [PMID: 35996085 PMCID: PMC9396869 DOI: 10.1186/s12859-022-04891-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/24/2022] [Accepted: 08/08/2022] [Indexed: 12/13/2022]
Abstract
Background: Integration of multi-omics data can provide a more complex view of the biological system consisting of different interconnected molecular components, the crucial aspect for developing novel personalised therapeutic strategies for complex diseases. Various tools have been developed to integrate multi-omics data. However, an efficient multi-omics framework for regulatory network inference at the genome level that incorporates prior knowledge is still to emerge. Results: We present IntOMICS, an efficient integrative framework based on Bayesian networks. IntOMICS systematically analyses gene expression, DNA methylation, copy number variation and biological prior knowledge to infer regulatory networks. IntOMICS complements the missing biological prior knowledge by so-called empirical biological knowledge, estimated from the available experimental data. Regulatory networks derived from IntOMICS provide deeper insights into the complex flow of genetic information on top of the increasing accuracy trend compared to a published algorithm designed exclusively for gene expression data. The ability to capture relevant crosstalks between multi-omics modalities is verified using known associations in microsatellite stable/instable colon cancer samples. Additionally, IntOMICS performance is compared with two algorithms for multi-omics regulatory network inference that can also incorporate prior knowledge in the inference framework. IntOMICS is also applied to detect potential predictive biomarkers in microsatellite stable stage III colon cancer samples. Conclusions: We provide IntOMICS, a framework for multi-omics data integration using a novel approach to biological knowledge discovery. IntOMICS is a powerful resource for exploratory systems biology and can provide valuable insights into the complex mechanisms of biological processes that have a vital role in personalised medicine.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04891-9.
Affiliation(s)
- Anna Pačínková
  - RECETOX, Faculty of Science, Masaryk University, Kotlarska 2, Brno, Czech Republic
  - Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, Czech Republic
- Vlad Popovici
  - RECETOX, Faculty of Science, Masaryk University, Kotlarska 2, Brno, Czech Republic
2
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 88] [Impact Index Per Article: 14.7] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022]
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Courtney R Armour
  - Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Chenxiao Hu
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Meng Mei
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Chuan Tian
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Thomas J Sharpton
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
  - Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Yuan Jiang
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
3
Abstract
BACKGROUND Computational network biology is an emerging interdisciplinary research area. Among many other network approaches, probabilistic graphical models provide a comprehensive probabilistic characterization of interaction patterns between molecules and the associated uncertainties. RESULTS In this article, we first review graphical models, including directed, undirected, and reciprocal graphs (RG), with an emphasis on the RG models that are curiously under-utilized in the biostatistics and bioinformatics literature. RGs strictly contain chain graphs as a special case and are suitable for modeling reciprocal causality, such as feedback mechanisms in molecular networks. We then extend the RG approach to modeling molecular networks by integrating DNA-, RNA- and protein-level data. We apply the extended RG method to The Cancer Genome Atlas multi-platform ovarian cancer data and reveal several interesting findings. CONCLUSIONS This study aims to review the basics of different probabilistic graphical models as well as recent developments in RG approaches for network modeling. The extension presented in this paper provides a principled and efficient way of integrating DNA copy number, DNA methylation, mRNA gene expression and protein expression.
Affiliation(s)
- Yang Ni
  - Department of Statistics and Data Sciences, The University of Texas at Austin, Austin, 78712 TX USA
- Peter Müller
  - Department of Mathematics, The University of Texas at Austin, Austin, 78712 TX USA
- Lin Wei
  - NorthShore University HealthSystem, Evanston, 60201 IL USA
- Yuan Ji
  - NorthShore University HealthSystem, Evanston, 60201 IL USA
  - Department of Public Health Sciences, The University of Chicago, Chicago, 60637 IL USA
4
Morris JS, Baladandayuthapani V. Statistical Contributions to Bioinformatics: Design, Modeling, Structure Learning, and Integration. Stat Modelling 2017; 17:245-289. [PMID: 29129969 PMCID: PMC5679480 DOI: 10.1177/1471082x17698255] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 12/24/2022]
Abstract
The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of Bioinformatics has emerged to deal with these challenges, and comprises many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration. In each of these areas, we highlight some key contributions and try to elucidate the key statistical principles underlying these methods and approaches. Our goals are to demonstrate major ways in which statisticians have contributed to bioinformatics, to encourage statisticians to get involved early in methods development as new technologies emerge, and to stimulate future methodological work based on the statistical principles elucidated in this article and utilizing all available information to uncover new biological insights.
Affiliation(s)
- Jeffrey S Morris
  - Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
5
Madsen T, Hobolth A, Jensen JL, Pedersen JS. Significance evaluation in factor graphs. BMC Bioinformatics 2017; 18:199. [PMID: 28359297 PMCID: PMC5374669 DOI: 10.1186/s12859-017-1614-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/29/2016] [Accepted: 03/24/2017] [Indexed: 01/05/2023]
Abstract
BACKGROUND Factor graphs provide a flexible and general framework for specifying probability distributions. They can capture a range of popular and recent models for the analysis of both genomics data and data from other scientific fields. Owing to the ever larger data sets encountered in genomics and the multiple-testing issues accompanying them, accurate significance evaluation is of great importance. We here address the problem of evaluating statistical significance of observations from factor graph models. RESULTS Two novel numerical approximations for evaluation of statistical significance are presented: the first uses importance sampling, and the second is based on a saddlepoint approximation. We develop algorithms to efficiently compute the approximations and compare them to naive sampling and the normal approximation. The individual merits of the methods are analysed both from a theoretical viewpoint and with simulations. A guideline for choosing between the normal approximation, saddlepoint approximation and importance sampling is also provided. Finally, the applicability of the methods is demonstrated with examples from cancer genomics, motif-analysis and phylogenetics. CONCLUSIONS The applicability of saddlepoint approximation and importance sampling is demonstrated on known models in the factor graph framework. Using the two methods we can substantially reduce computational cost without compromising accuracy. This contribution allows analyses of large datasets in the general factor graph framework.
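The contrast between naive sampling and importance sampling mentioned in this abstract can be illustrated with a deliberately simple stand-in. The sketch below is not the factor-graph algorithm of the paper: it assumes a single standard-normal score and estimates the tail probability P(Z > 4), and the function names, the threshold, and the shifted-normal proposal are all illustrative choices.

```python
import math
import random


def naive_tail_estimate(threshold, n_samples, rng):
    """Naive Monte Carlo: fraction of standard-normal draws above the threshold."""
    hits = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n_samples


def importance_tail_estimate(threshold, n_samples, rng):
    """Importance sampling with an assumed proposal N(threshold, 1) centred on the tail."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Likelihood ratio phi(x) / q(x) for target N(0, 1) and proposal N(threshold, 1).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def exact_tail(threshold):
    """Exact standard-normal upper tail probability, used only as a reference value."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


THRESHOLD = 4.0      # hypothetical score threshold
N_SAMPLES = 20000    # same budget for both estimators
NAIVE = naive_tail_estimate(THRESHOLD, N_SAMPLES, random.Random(0))
WEIGHTED = importance_tail_estimate(THRESHOLD, N_SAMPLES, random.Random(0))
EXACT = exact_tail(THRESHOLD)
```

With this budget the expected number of naive hits above the threshold is below one, so `NAIVE` is dominated by sampling noise, whereas roughly half of the proposal draws contribute to `WEIGHTED`. That gap is the usual motivation for importance sampling when small p-values matter.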
Affiliation(s)
- Tobias Madsen
  - Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus, Denmark
  - Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
- Asger Hobolth
  - Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
- Jens Ledet Jensen
  - Department of Mathematics, Aarhus University, Ny Munkegade 118, Aarhus, Denmark
- Jakob Skou Pedersen
  - Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus, Denmark
  - Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
6
An X, Hu J, Do KA. SIFORM: shared informative factor models for integration of multi-platform bioinformatic data. Bioinformatics 2016; 32:3279-3290. [PMID: 27381342 DOI: 10.1093/bioinformatics/btw295] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/09/2015] [Accepted: 04/28/2016] [Indexed: 12/11/2022]
Abstract
MOTIVATION High-dimensional omic data derived from different technological platforms have been extensively used to facilitate comprehensive understanding of disease mechanisms and to determine personalized health treatments. Numerous studies have integrated multi-platform omic data; however, few have efficiently and simultaneously addressed the problems that arise from high dimensionality and complex correlations. RESULTS We propose a statistical framework of shared informative factor models that can jointly analyze multi-platform omic data and explore their associations with a disease phenotype. The common disease-associated sample characteristics across different data types can be captured through the shared structure space, while the corresponding weights of genetic variables directly index the strengths of their association with the phenotype. Extensive simulation studies demonstrate the performance of the proposed method in terms of biomarker detection accuracy via comparisons with three popular regularized regression methods. We also apply the proposed method to The Cancer Genome Atlas lung adenocarcinoma dataset to jointly explore associations of mRNA expression and protein expression with smoking status. Many of the identified biomarkers belong to key pathways for lung tumorigenesis, some of which are known to show differential expression across smoking levels. We discover potential biomarkers that reveal different mechanisms of lung tumorigenesis between light smokers and heavy smokers. AVAILABILITY AND IMPLEMENTATION R code to implement the new method can be downloaded from http://odin.mdacc.tmc.edu/jhhu/ CONTACT: jhu@mdanderson.org.
Affiliation(s)
- Xuebei An
  - Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Jianhua Hu
  - Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Kim-Anh Do
  - Department of Biostatistics, the University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
7
Wang T, Ren Z, Ding Y, Fang Z, Sun Z, MacDonald ML, Sweet RA, Wang J, Chen W. FastGGM: An Efficient Algorithm for the Inference of Gaussian Graphical Model in Biological Networks. PLoS Comput Biol 2016; 12:e1004755. [PMID: 26872036 PMCID: PMC4752261 DOI: 10.1371/journal.pcbi.1004755] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.6] [Received: 09/30/2015] [Accepted: 01/14/2016] [Indexed: 11/19/2022]
Abstract
Biological networks provide additional information for the analysis of human diseases, beyond the traditional analysis that focuses on single variables. Gaussian graphical model (GGM), a probability model that characterizes the conditional dependence structure of a set of random variables by a graph, has wide applications in the analysis of biological networks, such as inferring interaction or comparing differential networks. However, existing approaches are either not statistically rigorous or are inefficient for high-dimensional data that include tens of thousands of variables for making inference. In this study, we propose an efficient algorithm to implement the estimation of GGM and obtain p-value and confidence interval for each edge in the graph, based on a recent proposal by Ren et al., 2015. Through simulation studies, we demonstrate that the algorithm is faster by several orders of magnitude than the current implemented algorithm for Ren et al. without losing any accuracy. Then, we apply our algorithm to two real data sets: transcriptomic data from a study of childhood asthma and proteomic data from a study of Alzheimer’s disease. We estimate the global gene or protein interaction networks for the disease and healthy samples. The resulting networks reveal interesting interactions and the differential networks between cases and controls show functional relevance to the diseases. In conclusion, we provide a computationally fast algorithm to implement a statistically sound procedure for constructing Gaussian graphical model and making inference with high-dimensional biological data. The algorithm has been implemented in an R package named “FastGGM”.

Gaussian graphical model (GGM), a probability model for characterizing conditional dependence among a set of random variables, has been widely used in studying biological networks. It is important and practical to make inference with rigorous statistical properties and high efficiency under a high-dimensional setting, which is common in biological systems that usually contain tens of thousands of molecular elements, such as genes and proteins. This work proposes a novel efficient algorithm, FastGGM, to implement asymptotically normal estimation of large GGM established by Ren et al. [1]. It quickly estimates the precision matrix, partial correlations, as well as p-values and confidence intervals for the graph. Simulation studies demonstrate our algorithm outperforms the current algorithm for Ren et al. and algorithms for some other estimation methods, and real data analyses further prove its efficiency in studying biological networks. In conclusion, FastGGM is a statistically sound and computationally fast algorithm for constructing GGM with high-dimensional data. An R package for implementation can be downloaded from http://www.pitt.edu/~wec47/FastGGM.html.
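The link between the precision matrix and partial correlations that this abstract refers to can be shown on a toy example. The sketch below is not the FastGGM estimator, which targets high-dimensional data and also returns p-values and confidence intervals; it only applies the textbook identity that the partial correlation between variables i and j equals -ω_ij / sqrt(ω_ii ω_jj), where ω is the inverse covariance matrix. The covariance values and function names are assumptions made for illustration.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    # The right-hand half of the augmented matrix now holds the inverse.
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlations from the precision matrix: -w_ij / sqrt(w_ii * w_jj)."""
    omega = invert(cov)
    n = len(omega)
    return [[1.0 if i == j else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(n)] for i in range(n)]


# Hypothetical covariance of a chain X1 - X2 - X3: X1 and X3 are marginally
# correlated (0.25) but conditionally independent given X2.
TOY_COV = [[1.0, 0.5, 0.25],
           [0.5, 1.0, 0.5],
           [0.25, 0.5, 1.0]]
PCOR = partial_correlations(TOY_COV)
```

In this toy chain the entry `PCOR[0][2]` vanishes up to rounding even though the marginal covariance between the first and third variables is 0.25, which is why a GGM draws an edge only where the partial correlation is nonzero. Direct inversion like this is only sensible when there are many more samples than variables; the high-dimensional setting described in the abstract needs a regularized or regression-based estimator instead.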
Affiliation(s)
- Ting Wang
  - Division of Pulmonary Medicine, Allergy and Immunology; Department of Pediatrics, Children’s Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Zhao Ren
  - Department of Statistics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Ying Ding
  - Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America
- Zhou Fang
  - Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America
- Zhe Sun
  - Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America
- Matthew L. MacDonald
  - Department of Psychiatry, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Robert A. Sweet
  - Department of Psychiatry, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
  - Department of Neurology, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
  - VISN 4 Mental Illness Research, Education and Clinical Center (MIRECC), VA Pittsburgh Healthcare System, Pittsburgh, Pennsylvania, United States of America
- Jieru Wang
  - Division of Pulmonary Medicine, Allergy and Immunology; Department of Pediatrics, Children’s Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
- Wei Chen
  - Division of Pulmonary Medicine, Allergy and Immunology; Department of Pediatrics, Children’s Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
  - Department of Biostatistics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America
  - Department of Human Genetics, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania, United States of America