1
Zhang Z, Nishimura A, Trovão NS, Cherry JL, Holbrook AJ, Ji X, Lemey P, Suchard MA. Accelerating Bayesian inference of dependency between mixed-type biological traits. PLoS Comput Biol 2023; 19:e1011419. [PMID: 37639445 PMCID: PMC10491301 DOI: 10.1371/journal.pcbi.1011419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2022] [Revised: 09/08/2023] [Accepted: 08/09/2023] [Indexed: 08/31/2023] Open
Abstract
Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck: integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. This computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.
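The abstract's distinction between marginal and partial (conditional) correlation can be illustrated with a small sketch. This is not the authors' pipeline: it only applies the textbook first-order partial correlation formula to three hypothetical traits, and the function name and the toy correlation values are assumptions chosen for illustration.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y given z (textbook formula)."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator

# Hypothetical correlations for a chain of traits 0 - 1 - 2:
# traits 0 and 2 are related only through trait 1.
r01 = 1.0 / math.sqrt(3.0)
r12 = 1.0 / math.sqrt(3.0)
r02 = 1.0 / 3.0

# Traits 0 and 2 are marginally correlated (r02 = 1/3), yet their
# partial correlation given trait 1 vanishes up to rounding error.
pcor_02_given_1 = partial_correlation(r02, r01, r12)

# Traits 0 and 1 remain dependent after conditioning on trait 2.
pcor_01_given_2 = partial_correlation(r01, r02, r12)
```

With more than three traits, the same quantities are usually read off the inverse of the covariance matrix rather than computed pairwise.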
Affiliation(s)
- Zhenyu Zhang
  - Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, California, United States of America
- Akihiko Nishimura
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, United States of America
- Nídia S. Trovão
  - Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America
- Joshua L. Cherry
  - Division of International Epidemiology and Population Studies, Fogarty International Center, National Institutes of Health, Bethesda, Maryland, United States of America
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Andrew J. Holbrook
  - Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, California, United States of America
- Xiang Ji
  - Department of Mathematics, Tulane University, New Orleans, Louisiana, United States of America
- Philippe Lemey
  - Department of Microbiology, Immunology and Transplantation, Rega Institute, KU Leuven, Leuven, Belgium
- Marc A. Suchard
  - Department of Biostatistics, Fielding School of Public Health, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Biomathematics, University of California Los Angeles, Los Angeles, California, United States of America
  - Department of Human Genetics, University of California Los Angeles, Los Angeles, California, United States of America
2
Gogoshin G, Branciamore S, Rodin AS. Synthetic data generation with probabilistic Bayesian Networks. Mathematical Biosciences and Engineering: MBE 2021; 18:8603-8621. [PMID: 34814315 PMCID: PMC8848551 DOI: 10.3934/mbe.2021426] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 05/13/2023]
Abstract
Bayesian Network (BN) modeling is a prominent and increasingly popular computational systems biology method. It aims to construct network graphs from large heterogeneous biological datasets that reflect the underlying biological relationships. Currently, a variety of strategies exist for evaluating BN methodology performance, ranging from utilizing artificial benchmark datasets and models, to specialized biological benchmark datasets, to simulation studies that generate synthetic data from predefined network models. The last is arguably the most comprehensive approach; however, existing implementations often rely on explicit and implicit assumptions that may be unrealistic in a typical biological data analysis scenario, or are poorly equipped for automated arbitrary model generation. In this study, we develop a purely probabilistic simulation framework that addresses the demands of statistically sound simulation studies in an unbiased fashion. Additionally, we expand on our current understanding of the theoretical notions of causality and dependence / conditional independence in BNs and the Markov Blankets within.
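Generating synthetic data from a predefined network model, as mentioned in the abstract, is commonly done by ancestral (forward) sampling. The sketch below shows that generic technique for binary variables; it is not the framework developed in the paper, and the network structure, probability tables, and function name are all hypothetical.

```python
import random

def sample_network(order, parents, cpts, rng):
    """Draw one joint sample by visiting nodes in topological order."""
    sample = {}
    for node in order:
        # Look up P(node = 1 | parent values) for the values already drawn.
        parent_values = tuple(sample[p] for p in parents[node])
        p_one = cpts[node][parent_values]
        sample[node] = 1 if rng.random() < p_one else 0
    return sample

# Hypothetical three-node network: A -> B, and A, B -> C.
order = ["A", "B", "C"]
parents = {"A": (), "B": ("A",), "C": ("A", "B")}
cpts = {
    "A": {(): 0.3},
    "B": {(0,): 0.2, (1,): 0.9},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.95},
}

rng = random.Random(0)  # fixed seed so the synthetic data set is reproducible
data = [sample_network(order, parents, cpts, rng) for _ in range(1000)]
```

Because each node is sampled only after its parents, the draws follow the joint distribution encoded by the network, which is what makes such data usable as ground truth in simulation studies.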
Affiliation(s)
- Grigoriy Gogoshin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
- Sergio Branciamore
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
- Andrei S. Rodin
  - Department of Computational and Quantitative Medicine, Beckman Research Institute, and Diabetes and Metabolism Research Institute, City of Hope National Medical Center, 1500 East Duarte Road, Duarte, CA 91010 USA
3
Trilla-Fuertes L, Gámez-Pozo A, Arevalillo JM, López-Vacas R, López-Camacho E, Prado-Vázquez G, Zapater-Moros A, Díaz-Almirón M, Ferrer-Gómez M, Navarro H, Nanni P, Zamora P, Espinosa E, Maín P, Fresno Vara JÁ. Bayesian networks established functional differences between breast cancer subtypes. PLoS One 2020; 15:e0234752. [PMID: 32525929 PMCID: PMC7289386 DOI: 10.1371/journal.pone.0234752] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 03/23/2020] [Accepted: 06/01/2020] [Indexed: 12/15/2022] Open
Abstract
Breast cancer is a heterogeneous disease. In clinical practice, tumors are classified as hormonal receptor positive, Her2 positive and triple negative tumors. In previous work, our group defined a new hormonal receptor positive subgroup, the TN-like subtype, which had a prognosis and a molecular profile more similar to triple negative tumors. In this study, proteomics and Bayesian networks were used to characterize protein relationships in 96 breast tumor samples. Components obtained by these methods had a clear functional structure. The analysis of these components suggested differences in processes such as mitochondrial function or extracellular matrix between breast cancer subtypes, including our newly defined TN-like subtype. In addition, one of the components, mainly related to extracellular matrix processes, had prognostic value in this cohort. Functional approaches make it possible to build hypotheses about regulatory mechanisms and to establish new relationships among proteins in the breast cancer context.
Affiliation(s)
- Angelo Gámez-Pozo
  - Biomedica Molecular Medicine SL, Madrid, Spain
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
- Jorge M. Arevalillo
  - Operational Research and Numerical Analysis, National Distance Education University (UNED), Madrid, Spain
- Rocío López-Vacas
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
- Guillermo Prado-Vázquez
  - Biomedica Molecular Medicine SL, Madrid, Spain
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
- Andrea Zapater-Moros
  - Biomedica Molecular Medicine SL, Madrid, Spain
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
- María Ferrer-Gómez
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
- Hilario Navarro
  - Operational Research and Numerical Analysis, National Distance Education University (UNED), Madrid, Spain
- Paolo Nanni
  - Functional Genomics Centre Zurich, University of Zurich/ETH Zurich, Zurich, Switzerland
- Pilar Zamora
  - Medical Oncology Service, La Paz University Hospital-IdiPAZ, Madrid, Spain
- Enrique Espinosa
  - Medical Oncology Service, La Paz University Hospital-IdiPAZ, Madrid, Spain
  - Biomedical Research Networking Center on Oncology-CIBERONC, ISCIII, Madrid, Spain
- Paloma Maín
  - Department of Statistics and Operations Research, Faculty of Mathematics, Complutense University of Madrid, Madrid, Spain
- Juan Ángel Fresno Vara
  - Biomedica Molecular Medicine SL, Madrid, Spain
  - Molecular Oncology & Pathology Lab, Institute of Medical and Molecular Genetics-INGEMM, La Paz University Hospital-IdiPAZ, Madrid, Spain
  - Biomedical Research Networking Center on Oncology-CIBERONC, ISCIII, Madrid, Spain
4
Dos Santos JPR, Fernandes SB, McCoy S, Lozano R, Brown PJ, Leakey ADB, Buckler ES, Garcia AAF, Gore MA. Novel Bayesian Networks for Genomic Prediction of Developmental Traits in Biomass Sorghum. G3 (Bethesda, Md.) 2020; 10:769-781. [PMID: 31852730 PMCID: PMC7003104 DOI: 10.1534/g3.119.400759] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 09/22/2019] [Accepted: 12/15/2019] [Indexed: 11/23/2022]
Abstract
The ability to connect genetic information between traits over time allows Bayesian networks to offer a powerful probabilistic framework to construct genomic prediction models. In this study, we phenotyped a diversity panel of 869 biomass sorghum (Sorghum bicolor (L.) Moench) lines, which had been genotyped with 100,435 SNP markers, for plant height (PH) with biweekly measurements from 30 to 120 days after planting (DAP) and for end-of-season dry biomass yield (DBY) in four environments. We evaluated five genomic prediction models: Bayesian network (BN), Pleiotropic Bayesian network (PBN), Dynamic Bayesian network (DBN), multi-trait GBLUP (MTr-GBLUP), and multi-time GBLUP (MTi-GBLUP) models. In fivefold cross-validation, prediction accuracies ranged from 0.46 (PBN) to 0.49 (MTr-GBLUP) for DBY and from 0.47 (DBN, DAP120) to 0.75 (MTi-GBLUP, DAP60) for PH. Forward-chaining cross-validation further improved prediction accuracies of the DBN, MTi-GBLUP and MTr-GBLUP models for PH (training slice: 30-45 DAP) by 36.4-52.4% relative to the BN and PBN models. Coincidence indices (target: biomass, secondary: PH) and a coincidence index based on lines (PH time series) showed that the ranking of lines by PH changed minimally after 45 DAP. These results suggest that a two-level indirect selection method for PH at harvest (first-level target trait) and DBY (second-level target trait) could be conducted earlier in the season based on the ranking of lines by PH at 45 DAP (secondary trait). With the advance of high-throughput phenotyping technologies, our proposed two-level indirect selection framework could be valuable for enhancing genetic gain per unit of time when selecting on developmental traits.
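Forward-chaining cross-validation, as named in the abstract, trains only on earlier time points and evaluates on later ones. A minimal sketch of how such splits can be generated is given below; it is a generic expanding-window scheme, not the study's exact protocol, and the function name, the `min_train` parameter, and the list of measurement days are assumptions.

```python
def forward_chaining_splits(time_points, min_train=2):
    """Yield (train, test) pairs in which training always precedes testing."""
    ordered = sorted(time_points)
    for cut in range(min_train, len(ordered)):
        # The training window grows by one time point at each step.
        yield ordered[:cut], ordered[cut:]

# Hypothetical measurement days (days after planting); the exact spacing
# used in the study is not stated in the abstract.
days = [30, 45, 60, 75, 90, 105, 120]
splits = list(forward_chaining_splits(days))

for train, test in splits:
    print("train:", train, "test:", test)
```

The first split here trains on days 30 and 45 and tests on all later days, which mirrors the "training slice: 30-45 DAP" setting quoted in the abstract.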
Affiliation(s)
- Jhonathan P R Dos Santos
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, SP, Brazil
- Roberto Lozano
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
- Patrick J Brown
  - Section of Agricultural Plant Biology, Department of Plant Sciences, University of California Davis, 95616
- Andrew D B Leakey
  - Department of Crop Science
  - Institute for Genomic Biology
  - Department of Plant Biology, University of Illinois at Urbana Champaign, 61801
- Edward S Buckler
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
  - United States Department of Agriculture, Agricultural Research Service, R. W. Holley Center, Ithaca, New York 14853
  - Institute for Genomic Diversity, Cornell University, Ithaca, New York 14853
- Antonio A F Garcia
  - Department of Genetics, Luiz de Queiroz College of Agriculture, University of São Paulo, Piracicaba, SP, Brazil
- Michael A Gore
  - Plant Breeding and Genetics Section, School of Integrative Plant Science
5
Modeling miRNA-mRNA interactions that cause phenotypic abnormality in breast cancer patients. PLoS One 2017; 12:e0182666. [PMID: 28793339 PMCID: PMC5549916 DOI: 10.1371/journal.pone.0182666] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 04/27/2017] [Accepted: 07/13/2017] [Indexed: 01/04/2023] Open
Abstract
Background: The dysregulation of microRNAs (miRNAs) alters the expression levels of pro-oncogenic or tumor suppressive mRNAs in breast cancer and, in the long run, causes multiple biological abnormalities. Identifying such miRNA-mRNA interactions requires integrative analysis of miRNA-mRNA expression profile data. However, current approaches are limited in their ability to consider the regulatory relationship between miRNAs and mRNAs and to relate that relationship to phenotypic abnormality and cancer pathogenesis. Methodology/Findings: We modeled causal relationships between genomic expression and clinical data using a Bayesian Network (BN), with the goal of discovering miRNA-mRNA interactions that are associated with cancer pathogenesis. The Multiple Beam Search (MBS) algorithm learned interactions from data and discovered that hsa-miR-21, hsa-miR-10b, hsa-miR-448, and hsa-miR-96 interact with oncogenes such as CCND2, ESR1, MET, NOTCH1, TGFBR2 and TGFB1 that promote tumor metastasis, invasion, and cell proliferation. We also calculated the Bayesian network posterior probability (BNPP) for the models discovered by the MBS algorithm to validate true models with high likelihood. Conclusion/Significance: The MBS algorithm successfully learned miRNA and mRNA expression profile data using a BN, and identified miRNA-mRNA interactions that probabilistically affect breast cancer pathogenesis. The MBS algorithm is a potentially useful tool for identifying interacting gene pairs implicated by the deregulation of expression.
6
Gogoshin G, Boerwinkle E, Rodin AS. New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data. J Comput Biol 2016; 24:340-356. [PMID: 27681505 PMCID: PMC5372779 DOI: 10.1089/cmb.2016.0100] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022] Open
Abstract
Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology-type data exploration, including both generating new biological hypotheses and testing and validating existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, software scalability and usability on less-than-exotic computer hardware are a priority, as is the applicability of the algorithm and software to heterogeneous datasets containing many data types: single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, phenotypes, etc.
Affiliation(s)
- Grigoriy Gogoshin
  - Diabetes and Metabolism Research Institute, City of Hope, Duarte, California
- Eric Boerwinkle
  - Human Genetics Center, School of Public Health, University of Texas Health Science Center, Houston, Texas
  - Institute of Molecular Medicine, University of Texas Health Science Center, Houston, Texas
- Andrei S Rodin
  - Diabetes and Metabolism Research Institute, City of Hope, Duarte, California
7
Cai B, Jiang X. Revealing Biological Pathways Implicated in Lung Cancer from TCGA Gene Expression Data Using Gene Set Enrichment Analysis. Cancer Inform 2014; 13:113-21. [PMID: 25520551 PMCID: PMC4251186 DOI: 10.4137/cin.s13882] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 04/16/2014] [Revised: 09/05/2014] [Accepted: 09/09/2014] [Indexed: 12/11/2022] Open
Abstract
Analyzing biological system abnormalities in cancer patients based on measures of biological entities, such as gene expression levels, is an important and challenging problem. This paper applies existing methods, Gene Set Enrichment Analysis and Signaling Pathway Impact Analysis, to pathway abnormality analysis in lung cancer using microarray gene expression data. Gene expression data from studies of Lung Squamous Cell Carcinoma (LUSC) in The Cancer Genome Atlas project, and pathway gene set data from the Kyoto Encyclopedia of Genes and Genomes were used to analyze the relationship between pathways and phenotypes. Results, in the form of pathway rankings, indicate that some pathways may behave abnormally in LUSC. For example, both the cell cycle and viral carcinogenesis pathways ranked very high in LUSC. Furthermore, some pathways that are known to be associated with cancer, such as the p53 and the PI3K-Akt signal transduction pathways, were found to rank high in LUSC. Other pathways, such as bladder cancer and thyroid cancer pathways, were also ranked high in LUSC.
Affiliation(s)
- Binghuang Cai
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Xia Jiang
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
8
Neapolitan R, Jiang X. Inferring Aberrant Signal Transduction Pathways in Ovarian Cancer from TCGA Data. Cancer Inform 2014; 13:29-36. [PMID: 25392681 PMCID: PMC4216062 DOI: 10.4137/cin.s13881] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/25/2014] [Revised: 03/10/2014] [Accepted: 03/10/2014] [Indexed: 12/12/2022] Open
Abstract
This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and applying that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to the cancer genome atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).
Affiliation(s)
- Richard Neapolitan
  - Department of Preventive Medicine, Northwestern University, Feinberg School of Medicine, Chicago, IL, USA
- Xia Jiang
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA