1
Balasubramanian JB, Boes RD, Gopalakrishnan V. A novel approach to modeling multifactorial diseases using Ensemble Bayesian Rule classifiers. J Biomed Inform 2020; 107:103455. [PMID: 32497685 DOI: 10.1016/j.jbi.2020.103455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2019] [Revised: 03/26/2020] [Accepted: 05/10/2020] [Indexed: 10/24/2022]
Abstract
Modeling factors influencing disease phenotypes from biomarker profiling study datasets is a critical task in biomedicine. Such datasets are typically generated from high-throughput 'omic' technologies, which help examine disease mechanisms at an unprecedented resolution. These datasets are challenging because they are high-dimensional. The disease mechanisms they study are also complex because many diseases are multifactorial, resulting from the collective activity of several factors, each with a small effect. Bayesian rule learning (BRL) is a rule model inferred from learning Bayesian networks from data, and has been shown to be effective in modeling high-dimensional datasets. However, BRL is not efficient at modeling multifactorial diseases since it suffers from data fragmentation during learning. In this paper, we overcome this limitation by implementing and evaluating three types of ensemble model combination strategies with BRL: uniform combination (UC; same as Bagging), Bayesian model averaging (BMA), and Bayesian model combination (BMC), collectively called Ensemble Bayesian Rule Learning (EBRL). We also introduce a novel method to visualize EBRL models, called the Bayesian Rule Ensemble Visualizing tool (BREVity), which helps extract and interpret the most important rule patterns guiding the predictions made by the ensemble model. Our results using twenty-five public, high-dimensional gene expression datasets of multifactorial diseases suggest that EBRL models using UC and BMC both achieve better predictive performance than BMA and other classic machine learning methods. Furthermore, BMC is found to be more reliable than UC when the ensemble includes sub-optimal models resulting from the stochasticity of the model search process. Together, EBRL and BREVity provide researchers a promising and novel tool for modeling multifactorial diseases from high-dimensional datasets that leverages the strengths of ensemble methods for predictive performance, while also providing interpretable explanations for its predictions.
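To make the difference between two of the combination strategies named in this abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes each ensemble member outputs a class-1 probability and carries a log score (for example, a log marginal likelihood under a uniform model prior). The function names and the toy numbers are hypothetical and chosen only for illustration; BMC, which searches over combinations of members, is not shown.

```python
import math

def uniform_combination(member_probs):
    """UC: average the members' predicted probabilities with equal weight."""
    return sum(member_probs) / len(member_probs)

def bma_weights(log_scores):
    """Turn per-model log scores into normalized weights.

    Subtracting the maximum before exponentiating avoids underflow.
    Assumes a uniform prior over models (an assumption of this sketch).
    """
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def bma_combination(member_probs, log_scores):
    """BMA: weight each member's prediction by its normalized score."""
    weights = bma_weights(log_scores)
    return sum(w * p for w, p in zip(weights, member_probs))

# Toy values, not taken from the paper.
member_probs = [0.9, 0.6, 0.2]         # each member's P(class = 1)
log_scores = [-100.0, -105.0, -120.0]  # hypothetical log scores

uc = uniform_combination(member_probs)
bma = bma_combination(member_probs, log_scores)
print(f"UC estimate:  {uc:.4f}")
print(f"BMA estimate: {bma:.4f}")
```

With these toy scores, nearly all of the BMA weight falls on the single highest-scoring member, so the BMA estimate sits close to that member's prediction, whereas UC treats all three members equally. This concentration of weight on one model is a commonly cited reason BMA can behave more like model selection than like an ensemble.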
Affiliation(s)
- Jeya Balaji Balasubramanian
  - School of Computing and Information, Intelligent Systems Program, University of Pittsburgh, 135 N Bellefield Ave, Pittsburgh, PA 15213, United States
- Rebecca D Boes
  - Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite 500, Pittsburgh, PA 15206, United States
- Vanathi Gopalakrishnan
  - School of Computing and Information, Intelligent Systems Program, University of Pittsburgh, 135 N Bellefield Ave, Pittsburgh, PA 15213, United States
  - Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite 500, Pittsburgh, PA 15206, United States
2
Lustgarten JL, Zehnder A, Shipman W, Gancher E, Webb TL. Veterinary informatics: forging the future between veterinary medicine, human medicine, and One Health initiatives-a joint paper by the Association for Veterinary Informatics (AVI) and the CTSA One Health Alliance (COHA). JAMIA Open 2020; 3:306-317. [PMID: 32734172 PMCID: PMC7382640 DOI: 10.1093/jamiaopen/ooaa005] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 08/14/2019] [Revised: 12/26/2019] [Accepted: 02/26/2020] [Indexed: 12/25/2022] Open
Abstract
Objectives This manuscript reviews the current state of veterinary medical electronic health records and the ability to aggregate and analyze large datasets from multiple organizations and clinics. We also review analytical techniques as well as research efforts into veterinary informatics with a focus on applications relevant to human and animal medicine. Our goal is to provide references and context for these resources so that researchers can identify resources of interest and translational opportunities to advance the field.
Methods and Results This review briefly covers various methods of veterinary informatics, including natural language processing and machine learning techniques, as well as various ongoing and future projects. After detailing techniques and sources of data, we describe some of the challenges and opportunities within veterinary informatics and review common One Health techniques and specific applications that affect both humans and animals.
Discussion Current limitations in the field of veterinary informatics include limited sources of training data for developing machine learning and artificial intelligence algorithms; data siloed between academic institutions, corporate institutions, and many small private practices; and inconsistent data formats that make many integration problems difficult. Despite those limitations, there have been significant advancements in the field in the last few years and continued development of a few key, large data resources that are available for interested clinicians and researchers. These real-world use cases and applications show current and significant future potential as veterinary informatics grows in importance. Veterinary informatics can forge new possibilities within veterinary medicine and between veterinary medicine, human medicine, and One Health initiatives.
Affiliation(s)
- Jonathan L Lustgarten
  - Association for Veterinary Informatics, Dixon, California, USA; VCA Inc., Health Technology & Informatics, Los Angeles, California, USA
- Wayde Shipman
  - Veterinary Medical Databases, Columbia, Missouri, USA
- Elizabeth Gancher
  - Department of Infectious diseases and HIV medicine, Drexel University College of Medicine, Philadelphia, Pennsylvania, USA
- Tracy L Webb
  - Department of Clinical Sciences, Colorado State University, Fort Collins, Colorado, USA
3
Cai C, Cooper GF, Lu KN, Ma X, Xu S, Zhao Z, Chen X, Xue Y, Lee AV, Clark N, Chen V, Lu S, Chen L, Yu L, Hochheiser HS, Jiang X, Wang QJ, Lu X. Systematic discovery of the functional impact of somatic genome alterations in individual tumors through tumor-specific causal inference. PLoS Comput Biol 2019; 15:e1007088. [PMID: 31276486 PMCID: PMC6650088 DOI: 10.1371/journal.pcbi.1007088] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 01/25/2019] [Revised: 07/23/2019] [Accepted: 05/09/2019] [Indexed: 02/07/2023] Open
Abstract
Cancer is mainly caused by somatic genome alterations (SGAs). Precision oncology involves identifying and targeting tumor-specific aberrations resulting from causative SGAs. We developed a novel tumor-specific computational framework that finds the likely causative SGAs in an individual tumor and estimates their impact on oncogenic processes, which suggests the disease mechanisms that are acting in that tumor. This information can be used to guide precision oncology. We report a tumor-specific causal inference (TCI) framework, which estimates causative SGAs by modeling causal relationships between SGAs and molecular phenotypes (e.g., transcriptomic, proteomic, or metabolomic changes) within an individual tumor. We applied the TCI algorithm to tumors from The Cancer Genome Atlas (TCGA) and estimated for each tumor the SGAs that causally regulate the differentially expressed genes (DEGs) in that tumor. Overall, TCI identified 634 SGAs that are predicted to cause cancer-related DEGs in a significant number of tumors, including most of the previously known drivers and many novel candidate cancer drivers. The inferred causal relationships are statistically robust and biologically sensible, and multiple lines of experimental evidence support the predicted functional impact of both the well-known and the novel candidate drivers that are predicted by TCI. TCI provides a unified framework that integrates multiple types of SGAs and molecular phenotypes to estimate which genome perturbations are causally influencing one or more molecular/cellular phenotypes in an individual tumor. By identifying major candidate drivers and revealing their functional impact in an individual tumor, TCI sheds light on the disease mechanisms of that tumor, which can serve to advance our basic knowledge of cancer biology and to support precision oncology that provides tailored treatment of individual tumors.
Affiliation(s)
- Chunhui Cai
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Gregory F. Cooper
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Kevin N. Lu
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Xiaojun Ma
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
- Shuping Xu
  - Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, PA, United States of America
- Zhenlong Zhao
  - Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, PA, United States of America
- Xueer Chen
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Yifan Xue
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Adrian V. Lee
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
  - Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Magee Women’s Cancer Research Center, Pittsburgh, PA, United States of America
  - UPMC Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA, United States of America
- Nathan Clark
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
  - Department of Computational Biology and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
- Vicky Chen
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Songjian Lu
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Lujia Chen
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Liyue Yu
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Harry S. Hochheiser
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Xia Jiang
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
- Q. Jane Wang
  - Department of Pharmacology and Chemical Biology, University of Pittsburgh, Pittsburgh, PA, United States of America
  - * E-mail: (QJW); (XL)
- Xinghua Lu
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States of America
  - Center for Causal Discovery, Pittsburgh, PA, United States of America
  - UPMC Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA, United States of America
  - * E-mail: (QJW); (XL)
4
Balasubramanian JB, Gopalakrishnan V. Tunable structure priors for Bayesian rule learning for knowledge integrated biomarker discovery. World J Clin Oncol 2018; 9:98-109. [PMID: 30254965 PMCID: PMC6153126 DOI: 10.5306/wjco.v9.i5.98] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/26/2018] [Revised: 07/24/2018] [Accepted: 08/05/2018] [Indexed: 02/06/2023] Open
Abstract
AIM To develop a framework to incorporate background domain knowledge into classification rule learning for knowledge discovery in biomedicine.
METHODS Bayesian rule learning (BRL) is a rule-based classifier that uses a greedy best-first search over a space of Bayesian belief networks (BNs) to find the optimal BN to explain the input dataset, and then infers classification rules from this BN. BRL uses a Bayesian score to evaluate the quality of BNs. In this paper, we extended the Bayesian score to include informative structure priors, which encode our prior domain knowledge about the dataset. We call this extension of BRL BRLp. The structure prior has a λ hyperparameter that allows the user to tune the degree of incorporation of the prior knowledge in the model learning process. We studied the effect of λ on model learning using a simulated dataset and a real-world lung cancer prognostic biomarker dataset, by measuring the degree of incorporation of our specified prior knowledge. We also monitored its effect on the model's predictive performance. Finally, we compared BRLp to other state-of-the-art classifiers commonly used in biomedicine.
RESULTS We evaluated the degree of incorporation of prior knowledge into BRLp with simulated data by measuring the Graph Edit Distance between the true data-generating model and the model learned by BRLp. We specified the true model using informative structure priors. We observed that by increasing the value of λ we were able to increase the influence of the specified structure priors on model learning. With a large value of λ, BRLp returned the true model. This also led to a gain in predictive performance measured by area under the receiver operating characteristic curve (AUC). We then obtained a publicly available real-world lung cancer prognostic biomarker dataset and specified a known biomarker from literature [the epidermal growth factor receptor (EGFR) gene]. We again observed that larger values of λ led to an increased incorporation of EGFR into the final BRLp model. This relevant background knowledge also led to a gain in AUC.
CONCLUSION BRLp enables tunable structure priors to be incorporated during Bayesian classification rule learning that integrates data and knowledge as demonstrated using lung cancer biomarker data.
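The role of the λ hyperparameter described in this abstract can be illustrated with a small sketch. The abstract does not give the functional form of the score, so the additive form below (log likelihood plus λ times a log structure prior) is an assumption of this example, as are the function names, the two candidate structures, and every number. None of it is taken from the paper.

```python
def penalized_score(log_likelihood, log_prior, lam):
    """Hypothetical score: data fit plus a lambda-weighted structure prior."""
    return log_likelihood + lam * log_prior

# Two toy candidate structures (illustrative values only).
# The prior favors the structure that contains the known biomarker,
# while the data alone slightly favor the other one.
candidates = {
    "without_EGFR": {"log_likelihood": -50.0, "log_prior": -3.0},
    "with_EGFR": {"log_likelihood": -52.0, "log_prior": -0.5},
}

def best_structure(lam):
    """Return the name of the candidate with the highest score at this lambda."""
    return max(
        candidates,
        key=lambda name: penalized_score(
            candidates[name]["log_likelihood"],
            candidates[name]["log_prior"],
            lam,
        ),
    )

for lam in (0.0, 0.5, 1.0, 2.0):
    print(f"lambda = {lam}: selected {best_structure(lam)}")
```

In this toy setup the data-preferred structure wins at small λ and the prior-preferred structure wins once λ is large enough, which mirrors the qualitative behavior the abstract reports: larger λ increases the influence of the specified prior knowledge on the learned model.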
Affiliation(s)
- Jeya Balaji Balasubramanian
  - Intelligent Systems Program, School of Computing and Information, University of Pittsburgh, Pittsburgh, PA 15260, United States
- Vanathi Gopalakrishnan
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15206, United States