1
Baizer L, Bures R, Nadkarni G, Reyes-Guzman C, Ladwa S, Cade B, Westover MB, Durmer J, de Zambotti M, Desai M, Parekh A, Si B, Fernandez-Mendoza J, Minor K, Mazzotti DR, Lee S, Katabi D, Kiss O, Spira AP, Morris J, Seixas A, Kioumourtzoglou MA, Bridges JFP, Brown M, Hale L, Purcell S. Big data approaches for novel mechanistic insights on sleep and circadian rhythms: a workshop summary. Sleep 2025; 48:zsaf035. [PMID: 39945146 PMCID: PMC12163129 DOI: 10.1093/sleep/zsaf035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 02/01/2025] [Indexed: 02/19/2025]
Abstract
The National Center on Sleep Disorders Research of the National Heart, Lung, and Blood Institute at the National Institutes of Health hosted a 2-day virtual workshop titled Big Data Approaches for Novel Mechanistic Insights on Disorders of Sleep and Circadian Rhythms on May 2nd and 3rd, 2024. The goals of this workshop were to establish a comprehensive understanding of the current state of sleep and circadian rhythm disorders research to identify opportunities to advance the field by using approaches based on artificial intelligence and machine learning. The workshop showcased rapidly developing technologies for sensitive and comprehensive remote analysis of sleep and its disorders that can account for physiological, environmental, and social influences, potentially leading to novel insights on long-term health consequences of sleep disorders and disparities of these health problems in specific populations.
Affiliation(s)
- Lawrence Baizer
  - National Center on Sleep Disorders Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- Regina Bures
  - National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- Girish Nadkarni
  - The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Sweta Ladwa
  - National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- Brian Cade
  - Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
- Jeffrey Durmer
  - Sleep & Circadian Science, Absolute Rest, Denver, CO, USA
- Manisha Desai
  - Quantitative Sciences Unit, Stanford University Medical School, Stanford, CA, USA
- Ankit Parekh
  - Division of Pulmonary, Critical Care and Sleep Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Bing Si
  - School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ, USA
- Julio Fernandez-Mendoza
  - Penn State College of Medicine Sleep Research and Treatment Center, Pennsylvania State University College of Medicine, Hershey, PA, USA
- Kelton Minor
  - Department of Environmental Health Sciences, Columbia University Mailman School of Public Health, New York, NY, USA
- Diego R Mazzotti
  - Division of Medical Informatics, University of Kansas Medical Center, Kansas City, KS, USA
- Soomi Lee
  - Department of Human Development and Family Studies, Center for Healthy Aging, Pennsylvania State University, University Park, PA, USA
- Dina Katabi
  - MIT Center for Wireless Networks and Mobile Computing, Massachusetts Institute of Technology, Cambridge, MA, USA
- Orsolya Kiss
  - Center for Health Sciences, SRI International, Menlo Park, CA, USA
- Adam P Spira
  - Department of Mental Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Jonna Morris
  - School of Nursing, University of Pittsburgh, Pittsburgh, PA, USA
- Azizi Seixas
  - Department of Psychiatry and Behavioral Sciences, University of Miami, Miami, FL, USA
- Marishka Brown
  - National Center on Sleep Disorders Research, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD, USA
- Lauren Hale
  - Renaissance School of Medicine, Stony Brook University, Stony Brook, NY, USA
- Shaun Purcell
  - Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
2
Tran D, Nguyen H, Pham VD, Nguyen P, Nguyen Luu H, Minh Phan L, Blair DeStefano C, Jim Yeung SC, Nguyen T. A comprehensive review of cancer survival prediction using multi-omics integration and clinical variables. Brief Bioinform 2025; 26:bbaf150. [PMID: 40221959 PMCID: PMC11994034 DOI: 10.1093/bib/bbaf150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2024] [Revised: 01/29/2025] [Accepted: 03/19/2025] [Indexed: 04/15/2025]
Abstract
Cancer is an umbrella term that includes a wide spectrum of disease severity, from those that are malignant, metastatic, and aggressive to benign lesions with very low potential for progression or death. The ability to prognosticate patient outcomes would facilitate management of various malignancies: patients whose cancer is likely to advance quickly would receive necessary treatment that is commensurate with the predicted biology of the disease. Former prognostic models based on clinical variables (age, gender, cancer stage, tumor grade, etc.), though helpful, cannot account for genetic differences, molecular etiology, tumor heterogeneity, and important host biological mechanisms. Therefore, recent prognostic models have shifted toward the integration of complementary information available in both molecular data and clinical variables to better predict patient outcomes: vital status (overall survival), metastasis (metastasis-free survival), and recurrence (progression-free survival). In this article, we review 20 survival prediction approaches that integrate multi-omics and clinical data to predict patient outcomes. We discuss their strategies for modeling survival time (continuous and discrete), the incorporation of molecular measurements and clinical variables into risk models (clinical and multi-omics data), how to cope with censored patient records, the effectiveness of data integration techniques, prediction methodologies, model validation, and assessment metrics. The goal is to inform life scientists of available resources, and to provide a complete review of important building blocks in survival prediction. At the same time, we thoroughly describe the pros and cons of each methodology, and discuss in depth the outstanding challenges that need to be addressed in future method development.
Affiliation(s)
- Dao Tran
  - Department of Computer Science and Software Engineering, Auburn University, 345 W Magnolia Avenue, Auburn, AL 36849, United States
- Ha Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, 345 W Magnolia Avenue, Auburn, AL 36849, United States
- Van-Dung Pham
  - Department of Computer Science and Software Engineering, Auburn University, 345 W Magnolia Avenue, Auburn, AL 36849, United States
- Phuong Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, 345 W Magnolia Avenue, Auburn, AL 36849, United States
- Hung Nguyen Luu
  - UPMC Hillman Cancer Center, University of Pittsburgh Medical Center, 5150 Centre Avenue, Pittsburgh, PA 15232, United States
  - Department of Epidemiology, School of Public Health, University of Pittsburgh, 130 De Soto Street, Pittsburgh, PA 15261, United States
- Liem Minh Phan
  - David Grant USAF Medical Center, Clinical Investigation Facility, 60 Medical Group, Defense Health Agency, 101 Bodin Circle, Travis Air Force Base, CA 94535, United States
- Christin Blair DeStefano
  - Walter Reed National Military Medical Center, Defense Health Agency, 8901 Rockville Pike, Bethesda, MD 20889, United States
- Sai-Ching Jim Yeung
  - Department of Emergency Medicine, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Houston, TX 77030, United States
- Tin Nguyen
  - Department of Computer Science and Software Engineering, Auburn University, 345 W Magnolia Avenue, Auburn, AL 36849, United States
3
Goedhart JM, Klausch T, Janssen J, van de Wiel MA. Adaptive Use of Co-Data Through Empirical Bayes for Bayesian Additive Regression Trees. Stat Med 2025; 44:e70004. [PMID: 39964672 PMCID: PMC11834989 DOI: 10.1002/sim.70004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Revised: 11/26/2024] [Accepted: 01/07/2025] [Indexed: 02/20/2025]
Abstract
For clinical prediction applications, we are often faced with small sample size data compared to the number of covariates. Such data pose problems for variable selection and prediction, especially when the covariate-response relationship is complicated. To address these challenges, we propose to incorporate external information on the covariates into Bayesian additive regression trees (BART), a sum-of-trees prediction model that utilizes priors on the tree parameters to prevent overfitting. To incorporate external information, an empirical Bayes (EB) framework is developed that estimates, assisted by a model, prior covariate weights in the BART model. The proposed EB framework enables the estimation of the other prior parameters of BART as well, rendering an appealing and computationally efficient alternative to cross-validation. We show that the method finds relevant covariates and that it improves prediction compared to default BART in simulations. If the covariate-response relationship is non-linear, the method benefits from the flexibility of BART to outperform regression-based learners. Finally, the benefit of incorporating external information is shown in an application to diffuse large B-cell lymphoma prognosis based on clinical covariates, gene mutations, DNA translocations, and DNA copy number data.
Affiliation(s)
- Jeroen M. Goedhart
  - Department of Epidemiology & Data Science, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers Location AMC, Noord Holland, The Netherlands
- Thomas Klausch
  - Department of Epidemiology & Data Science, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers Location AMC, Noord Holland, The Netherlands
- Jurriaan Janssen
  - Department of Pathology, Cancer Center Amsterdam, Amsterdam University Medical Centers Location VUMC, Noord Holland, The Netherlands
- Mark A. van de Wiel
  - Department of Epidemiology & Data Science, Amsterdam Public Health Research Institute, Amsterdam University Medical Centers Location AMC, Noord Holland, The Netherlands
4
Sparapani RA, Maiers M, Spellman SR, Shaw BE, Laud PW, Devine SM, Logan BR. Optimal Donor Selection Across Multiple Outcomes For Hematopoietic Stem Cell Transplantation By Bayesian Nonparametric Machine Learning. medRxiv 2024:2024.05.09.24307134. [PMID: 38766030 PMCID: PMC11100939 DOI: 10.1101/2024.05.09.24307134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
Allogeneic hematopoietic cell transplantation (HCT) is one of the only curative treatment options for patients suffering from life-threatening hematologic malignancies; yet the possible adverse complications can be serious, even fatal. Matching between donor and recipient for 4 of the HLA genes is widely accepted and supported by the literature. However, among 8/8 allele-matched unrelated donors, there is less agreement among centers and transplant physicians about how to prioritize donor characteristics like additional HLA loci (DPB1 and DQB1), donor sex/parity, CMV status, and age to optimize transplant outcomes. This leads to donor selection practice that varies from patient to patient or with center protocols. Furthermore, different donor characteristics may impact different post-transplant outcomes beyond mortality, including disease relapse, graft failure/rejection, and chronic graft-versus-host disease (components of event-free survival, EFS). We develop a general methodology to identify optimal treatment decisions by considering the trade-offs on multiple outcomes modeled using Bayesian nonparametric machine learning. We apply the proposed approach to the problem of donor selection to optimize overall survival and event-free survival, using a large outcomes registry of HCT recipients and their actual and potential donors from the Center for International Blood and Marrow Transplant Research (CIBMTR). Our approach leads to a donor selection strategy that favors the youngest male donor, except when there is a female donor who is substantially younger.
Affiliation(s)
- Rodney A Sparapani
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA
- Martin Maiers
  - CIBMTR (Center for International Blood and Marrow Transplant Research), NMDP, Minneapolis, MN, USA
- Stephen R Spellman
  - CIBMTR (Center for International Blood and Marrow Transplant Research), NMDP, Minneapolis, MN, USA
- Bronwen E Shaw
  - CIBMTR, Department of Medicine, Medical College of Wisconsin, Milwaukee, WI, USA
- Purushottam W Laud
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA
- Steven M Devine
  - CIBMTR (Center for International Blood and Marrow Transplant Research), NMDP, Minneapolis, MN, USA
- Brent R Logan
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA
5
Wang S, Puggioni G, Wu J, Meador KJ, Caffrey A, Wyss R, Slaughter JL, Suzuki E, Ward KE, Lewkowitz AK, Wen X. Prenatal Exposure to Opioids and Neurodevelopmental Disorders in Children: A Bayesian Mediation Analysis. Am J Epidemiol 2024; 193:308-322. [PMID: 37671942 PMCID: PMC11484615 DOI: 10.1093/aje/kwad183] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/28/2022] [Revised: 06/08/2023] [Accepted: 09/02/2023] [Indexed: 09/07/2023]
Abstract
This study explores natural direct and joint natural indirect effects (JNIE) of prenatal opioid exposure on neurodevelopmental disorders (NDDs) in children mediated through pregnancy complications, major and minor congenital malformations, and adverse neonatal outcomes, using Medicaid claims linked to vital statistics in Rhode Island, United States, 2008-2018. A Bayesian mediation analysis with elastic net shrinkage prior was developed to estimate mean time to NDD diagnosis ratio using posterior mean and 95% credible intervals (CrIs) from Markov chain Monte Carlo algorithms. Simulation studies showed desirable model performance. Of 11,176 eligible pregnancies, 332 had ≥2 dispensations of prescription opioids anytime during pregnancy, including 200 (1.8%) having ≥1 dispensation in the first trimester (T1), 169 (1.5%) in the second (T2), and 153 (1.4%) in the third (T3). A significant JNIE of opioid exposure was observed in each trimester (T1, JNIE = 0.97, 95% CrI: 0.95, 0.99; T2, JNIE = 0.97, 95% CrI: 0.95, 0.99; T3, JNIE = 0.96, 95% CrI: 0.94, 0.99). The proportion of JNIE in each trimester was 17.9% (T1), 22.4% (T2), and 56.3% (T3). In conclusion, adverse pregnancy and birth outcomes jointly mediated the association between prenatal opioid exposure and accelerated time to NDD diagnosis. The proportion of JNIE increased as the timing of opioid exposure approached delivery.
Affiliation(s)
- Xuerong Wen
  - Correspondence to Dr. Xuerong Wen, Department of Pharmacy Practice, College of Pharmacy, University of Rhode Island, 7 Greenhouse Road, Kingston, RI 02881
6
Payne RD, Guha N, Mallick BK. A Bayesian survival treed hazards model using latent Gaussian processes. Biometrics 2024; 80:ujad009. [PMID: 38364805 DOI: 10.1093/biomtc/ujad009] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/21/2022] [Revised: 06/27/2023] [Accepted: 11/12/2023] [Indexed: 02/18/2024]
Abstract
Survival models are used to analyze time-to-event data in a variety of disciplines. Proportional hazard models provide interpretable parameter estimates, but proportional hazard assumptions are not always appropriate. Non-parametric models are more flexible but often lack a clear inferential framework. We propose a Bayesian treed hazards partition model that is both flexible and inferential. Inference is obtained through the posterior tree structure and flexibility is preserved by modeling the log-hazard function in each partition using a latent Gaussian process. An efficient reversible jump Markov chain Monte Carlo algorithm is accomplished by marginalizing the parameters in each partition element via a Laplace approximation. Consistency properties for the estimator are established. The method can be used to help determine subgroups as well as prognostic and/or predictive biomarkers in time-to-event data. The method is compared with some existing methods on simulated data and a liver cirrhosis dataset.
Affiliation(s)
- Richard D Payne
  - Eli Lilly & Company, Lilly Corporate Center, Indianapolis, IN, 46285, United States
- Nilabja Guha
  - Department of Mathematical Sciences, University of Massachusetts Lowell, One University Avenue, Lowell, Massachusetts, 01852, United States
- Bani K Mallick
  - Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX, 77843-3143, United States
7
Li X, Logan BR, Hossain SMF, Moodie EEM. Dynamic Treatment Regimes Using Bayesian Additive Regression Trees for Censored Outcomes. Lifetime Data Analysis 2024; 30:181-212. [PMID: 37659991 PMCID: PMC10764602 DOI: 10.1007/s10985-023-09605-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2022] [Accepted: 07/16/2023] [Indexed: 09/04/2023]
Abstract
To achieve the goal of providing the best possible care to each individual under their care, physicians need to customize treatments for individuals with the same health state, especially when treating diseases that can progress further and require additional treatments, such as cancer. Making decisions at multiple stages as a disease progresses can be formalized as a dynamic treatment regime (DTR). Most of the existing optimization approaches for estimating dynamic treatment regimes including the popular method of Q-learning were developed in a frequentist context. Recently, a general Bayesian machine learning framework that facilitates using Bayesian regression modeling to optimize DTRs has been proposed. In this article, we adapt this approach to censored outcomes using Bayesian additive regression trees (BART) for each stage under the accelerated failure time modeling framework, along with simulation studies and a real data example that compare the proposed approach with Q-learning. We also develop an R wrapper function that utilizes a standard BART survival model to optimize DTRs for censored outcomes. The wrapper function can easily be extended to accommodate any type of Bayesian machine learning model.
Affiliation(s)
- Xiao Li
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA
- Brent R Logan
  - Division of Biostatistics, Medical College of Wisconsin, Milwaukee, WI, USA
8
Sparapani R, Logan B, Maiers M, Laud P, McCulloch R. Nonparametric failure time: Time-to-event machine learning with heteroskedastic Bayesian additive regression trees and low information omnibus Dirichlet process mixtures. Biometrics 2023; 79:3023-3037. [PMID: 36932826 PMCID: PMC10505620 DOI: 10.1111/biom.13857] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/02/2022] [Accepted: 02/22/2023] [Indexed: 03/19/2023]
Abstract
Many popular survival models rely on restrictive parametric, or semiparametric, assumptions that could provide erroneous predictions when the effects of covariates are complex. Modern advances in computational hardware have led to an increasing interest in flexible Bayesian nonparametric methods for time-to-event data such as Bayesian additive regression trees (BART). We propose a novel approach that we call nonparametric failure time (NFT) BART in order to increase the flexibility beyond accelerated failure time (AFT) and proportional hazard models. NFT BART has three key features: (1) a BART prior for the mean function of the event time logarithm; (2) a heteroskedastic BART prior to deduce a covariate-dependent variance function; and (3) a flexible nonparametric error distribution using Dirichlet process mixtures (DPM). Our proposed approach widens the scope of hazard shapes including nonproportional hazards, can be scaled up to large sample sizes, naturally provides estimates of uncertainty via the posterior and can be seamlessly employed for variable selection. We provide convenient, user-friendly, computer software that is freely available as a reference implementation. Simulations demonstrate that NFT BART maintains excellent performance for survival prediction especially when AFT assumptions are violated by heteroskedasticity. We illustrate the proposed approach on a study examining predictors for mortality risk in patients undergoing hematopoietic stem cell transplant (HSCT) for blood-borne cancer, where heteroskedasticity and nonproportional hazards are likely present.
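The three features listed in this abstract can be read as the parts of a single regression equation for the log event time. The following is only a sketch under assumed notation, none of which appears in the abstract: $t_i$ is the event time, $x_i$ the covariates, $\mu$ an intercept, $f$ the mean function, $s$ the covariate-dependent scale, and $\epsilon_i$ the error term.

```latex
\log t_i = \mu + f(x_i) + s(x_i)\,\epsilon_i, \qquad
f \sim \mathrm{BART}, \quad
s^2 \sim \text{heteroskedastic BART}, \quad
\epsilon_i \sim \mathrm{DPM}
```

In this reading, holding $s(x_i)$ constant and fixing the distribution of $\epsilon_i$ would reduce the equation to an accelerated failure time form, which is the restriction the abstract says NFT BART is meant to relax.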
Affiliation(s)
- R.A. Sparapani
  - Medical College of Wisconsin, Division of Biostatistics, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA
- B.R. Logan
  - Medical College of Wisconsin, Division of Biostatistics, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA
- M.J. Maiers
  - National Marrow Donor Program, 500 N 5th St., Minneapolis, MN 55401, USA
- P.W. Laud
  - Medical College of Wisconsin, Division of Biostatistics, 8701 Watertown Plank Road, Milwaukee, WI 53226, USA
- R.E. McCulloch
  - Arizona State University, School of Mathematical and Statistical Sciences, 528 Wexler Hall, Tempe, AZ 85281, USA
9
Zhang L, Arabameri A, Santosh M, Pal SC. Land subsidence susceptibility mapping: comparative assessment of the efficacy of the five models. Environmental Science and Pollution Research International 2023:10.1007/s11356-023-27799-0. [PMID: 37266775 DOI: 10.1007/s11356-023-27799-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/31/2023] [Accepted: 05/17/2023] [Indexed: 06/03/2023]
Abstract
Land subsidence (LS) is a major geological and hydrological hazard that poses a serious threat to safety and security. The various triggers of LS include intense extraction from aquifer bodies. In this study, we present an LS inventory map of the Daumeghan plain of Iran using 123 LS and 123 non-LS locations which were identified through field survey. Fourteen LS causative factors related to topography, geology, hydrology, and anthropogenic characteristics were selected based on a multi-collinearity test. Based on the results, five susceptibility maps were generated employing models and input data. The LS susceptibility models were evaluated and validated using the receiver operating characteristic (ROC) curve and statistical indices. The results indicate that the LS susceptibility maps produced have good accuracy in predicting the spatial distribution of LS in the study area. The results showed that the optimization models BA and GWO performed better than the other machine learning algorithms (MLAs). The BA model achieved an area under the ROC curve (AUROC) of 96.6%, followed by GWO (95.8%), BART (94.5%), BRT (93.1%), and SVR (92.7%). The LS susceptibility maps formulated in our study can serve as a useful tool for formulating mitigation strategies and for better land-use planning.
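For readers unfamiliar with the AUROC figures quoted in this abstract, the metric can be computed as the probability that a randomly chosen positive location receives a higher score than a randomly chosen negative one. The sketch below illustrates that definition only; the function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    labels: iterable of 0/1 (1 = subsidence observed, hypothetical coding)
    scores: iterable of model susceptibility scores, same length as labels
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up for illustration): three of four pairs are ranked correctly.
toy_auc = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of locations, which is acceptable for an inventory of a few hundred points like the one described here; a rank-based implementation would be preferable for larger data.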
Affiliation(s)
- Lei Zhang
  - Yantai Nanshan University, Yantai, 265713, China
  - China University of Mining and Technology (Beijing), Beijing, 100083, China
- Alireza Arabameri
  - Department of Geomorphology, Tarbiat Modares University, Tehran, 14117-13116, Iran
- M Santosh
  - School of Earth Sciences and Resources, China University of Geosciences Beijing, Beijing, China
  - Department of Earth Sciences, University of Adelaide, Adelaide, South Australia, Australia
- Subodh Chandra Pal
  - Department of Geography, The University of Burdwan, Bardhaman, West Bengal, 713104, India
10
Salerno S, Li Y. High-Dimensional Survival Analysis: Methods and Applications. Annual Review of Statistics and Its Application 2023; 10:25-49. [PMID: 36968638 PMCID: PMC10038209 DOI: 10.1146/annurev-statistics-032921-022127] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Indexed: 06/18/2023]
Abstract
In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to over-fitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. We will review various cutting-edge methods that handle survival outcome data with high-dimensional predictors, highlighting recent innovations in machine learning approaches for survival prediction. We will cover the statistical intuitions and principles behind these methods and conclude with extensions to more complex settings, where competing events are observed. We exemplify these methods with applications to the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.
Affiliation(s)
- Stephen Salerno
  - Department of Biostatistics, University of Michigan, Ann Arbor, United States, 48109
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, United States, 48109
11
Dorie V, Perrett G, Hill JL, Goodrich B. Stan and BART for Causal Inference: Estimating Heterogeneous Treatment Effects Using the Power of Stan and the Flexibility of Machine Learning. Entropy (Basel, Switzerland) 2022; 24:1782. [PMID: 36554187 PMCID: PMC9778579 DOI: 10.3390/e24121782] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Revised: 10/22/2022] [Accepted: 11/06/2022] [Indexed: 06/17/2023]
Abstract
A wide range of machine-learning-based approaches have been developed in the past decade, increasing our ability to accurately model nonlinear and nonadditive response surfaces. This has improved performance for inferential tasks such as estimating average treatment effects in situations where standard parametric models may not fit the data well. These methods have also shown promise for the related task of identifying heterogeneous treatment effects. However, the estimation of both overall and heterogeneous treatment effects can be hampered when data are structured within groups if we fail to correctly model the dependence between observations. Most machine learning methods do not readily accommodate such structure. This paper introduces a new algorithm, stan4bart, that combines the flexibility of Bayesian Additive Regression Trees (BART) for fitting nonlinear response surfaces with the computational and statistical efficiencies of using Stan for the parametric components of the model. We demonstrate how stan4bart can be used to estimate average, subgroup, and individual-level treatment effects with stronger performance than other flexible approaches that ignore the multilevel structure of the data as well as multilevel approaches that have strict parametric forms.
Affiliation(s)
- George Perrett
  - Department of Applied Statistics, Social Science, and the Humanities, New York University, New York, NY 10003, USA
- Jennifer L. Hill
  - Department of Applied Statistics, Social Science, and the Humanities, New York University, New York, NY 10003, USA
- Benjamin Goodrich
  - Department of Political Science, Columbia University, New York, NY 10025, USA
12
Semiparametric Survival Analysis of 30-Day Hospital Readmissions with Bayesian Additive Regression Kernel Model. Stats 2022. [DOI: 10.3390/stats5030038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
In this paper, we introduce a kernel-based nonlinear Bayesian model for right-censored survival outcome data. Our kernel-based approach provides a flexible nonparametric modeling framework to explore nonlinear relationships between predictors and right-censored survival outcomes. Our proposed kernel-based model is shown to provide excellent predictive performance via several simulation studies and real-life examples. Unplanned hospital readmissions greatly impair patients’ quality of life and have imposed a significant economic burden on American society. In this paper, we focus our application on predicting 30-day readmissions of patients. Our survival Bayesian additive regression kernel model (survival BARK or sBARK) improves the timeliness of readmission preventive intervention through a data-driven approach.
13
Chu J, Sun N, Hu W, Chen X, Yi N, Shen Y. Bayesian hierarchical lasso Cox model: A 9-gene prognostic signature for overall survival in gastric cancer in an Asian population. PLoS One 2022; 17:e0266805. [PMID: 35421138 PMCID: PMC9009599 DOI: 10.1371/journal.pone.0266805] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/28/2021] [Accepted: 03/29/2022] [Indexed: 12/24/2022]
Abstract
Objective
Gastric cancer (GC) is one of the most common tumour diseases worldwide and has poor survival, especially in the Asian population. Exploration based on biomarkers would be efficient for better diagnosis, prediction, and targeted therapy.
Methods
Expression profiles were downloaded from the Gene Expression Omnibus (GEO) database. Survival-related genes were identified by gene set enrichment analysis (GSEA) and univariate Cox regression. Then, we applied a Bayesian hierarchical lasso Cox model for prognostic signature screening. Protein-protein interaction and Spearman analyses were performed. Kaplan–Meier and receiver operating characteristic (ROC) curve analyses were applied to evaluate the prediction performance. Multivariate Cox regression was used to identify prognostic factors, and a prognostic nomogram was constructed for clinical application.
Results
With the Bayesian lasso Cox model, a 9-gene signature comprising TNFRSF11A, NMNAT1, EIF5A, NOTCH3, TOR2A, E2F8, PSMA5, TPMT, and KIF11 was established to predict overall survival in GC. Protein-protein interaction analysis indicated that E2F8 was likely related to KIF11. Kaplan–Meier analysis showed a significant difference between the high-risk and low-risk groups (P<0.001). Multivariate analysis demonstrated that the 9-gene signature was an independent predictor (HR = 2.609, 95% CI 2.017–3.370), and the C-index of the integrative model reached 0.75. Functional enrichment analysis for the different risk groups revealed the most significantly enriched pathways/terms, including pyrimidine metabolism and respiratory electron transport chain.
Conclusion
Our findings suggest that a novel prognostic model based on a 9-gene signature can identify high-risk GC patients and improve prediction performance. We hope our model can provide a reference for risk classification and clinical decision-making.
Affiliation(s)
- Jiadong Chu
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Na Sun
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Wei Hu
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Xuanli Chen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Nengjun Yi
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, Alabama, United States of America
| | - Yueping Shen
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
|
14
|
Yan D, Cai S, Bai L, Du Z, Li H, Sun P, Cao J, Yi N, Liu SB, Tang Z. Integration of immune and hypoxia gene signatures improves the prediction of radiosensitivity in breast cancer. Am J Cancer Res 2022; 12:1222-1240. [PMID: 35411250 PMCID: PMC8984882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2021] [Accepted: 02/22/2022] [Indexed: 06/14/2023] Open
Abstract
Immunity and hypoxia are two important factors that affect the response of cancer patients to radiotherapy. At the same time, considering the limited predictive value of a single predictive model and the uncertainty of grouping patients near the cutoff value, we developed and validated a combined model based on immune- and hypoxia-related gene expression profiles to predict the radiosensitivity of breast cancer patients. This study was based on breast cancer data from The Cancer Genome Atlas (TCGA). Spike-and-slab Lasso regression analysis was performed to select three immune-related genes and develop a radiosensitivity model. Lasso Cox regression modeling selected 11 hypoxia-related genes for development of a radiosensitivity model. Three independent datasets (Molecular Taxonomy of Breast Cancer International Consortium [METABRIC], E-TABM-158, GSE103746) were used to validate the predictive value of the radiosensitivity signatures. In the TCGA dataset, the 10-year survival probabilities of the immune radioresistant (IRR) and hypoxia radioresistant (HRR) groups were 0.189 (0.037, 0.973) and 0.477 (0.293, 0.776), respectively. The 10-year survival probabilities of the immune radiosensitive (IRS) and hypoxia radiosensitive (HRS) groups were 0.778 (0.676, 0.895) and 0.824 (0.723, 0.939), respectively. Based on these two gene signatures, we further constructed a combined model and divided all patients into three groups (IRS/HRS, mixed, IRR/HRR). We identified the IRS/HRS patients most likely to benefit from radiotherapy; the 10-year survival probability was 0.886 (0.806, 0.976). The 10-year survival probability of the IRR/HRR group was 0. In conclusion, a combined model integrating immune- and hypoxia-related gene signatures could effectively predict the radiosensitivity of breast cancer and more accurately identify radiosensitive and radioresistant patients than a single model.
Affiliation(s)
- Derui Yan
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
- Suzhou Key Laboratory of Medical Biotechnology, Suzhou Vocational Health College, Suzhou 215009, Jiangsu, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
| | - Shang Cai
- Department of Radiotherapy & Oncology, The Second Affiliated Hospital of Soochow University, Suzhou 215004, Jiangsu, China
| | - Lu Bai
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
- Suzhou Key Laboratory of Medical Biotechnology, Suzhou Vocational Health College, Suzhou 215009, Jiangsu, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
| | - Zixuan Du
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
| | - Huijun Li
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
| | - Peng Sun
- Department of Otolaryngology, The First Affiliated Hospital of Soochow University, Suzhou 215006, Jiangsu, China
| | - Jianping Cao
- School of Radiation Medicine and Protection and Collaborative Innovation Center of Radiation Medicine of Jiangsu Higher Education Institutions, Soochow University, Suzhou 215031, Jiangsu, China
| | - Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| | - Song-Bai Liu
- Suzhou Key Laboratory of Medical Biotechnology, Suzhou Vocational Health College, Suzhou 215009, Jiangsu, China
| | - Zaixiang Tang
- Department of Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
- Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, Medical College of Soochow University, Suzhou 215123, Jiangsu, China
|
15
|
Alkindi KM, Mukherjee K, Pandey M, Arora A, Janizadeh S, Pham QB, Anh DT, Ahmadi K. Prediction of groundwater nitrate concentration in a semiarid region using hybrid Bayesian artificial intelligence approaches. Environmental Science and Pollution Research International 2022; 29:20421-20436. [PMID: 34735705 DOI: 10.1007/s11356-021-17224-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/22/2021] [Accepted: 10/21/2021] [Indexed: 06/13/2023]
Abstract
Nitrate is a major pollutant in groundwater whose main sources are municipal wastewater and agricultural activities. In the present study, Bayesian approaches such as the Bayesian generalized linear model (BGLM), Bayesian regularized neural network (BRNN), Bayesian additive regression tree (BART), and Bayesian ridge regression (BRR) were used to model groundwater nitrate contamination in a semiarid region, the Marvdasht watershed, Fars province, Iran. Eleven groundwater (GW) nitrate conditioning factors were taken as input parameters for predictive modeling. The results showed that the Bayesian models used in this study were all competent to model groundwater nitrate and that the BART model, with R² = 0.83, was more efficient than the other models. The variable importance results showed that potassium (K) has the highest importance in the models, followed by rainfall, altitude, groundwater depth, and distance from the residential area. The results of the study can support the decision-making process to control and reduce the sources of nitrate pollution.
Affiliation(s)
- Khalifa M Alkindi
- UNESCO Chair on Aflaj Studies, Archaeohydrology, University of Nizwa, Nizwa, Oman
| | - Kaustuv Mukherjee
- Department of Geography, Chandidas Mahavidyalaya, Birbhum, WB, 731215, India
| | - Manish Pandey
- University Center for Research & Development (UCRD), Chandigarh University, Mohali, 140413, Punjab, India
- Department of Civil Engineering, University Institute of Engineering, Chandigarh University, Mohali, 140413, Punjab, India
| | - Aman Arora
- Department of Geography, Faculty of Natural Sciences, Jamia Millia Islamia, New Delhi, 10025, Delhi, India
| | - Saeid Janizadeh
- Department of Watershed Management Engineering and Sciences, Faculty in Natural Resources and Marine Science, Tarbiat Modares University, 14115-111, Tehran, Iran
| | - Quoc Bao Pham
- Institute of Applied Technology, Thu Dau Mot University, Binh Duong Province, Vietnam
| | - Duong Tran Anh
- Ho Chi Minh City University of Technology (HUTECH), 475A Dien Bien Phu, Ward 25, Binh Thanh District, Ho Chi Minh City, Vietnam
| | - Kourosh Ahmadi
- Department of Forestry, Faculty in Natural Resources and Marine Science, Tarbiat Modares University, 14115-111, Tehran, Iran
|
16
|
Chu J, Sun NA, Hu W, Chen X, Yi N, Shen Y. The Application of Bayesian Methods in Cancer Prognosis and Prediction. Cancer Genomics Proteomics 2022; 19:1-11. [PMID: 34949654 PMCID: PMC8717957 DOI: 10.21873/cgp.20298] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/18/2021] [Revised: 11/24/2021] [Accepted: 11/30/2021] [Indexed: 11/10/2022] Open
Abstract
With the development of high-throughput biological techniques, high-dimensional omics data have emerged. These molecular data provide a solid foundation for precision medicine and prognostic prediction of cancer. Bayesian methods help construct prognostic models that capture complex relationships in omics data and improve performance by introducing different prior distributions, which makes them suitable for modelling the high-dimensional data involved. Using different omics, several Bayesian hierarchical approaches have been proposed for variable selection and model construction. In particular, Bayesian methods for multi-omics integration have been steadily proposed in recent years. Compared with single-omics modelling, multi-omics integration modelling will contribute to improving predictive performance, gaining insights into the underlying mechanisms of tumour occurrence and development, and discovering more reliable biomarkers. In this work, we present a review of currently proposed Bayesian approaches to prognostic prediction modelling in cancer.
Affiliation(s)
- Jiadong Chu
- Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - N A Sun
- Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Wei Hu
- Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Xuanli Chen
- Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
| | - Nengjun Yi
- Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, U.S.A
| | - Yueping Shen
- Department of Epidemiology and Biostatistics, School of Public Health, Medical College of Soochow University, Suzhou, P.R. China
|
17
|
Maity AK, Lee SC, Hu L, Bell-Pedersen D, Mallick BK, Sarkar TR. Circadian Gene Selection for Time-to-event Phenotype by Integrating CNV and RNAseq Data. Chemometrics and Intelligent Laboratory Systems: An International Journal Sponsored by the Chemometrics Society 2021; 212:104276. [PMID: 35068632 PMCID: PMC8775911 DOI: 10.1016/j.chemolab.2021.104276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
BACKGROUND The endogenous circadian clock, which controls daily rhythms in the expression of at least half of the mammalian genome, has a major influence on cell physiology. Consequently, disruption of the circadian system is associated with a wide range of diseases including cancer. While several circadian clock genes have been associated with cancer progression, little is known about survival when two or more platforms are considered together. Our goal was to determine if survival outcomes are associated with circadian clock function. To accomplish this goal, we developed a Bayesian hierarchical survival model coupled with the global local shrinkage prior and applied this model to available RNASeq and Copy Number Variation data to select significant circadian genes associated with cancer progression. RESULTS Using a Bayesian shrinkage approach with the Bayesian accelerated failure time (AFT) model, we showed that the circadian clock associated gene DEC1 is positively correlated with survival outcome in breast cancer patients. The R package circgene implementing the methodology is available at https://github.com/MAITYA02/circgene. CONCLUSIONS The proposed Bayesian hierarchical model is the first shrinkage prior based model of its kind that integrates two omics platforms to identify the significant circadian gene for cancer survival.
Affiliation(s)
- Arnab Kumar Maity
- Early Clinical Development Oncology Statistics, Pfizer Inc., 10777 Science Center Drive, 92121 San Diego, USA
| | - Sang Chan Lee
- Department of Statistics, Texas A&M University, 3143 TAMU, 77843 College Station, USA
| | - Linhan Hu
- Department of Statistics, Texas A&M University, 3143 TAMU, 77843 College Station, USA
| | | | - Bani K. Mallick
- Department of Statistics, Texas A&M University, 3143 TAMU, 77843 College Station, USA
| | - Tapasree Roy Sarkar
- Department of Statistics, Texas A&M University, 3143 TAMU, 77843 College Station, USA
- Department of Biology, Texas A&M University, 3258 TAMU, 77843 College Station, USA
|
18
|
Basak P, Linero A, Sinha D, Lipsitz S. Semiparametric analysis of clustered interval-censored survival data using soft Bayesian additive regression trees (SBART). Biometrics 2021; 78:880-893. [PMID: 33864633 DOI: 10.1111/biom.13478] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/04/2020] [Revised: 03/10/2021] [Accepted: 04/01/2021] [Indexed: 11/30/2022]
Abstract
Popular parametric and semiparametric hazards regression models for clustered survival data are inappropriate and inadequate when the unknown effects of different covariates and clustering are complex. This calls for a flexible modeling framework to yield efficient survival prediction. Moreover, for some survival studies involving time to occurrence of some asymptomatic events, survival times are typically interval censored between consecutive clinical inspections. In this article, we propose a robust semiparametric model for clustered interval-censored survival data under a paradigm of Bayesian ensemble learning, called soft Bayesian additive regression trees or SBART (Linero and Yang, 2018), which combines multiple sparse (soft) decision trees to attain excellent predictive accuracy. We develop a novel semiparametric hazards regression model by modeling the hazard function as a product of a parametric baseline hazard function and a nonparametric component that uses SBART to incorporate clustering, unknown functional forms of the main effects, and interaction effects of various covariates. In addition to being applicable for left-censored, right-censored, and interval-censored survival data, our methodology is implemented using a data augmentation scheme which allows for existing Bayesian backfitting algorithms to be used. We illustrate the practical implementation and advantages of our method via simulation studies and an analysis of a prostate cancer surgery study where dependence on the experience and skill level of the physicians leads to clustering of survival times. We conclude by discussing our method's applicability in studies involving high-dimensional data with complex underlying associations.
|
19
|
Spanbauer C, Sparapani R. Nonparametric machine learning for precision medicine with longitudinal clinical trials and Bayesian additive regression trees with mixed models. Stat Med 2021; 40:2665-2691. [PMID: 33751659 DOI: 10.1002/sim.8924] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/16/2020] [Revised: 12/14/2020] [Accepted: 02/07/2021] [Indexed: 11/11/2022]
Abstract
Precision medicine is an active area of research that could offer an analytic paradigm shift for clinical trials and the subsequent treatment decisions based on them. Clinical trials are typically analyzed with the intent of discovering beneficial treatments if the same treatment is applied to the entire population under study. However, such a treatment strategy could be suboptimal if subsets of the population exhibit varying treatment effects. Identifying subsets of the population experiencing differential treatment effect and forming individualized treatment rules is a task well-suited to modern machine learning methods such as tree-based ensemble predictive models. Specifically, Bayesian additive regression trees (BART) has shown promise in this regard because of its exceptional performance in out-of-sample prediction. Due to the unique inferential needs of precision medicine for clinical trials, we have proposed the BART extensions explicated here. We incorporate random effects for longitudinal repeated measures and subject clustering within medical centers. We also add a novel interaction detection prior to identify treatment heterogeneity among clinical trial patients and its association with patient characteristics. These extensions are unified under a framework that we call mixedBART. Simulation studies and applications of precision medicine based on real randomized clinical trials data examples are presented.
Affiliation(s)
- Charles Spanbauer
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
| | - Rodney Sparapani
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, Wisconsin, USA
|
20
|
Yu X, Yang Q, Wang D, Li Z, Chen N, Kong DX. Predicting lung adenocarcinoma disease progression using methylation-correlated blocks and ensemble machine learning classifiers. PeerJ 2021; 9:e10884. [PMID: 33628643 PMCID: PMC7894106 DOI: 10.7717/peerj.10884] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/02/2020] [Accepted: 01/12/2021] [Indexed: 01/20/2023] Open
Abstract
Applying the knowledge that methyltransferases and demethylases can modify adjacent cytosine-phosphate-guanine (CpG) sites in the same DNA strand, we found that combining multiple CpGs into a single block may improve cancer diagnosis. However, survival prediction remains a challenge. In this study, we developed a pipeline named "stacked ensemble of machine learning models for methylation-correlated blocks" (EnMCB) that combined Cox regression, support vector regression (SVR), and elastic-net models to construct signatures based on DNA methylation-correlated blocks for lung adenocarcinoma (LUAD) survival prediction. We used methylation profiles from The Cancer Genome Atlas (TCGA) as the training set, and profiles from the Gene Expression Omnibus (GEO) as validation and testing sets. First, we partitioned the genome into blocks of tightly co-methylated CpG sites, which we termed methylation-correlated blocks (MCBs). After partitioning and feature selection, we observed different diagnostic capacities for predicting patient survival across the models. We combined the multiple models into a single stacking ensemble model. The stacking ensemble model based on the top-ranked block had an area under the receiver operating characteristic curve of 0.622 in the TCGA training set, 0.773 in the validation set, and 0.698 in the testing set. When stratified by clinicopathological risk factors, the risk score predicted by the top-ranked MCB was an independent prognostic factor. Our results showed that our pipeline was a reliable tool that may facilitate MCB selection and survival prediction.
Affiliation(s)
- Xin Yu
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, Hubei, China
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Qian Yang
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Dong Wang
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, Hubei, China
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Zhaoyang Li
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - Nianhang Chen
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
| | - De-Xin Kong
- State Key Laboratory of Agricultural Microbiology, Huazhong Agricultural University, Wuhan, Hubei, China
- Agricultural Bioinformatics Key Laboratory of Hubei Province, College of Informatics, Huazhong Agricultural University, Wuhan, Hubei, China
|
21
|
Maity AK, Carroll RJ, Mallick BK. Integration of Survival and Binary Data for Variable Selection and Prediction: A Bayesian Approach. J R Stat Soc Ser C Appl Stat 2020; 68:1577-1595. [PMID: 33311813 DOI: 10.1111/rssc.12377] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/26/2022]
Abstract
We consider the problem where the data consist of a survival time and a binary outcome measurement for each individual, as well as corresponding predictors. The goal is to select the common set of predictors that affect both responses, not just one of them. In addition, we develop a survival prediction model based on data integration. This article is motivated by The Cancer Genome Atlas (TCGA) databank, which is currently the largest genomics and transcriptomics database. The data contain cancer survival information along with cancer stages for each patient. Furthermore, they contain Reverse-phase Protein Array (RPPA) measurements for each individual, which are the predictors associated with these responses. The biological motivation is to identify the major actionable proteins associated with both survival outcomes and cancer stages. We develop a Bayesian hierarchical model to jointly model the survival time and the classification of the cancer stages. Moreover, to deal with the high dimensionality of the RPPA measurements, we use a shrinkage prior to identify significant proteins. Simulations and TCGA data analysis show that the joint integrated modeling approach improves survival prediction.
Affiliation(s)
- Arnab Kumar Maity
- Early Clinical Development Oncology Statistics, 10777 Science Center Drive, Pfizer Inc., San Diego, CA 92121
| | - Raymond J Carroll
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX, 77843-3143, and School of Mathematical and Physical Sciences, University of Technology, Sydney, Broadway NSW 2007, Australia
| | - Bani K Mallick
- Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX, 77843-3143
|
22
|
Maity AK, Lee SC, Mallick BK, Sarkar TR. Bayesian structural equation modeling in multiple omics data with application to circadian genes. Bioinformatics 2020; 36:3951-3958. [PMID: 32369552 DOI: 10.1093/bioinformatics/btaa286] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 10/18/2019] [Revised: 03/30/2020] [Accepted: 04/27/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION It is well known that the integration among different data-sources is reliable because of its potential of unveiling new functionalities of the genomic expressions, which might be dormant in a single-source analysis. Moreover, different studies have justified the more powerful analyses of multi-platform data. Toward this, in this study, we consider the circadian genes' omics profile, such as copy number changes and RNA-sequence data, along with their survival response. We develop a Bayesian structural equation model coupled with linear regressions and log normal accelerated failure-time regression to integrate the information between these two platforms to predict the survival of the subjects. We place conjugate priors on the regression parameters and derive the Gibbs sampler from their conditional distributions. RESULTS Our extensive simulation study shows that the integrative model provides a better fit to the data than its closest competitor. The analyses of glioblastoma cancer data and the breast cancer data from TCGA, the largest genomics and transcriptomics database, support our findings. AVAILABILITY AND IMPLEMENTATION The developed method is wrapped in an R package available at https://github.com/MAITYA02/semmcmc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Arnab Kumar Maity
- Early Clinical Development Oncology Statistics, Pfizer Inc., San Diego, CA 92121, USA
| | | | | | - Tapasree Roy Sarkar
- Department of Statistics and Department of Biology, Texas A&M University, College Station, TX 77843, USA
|
23
|
Henderson NC, Louis TA, Rosner GL, Varadhan R. Individualized treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models. Biostatistics 2020; 21:50-68. [PMID: 30052809 PMCID: PMC8972560 DOI: 10.1093/biostatistics/kxy028] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 09/22/2017] [Revised: 05/24/2018] [Accepted: 06/14/2018] [Indexed: 09/04/2023] Open
Abstract
Individuals often respond differently to identical treatments, and characterizing such variability in treatment response is an important aim in the practice of personalized medicine. In this article, we describe a nonparametric accelerated failure time model that can be used to analyze heterogeneous treatment effects (HTE) when patient outcomes are time-to-event. By utilizing Bayesian additive regression trees and a mean-constrained Dirichlet process mixture model, our approach offers a flexible model for the regression function while placing few restrictions on the baseline hazard. Our nonparametric method leads to natural estimates of individual treatment effect and has the flexibility to address many major goals of HTE assessment. Moreover, our method requires little user input in terms of model specification for treatment covariate interactions or for tuning parameter selection. Our procedure shows strong predictive performance while also exhibiting good frequentist properties in terms of parameter coverage and mitigation of spurious findings of HTE. We illustrate the merits of our proposed approach with a detailed analysis of two large clinical trials (N = 6769) for the prevention and treatment of congestive heart failure using an angiotensin-converting enzyme inhibitor. The analysis revealed considerable evidence for the presence of HTE in both trials as demonstrated by substantial estimated variation in treatment effect and by high proportions of patients exhibiting strong evidence of having treatment effects which differ from the overall treatment effect.
Affiliation(s)
- Nicholas C Henderson
- Oncology Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, 550 N. Broadway, Suite 1101, Baltimore, MD, USA
| | - Thomas A Louis
- Department of Biostatistics, Johns Hopkins University, 615 North Wolfe Street, Baltimore, MD, USA
| | - Gary L Rosner
- Department of Biostatistics, Johns Hopkins University, 615 North Wolfe Street, Baltimore, MD, USA
- Oncology Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, 550 N. Broadway, Suite 1103, Baltimore, MD, USA
| | - Ravi Varadhan
- Department of Biostatistics, Johns Hopkins University, 615 North Wolfe Street, Baltimore, MD, USA
- Oncology Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, 550 N. Broadway, Suite 1103, Baltimore, MD, USA
|
24
|
Tan YV, Roy J. Bayesian additive regression trees and the General BART model. Stat Med 2019; 38:5048-5069. [PMID: 31460678 PMCID: PMC6800811 DOI: 10.1002/sim.8347] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 01/22/2019] [Revised: 07/05/2019] [Accepted: 07/23/2019] [Indexed: 11/06/2022]
Abstract
Bayesian additive regression trees (BART) is a flexible prediction model/machine learning approach that has gained widespread popularity in recent years. As BART becomes more mainstream, there is an increased need for a paper that walks readers through the details of BART, from what it is to why it works. This tutorial is aimed at providing such a resource. In addition to explaining the different components of BART using simple examples, we also discuss a framework, the General BART model that unifies some of the recent BART extensions, including semiparametric models, correlated outcomes, and statistical matching problems in surveys, and models with weaker distributional assumptions. By showing how these models fit into a single framework, we hope to demonstrate a simple way of applying BART to research problems that go beyond the original independent continuous or binary outcomes framework.
Affiliation(s)
- Yaoyuan Vincent Tan
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, 683 Hoes Lane West, Piscataway, New Jersey 08854, USA
| | - Jason Roy
- Department of Biostatistics and Epidemiology, Rutgers School of Public Health, 683 Hoes Lane West, Piscataway, New Jersey 08854, USA
|
25
|
Maity AK, Bhattacharya A, Mallick BK, Baladandayuthapani V. Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics 2019; 76:316-325. [PMID: 31393003 DOI: 10.1111/biom.13132] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 10/03/2018] [Accepted: 07/19/2019] [Indexed: 12/20/2022]
Abstract
Accurate prognostic prediction using molecular information is a challenging area of research, which is essential to develop precision medicine. In this paper, we develop translational models to identify major actionable proteins that are associated with clinical outcomes, like the survival time of patients. There are considerable statistical and computational challenges due to the large dimension of the problems. Furthermore, data are available for different tumor types; hence data integration for various tumors is desirable. Having censored survival outcomes adds another level of complexity to the inferential procedure. We develop Bayesian hierarchical survival models, which accommodate all the challenges mentioned here. We use the hierarchical Bayesian accelerated failure time model for survival regression. Furthermore, we assume a sparse horseshoe prior distribution for the regression coefficients to identify the major proteomic drivers. We borrow strength across tumor groups by introducing a correlation structure among the prior distributions. The proposed methods have been used to analyze data from the recently curated "The Cancer Proteome Atlas" (TCPA), which contains reverse-phase protein arrays-based high-quality protein expression data as well as detailed clinical annotation, including survival times. Our simulation and the TCPA data analysis illustrate the efficacy of the proposed integrative model, which links different tumors with the correlated prior structures.
Affiliation(s)
- Arnab Kumar Maity
- Early Clinical Development Oncology Statistics, Pfizer Inc., San Diego, California
- Bani K Mallick
- Department of Statistics, Texas A&M University, College Station, Texas
|
26
|
Nethery RC, Mealli F, Dominici F. Estimating population average causal effects in the presence of non-overlap: the effect of natural gas compressor station exposure on cancer mortality. Ann Appl Stat 2019; 13:1242-1267. [PMID: 31346355 PMCID: PMC6658123 DOI: 10.1214/18-aoas1231] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 12/29/2022]
Abstract
Most causal inference studies rely on the assumption of overlap to estimate population or sample average causal effects. When data suffer from non-overlap, estimation of these estimands requires reliance on model specifications, due to poor data support. All existing methods to address non-overlap, such as trimming or down-weighting data in regions of poor data support, change the estimand so that inference cannot be made on the sample or the underlying population. In environmental health research settings, where study results are often intended to influence policy, population-level inference may be critical, and changes in the estimand can diminish the impact of the study results, because estimates may not be representative of effects in the population of interest to policymakers. Researchers may be willing to make additional, minimal modeling assumptions in order to preserve the ability to estimate population average causal effects. We seek to make two contributions on this topic. First, we propose a flexible, data-driven definition of propensity score overlap and non-overlap regions. Second, we develop a novel Bayesian framework to estimate population average causal effects with minor model dependence and appropriately large uncertainties in the presence of non-overlap and causal effect heterogeneity. In this approach, the tasks of estimating causal effects in the overlap and non-overlap regions are delegated to two distinct models, suited to the degree of data support in each region. Tree ensembles are used to non-parametrically estimate individual causal effects in the overlap region, where the data can speak for themselves. In the non-overlap region, where insufficient data support means reliance on model specification is necessary, individual causal effects are estimated by extrapolating trends from the overlap region via a spline model. The promising performance of our method is demonstrated in simulations. 
Finally, we utilize our method to perform a novel investigation of the causal effect of natural gas compressor station exposure on cancer outcomes. Code and data to implement the method and reproduce all simulations and analyses are available on GitHub (https://github.com/rachelnethery/overlap).
Affiliation(s)
- Rachel C Nethery
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
- Fabrizia Mealli
- Department of Statistics, Informatics, Applications, University of Florence, Florence, Italy
- Francesca Dominici
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
|
27
|
Hsu JBK, Chang TH, Lee GA, Lee TY, Chen CY. Identification of potential biomarkers related to glioma survival by gene expression profile analysis. BMC Med Genomics 2019; 11:34. [PMID: 30894197 PMCID: PMC7402580 DOI: 10.1186/s12920-019-0479-6] [Citation(s) in RCA: 40] [Impact Index Per Article: 6.7] [Received: 05/26/2018] [Accepted: 02/06/2019] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Recent studies have proposed several gene signatures as biomarkers for different grades of gliomas from various perspectives. However, most of these genes can only be used appropriately for patients with specific grades of gliomas. METHODS In this study, we aimed to identify survival-relevant genes shared between glioblastoma multiforme (GBM) and lower-grade glioma (LGG), which could be used as potential biomarkers to classify patients into different risk groups. Cox proportional hazard regression model (Cox model) was used to extract relative genes, and effectiveness of genes was estimated against random forest regression. Finally, risk models were constructed with logistic regression. RESULTS We identified 104 key genes that were shared between GBM and LGG, which could be significantly correlated with patients' survival based on next-generation sequencing data obtained from The Cancer Genome Atlas for gene expression analysis. The effectiveness of these genes in the survival prediction of GBM and LGG was evaluated, and the average receiver operating characteristic curve (ROC) area under the curve values ranged from 0.7 to 0.8. Gene set enrichment analysis revealed that these genes were involved in eight significant pathways and 23 molecular functions. Moreover, the expressions of ten (CTSZ, EFEMP2, ITGA5, KDELR2, MDK, MICALL2, MAP2K3, PLAUR, SERPINE1, and SOCS3) of these genes were significantly higher in GBM than in LGG, and comparing their expression levels to those of the proposed control genes (TBP, IPO8, and SDHA) could have the potential capability to classify patients into high- and low-risk groups, which differ significantly in the overall survival. Signatures of candidate genes were validated, by multiple microarray datasets from Gene Expression Omnibus, to increase the robustness of using these potential prognostic factors.
In both the GBM and LGG cohort study, most of the patients in the high-risk group had the IDH1 wild-type gene, and those in the low-risk group had IDH1 mutations. Moreover, most of the high-risk patients with LGG possessed a 1p/19q-noncodeletion. CONCLUSION In this study, we identified survival relevant genes which were shared between GBM and LGG, and those enabled to classify patients into high- and low-risk groups based on expression level analysis. Both the risk groups could be correlated with the well-known genetic variants, thus suggesting their potential prognostic value in clinical application.
Affiliation(s)
- Justin Bo-Kai Hsu
- Department of Medical Research, Taipei Medical University Hospital, Taipei, 110, Taiwan
- Tzu-Hao Chang
- Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, 110, Taiwan
- Gilbert Aaron Lee
- Department of Medical Research, Taipei Medical University Hospital, Taipei, 110, Taiwan
- Tzong-Yi Lee
- Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, 518172, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 518172, China; School of Life and Health Science, The Chinese University of Hong Kong, Shenzhen, 518172, China
- Cheng-Yu Chen
- Research Center of Translational Imaging, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; Department of Radiology, School of Medicine, College of Medicine, Taipei Medical University, Taipei, 110, Taiwan; Department of Medical Imaging and Imaging Research Center, Taipei Medical University Hospital, Taipei Medical University, Taipei, 110, Taiwan; Department of Radiology, Tri-Service General Hospital, Taipei, 114, Taiwan; Department of Radiology, National Defense Medical Center, Taipei, 114, Taiwan
|
28
|
Risser MD, Calder CA, Berrocal VJ, Berrett C. Nonstationary spatial prediction of soil organic carbon: implications for stock assessment decision making. Ann Appl Stat 2019; 13:165-188. [PMID: 39220174 PMCID: PMC11364347 DOI: 10.1214/18-aoas1204] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 09/04/2024]
Abstract
The Rapid Carbon Assessment (RaCA) project was conducted by the US Department of Agriculture's Natural Resources Conservation Service between 2010 and 2012 in order to provide contemporaneous measurements of soil organic carbon (SOC) across the US. Despite the broad extent of the RaCA data collection effort, direct observations of SOC are not available at the high spatial resolution needed for studying carbon storage in soil and its implications for important problems in climate science and agriculture. As a result, there is a need for predicting SOC at spatial locations not included as part of the RaCA project. In this paper, we compare spatial prediction of SOC using a subset of the RaCA data for a variety of statistical methods. We investigate the performance of methods with off-the-shelf software available (both stationary and nonstationary) as well as a novel nonstationary approach based on partitioning relevant spatially-varying covariate processes. Our new method addresses open questions regarding (1) how to partition the spatial domain for segmentation-based nonstationary methods, (2) incorporating partially observed covariates into a spatial model, and (3) accounting for uncertainty in the partitioning. In applying the various statistical methods we find that there are minimal differences in out-of-sample criteria for this particular data set; however, there are major differences in maps of uncertainty in SOC predictions. We argue that the spatially-varying measures of prediction uncertainty produced by our new approach are valuable to decision makers, as they can be used to better benchmark mechanistic models, identify target areas for soil restoration projects, and inform carbon sequestration projects.
|
29
|
Bellot A, van der Schaar M. A Hierarchical Bayesian Model for Personalized Survival Predictions. IEEE J Biomed Health Inform 2019; 23:72-80. [DOI: 10.1109/jbhi.2018.2832599] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Indexed: 11/07/2022]
|
30
|
Tang Z, Shen Y, Zhang X, Yi N. The spike-and-slab lasso Cox model for survival prediction and associated genes detection. Bioinformatics 2018; 33:2799-2807. [PMID: 28472220 DOI: 10.1093/bioinformatics/btx300] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Received: 02/02/2017] [Accepted: 05/05/2017] [Indexed: 12/20/2022] Open
Abstract
Motivation Large-scale molecular profiling data have offered extraordinary opportunities to improve survival prediction of cancers and other diseases and to detect disease associated genes. However, there are considerable challenges in analyzing large-scale molecular data. Results We propose new Bayesian hierarchical Cox proportional hazards models, called the spike-and-slab lasso Cox, for predicting survival outcomes and detecting associated genes. We also develop an efficient algorithm to fit the proposed models by incorporating Expectation-Maximization steps into the extremely fast cyclic coordinate descent algorithm. The performance of the proposed method is assessed via extensive simulations and compared with the lasso Cox regression. We demonstrate the proposed procedure on two cancer datasets with censored survival outcomes and thousands of molecular features. Our analyses suggest that the proposed procedure can generate powerful prognostic models for predicting cancer survival and can detect associated genes. Availability and implementation The methods have been implemented in a freely available R package BhGLM ( http://www.ssg.uab.edu/bhglm/ ). Contact nyi@uab.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zaixiang Tang
- Department of Biostatistics, School of Public Health; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China; Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Yueping Shen
- Department of Biostatistics, School of Public Health; Jiangsu Key Laboratory of Preventive and Translational Medicine for Geriatric Diseases, and Center for Genetic Epidemiology and Genomics, Medical College of Soochow University, Suzhou 215123, China
- Xinyan Zhang
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL 35294, USA
|
31
|
Cui Y, Li B, Li R. Decentralized Learning Framework of Meta-Survival Analysis for Developing Robust Prognostic Signatures. JCO Clin Cancer Inform 2017; 1:1-13. [PMID: 30657395 PMCID: PMC6873986 DOI: 10.1200/cci.17.00077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/20/2022] Open
Abstract
PURPOSE A significant hurdle in developing reliable gene expression-based prognostic models has been the limited sample size, which can cause overfitting and false discovery. Combining data from multiple studies can enhance statistical power and reduce spurious findings, but how to address the biologic heterogeneity across different datasets remains a major challenge. Better meta-survival analysis approaches are needed. MATERIAL AND METHODS We presented a decentralized learning framework for meta-survival analysis without the need for data aggregation. Our method consisted of a series of proposals that together alleviated the influence of data heterogeneity and improved the performance of survival prediction. First, we transformed the gene expression profile of every sample into normalized percentile ranks to obtain platform-agnostic features. Second, we used Stouffer's meta-z approach in combination with Harrell's concordance index to prioritize and select genes to be included in the model. Third, we used survival discordance as a scale-independent model loss function. Instead of generating a merged dataset and training the model therein, we avoided comparing patients across datasets and individually evaluated the loss function on each dataset. Finally, we optimized the model by minimizing the joint loss function. RESULTS Through comprehensive evaluation on 31 public microarray datasets containing 6,724 samples of several cancer types, we demonstrated that the proposed method has outperformed (1) single prognostic genes identified using conventional meta-analysis, (2) multigene signatures trained on single datasets, (3) multigene signatures trained on merged datasets as well as by other existing meta-analysis methods, and (4) clinically applicable, established multigene signatures. CONCLUSION The decentralized learning approach can be used to effectively perform meta-analysis of gene expression data and to develop robust multigene prognostic signatures.
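The selection step described in this abstract (percentile-rank features, per-dataset concordance, Stouffer's meta-z) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, ties in the rank transform are broken by input order, and the abstract does not say how a per-dataset concordance index is turned into a z-score, so the z-scores are taken as given inputs.

```python
import math


def percentile_ranks(values):
    """Map one sample's expression values to percentile ranks in (0, 1].

    Assumption: ties are broken by input order (no mid-rank averaging).
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position / len(values)
    return ranks


def concordance_index(times, events, scores):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had an observed event
    and a shorter follow-up time than subject j. A higher score is
    read as higher risk; tied scores count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def stouffer_z(z_scores, weights=None):
    """Combine per-dataset z-scores with Stouffer's (weighted) method."""
    if weights is None:
        weights = [1.0] * len(z_scores)
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return numerator / denominator
```

Evaluating the concordance index separately on each dataset and combining only the resulting statistics mirrors the abstract's point that patients are never compared across datasets.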
Affiliation(s)
- Yi Cui
- Yi Cui, Bailiang Li, and Ruijiang Li, Stanford University School of Medicine, Stanford, CA; Yi Cui, Global Institution for Collaborative Research and Education, Hokkaido University, Sapporo, Japan
- Bailiang Li
- Yi Cui, Bailiang Li, and Ruijiang Li, Stanford University School of Medicine, Stanford, CA; Yi Cui, Global Institution for Collaborative Research and Education, Hokkaido University, Sapporo, Japan
- Ruijiang Li
- Yi Cui, Bailiang Li, and Ruijiang Li, Stanford University School of Medicine, Stanford, CA; Yi Cui, Global Institution for Collaborative Research and Education, Hokkaido University, Sapporo, Japan
|
32
|
Morris JS, Baladandayuthapani V. Statistical Contributions to Bioinformatics: Design, Modeling, Structure Learning, and Integration. Stat Modelling 2017; 17:245-289. [PMID: 29129969 PMCID: PMC5679480 DOI: 10.1177/1471082x17698255] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 12/24/2022]
Abstract
The advent of high-throughput multi-platform genomics technologies providing whole-genome molecular summaries of biological samples has revolutionized biomedical research. These technologies yield highly structured big data, whose analysis poses significant quantitative challenges. The field of Bioinformatics has emerged to deal with these challenges, and comprises many quantitative and biological scientists working together to effectively process these data and extract the treasure trove of information they contain. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts. In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration. In each of these areas, we highlight some key contributions and try to elucidate the key statistical principles underlying these methods and approaches. Our goals are to demonstrate major ways in which statisticians have contributed to bioinformatics, to encourage statisticians to get involved early in methods development as new technologies emerge, and to stimulate future methodological work based on the statistical principles elucidated in this article and utilizing all available information to uncover new biological insights.
Affiliation(s)
- Jeffrey S Morris
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, USA
|
33
|
Kindo BP, Wang H, Peña EA. Multinomial probit Bayesian additive regression trees. Stat (Int Stat Inst) 2016; 5:119-131. [PMID: 27330743 DOI: 10.1002/sta4.110] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
Abstract
This article proposes multinomial probit Bayesian additive regression trees (MPBART) as a multinomial probit extension of BART - Bayesian additive regression trees. MPBART is flexible to allow inclusion of predictors that describe the observed units as well as the available choice alternatives. Through two simulation studies and four real data examples, we show that MPBART exhibits very good predictive performance in comparison to other discrete choice and multiclass classification methods. To implement MPBART, the R package mpbart is freely available from CRAN repositories.
Affiliation(s)
- Bereket P Kindo
- Department of Statistics, University of South Carolina, Columbia, South Carolina, 29208, USA
- Hao Wang
- Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, 48824, USA
- Edsel A Peña
- Department of Statistics, University of South Carolina, Columbia, South Carolina, 29208, USA
|
34
|
Sparapani RA, Logan BR, McCulloch RE, Laud PW. Nonparametric survival analysis using Bayesian Additive Regression Trees (BART). Stat Med 2016; 35:2741-53. [PMID: 26854022 DOI: 10.1002/sim.6893] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 03/18/2015] [Revised: 01/11/2016] [Accepted: 01/12/2016] [Indexed: 11/06/2022]
Abstract
Bayesian additive regression trees (BART) provide a framework for flexible nonparametric modeling of relationships of covariates to outcomes. Recently, BART models have been shown to provide excellent predictive performance, for both continuous and binary outcomes, and exceeding that of its competitors. Software is also readily available for such outcomes. In this article, we introduce modeling that extends the usefulness of BART in medical applications by addressing needs arising in survival analysis. Simulation studies of one-sample and two-sample scenarios, in comparison with long-standing traditional methods, establish face validity of the new approach. We then demonstrate the model's ability to accommodate data from complex regression models with a simulation study of a nonproportional hazards scenario with crossing survival functions and survival function estimation in a scenario where hazards are multiplicatively modified by a highly nonlinear function of the covariates. Using data from a recently published study of patients undergoing hematopoietic stem cell transplantation, we illustrate the use and some advantages of the proposed method in medical investigations. Copyright © 2016 John Wiley & Sons, Ltd.
Affiliation(s)
- Rodney A Sparapani
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, U.S.A
- Brent R Logan
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, U.S.A
- Purushottam W Laud
- Division of Biostatistics, Medical College of Wisconsin, Milwaukee, U.S.A
|
35
|
|
36
|
Zou M, Liu Z, Zhang XS, Wang Y. NCC-AUC: an AUC optimization method to identify multi-biomarker panel for cancer prognosis from genomic and clinical data. Bioinformatics 2015; 31:3330-8. [PMID: 26092859 DOI: 10.1093/bioinformatics/btv374] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 03/09/2015] [Accepted: 06/14/2015] [Indexed: 12/26/2022] Open
Abstract
MOTIVATION In prognosis and survival studies, an important goal is to identify multi-biomarker panels with predictive power using molecular characteristics or clinical observations. Such analysis is often challenged by censored, small-sample-size, but high-dimensional genomic profiles or clinical data. Therefore, sophisticated models and algorithms are in pressing need. RESULTS In this study, we propose a novel Area Under Curve (AUC) optimization method for multi-biomarker panel identification named Nearest Centroid Classifier for AUC optimization (NCC-AUC). Our method is motivated by the connection between AUC score for classification accuracy evaluation and Harrell's concordance index in survival analysis. This connection allows us to convert the survival time regression problem to a binary classification problem. Then an optimization model is formulated to directly maximize AUC and meanwhile minimize the number of selected features to construct a predictor in the nearest centroid classifier framework. NCC-AUC shows its great performance by validating both in genomic data of breast cancer and clinical data of stage IB Non-Small-Cell Lung Cancer (NSCLC). For the genomic data, NCC-AUC outperforms Support Vector Machine (SVM) and Support Vector Machine-based Recursive Feature Elimination (SVM-RFE) in classification accuracy. It tends to select a multi-biomarker panel with low average redundancy and enriched biological meanings. Also NCC-AUC is more significant in separation of low and high risk cohorts than widely used Cox model (Cox proportional-hazards regression model) and L1-Cox model (L1 penalized in Cox model). These performance gains of NCC-AUC are quite robust across 5 subtypes of breast cancer. Further in an independent clinical data, NCC-AUC outperforms SVM and SVM-RFE in predictive accuracy and is consistently better than Cox model and L1-Cox model in grouping patients into high and low risk categories.
CONCLUSION In summary, NCC-AUC provides a rigorous optimization framework to systematically reveal multi-biomarker panel from genomic and clinical data. It can serve as a useful tool to identify prognostic biomarkers for survival analysis. AVAILABILITY AND IMPLEMENTATION NCC-AUC is available at http://doc.aporc.org/wiki/NCC-AUC. CONTACT ywang@amss.ac.cn SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Meng Zou
- Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Zhaoqi Liu
- Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Xiang-Sun Zhang
- Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
- Yong Wang
- Academy of Mathematics and Systems Science, National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences, Beijing 10080, China
|
37
|
Zhou L, Xu Q, Wang H. Rotation survival forest for right censored data. PeerJ 2015; 3:e1009. [PMID: 26082863 PMCID: PMC4465950 DOI: 10.7717/peerj.1009] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 03/24/2015] [Accepted: 05/19/2015] [Indexed: 11/20/2022] Open
Abstract
Recently, survival ensembles have found more and more applications in biological and medical research, where censored time-to-event data are often encountered. In this research, we investigate the plausibility of extending a rotation forest, originally proposed for classification purposes, to survival analysis. Supported by the proper statistical analysis, we show that rotation survival forests are able to outperform the state-of-the-art survival ensembles on right censored data. We also provide a C-index based variable importance measure for evaluating covariates in censored survival data.
Affiliation(s)
- Lifeng Zhou
- School of Mathematics and Statistics, Central South University, China
- Qingsong Xu
- School of Mathematics and Statistics, Central South University, China
- Hong Wang
- School of Mathematics and Statistics, Central South University, China
|
38
|
García V, Salvador Sánchez J. Mapping microarray gene expression data into dissimilarity spaces for tumor classification. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.09.064] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 10/24/2022]
|
39
|
Tiong KL, Chang KC, Yeh KT, Liu TY, Wu JH, Hsieh PH, Lin SH, Lai WY, Hsu YC, Chen JY, Chang JG, Shieh GS. CSNK1E/CTNNB1 are synthetic lethal to TP53 in colorectal cancer and are markers for prognosis. Neoplasia 2014; 16:441-50. [PMID: 24947187 PMCID: PMC4198690 DOI: 10.1016/j.neo.2014.04.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Received: 03/31/2014] [Revised: 04/25/2014] [Accepted: 04/29/2014] [Indexed: 02/03/2023] Open
Abstract
Two genes are called synthetic lethal (SL) if their simultaneous mutations lead to cell death, but each individual mutation does not. Targeting SL partners of mutated cancer genes can kill cancer cells specifically, but leave normal cells intact. We present an integrated approach to uncovering SL pairs in colorectal cancer (CRC). Screening verified SL pairs using microarray gene expression data of cancerous and normal tissues, we first identified potential functionally relevant (simultaneously differentially expressed) gene pairs. From the top-ranked pairs, ~20 genes were chosen for immunohistochemistry (IHC) staining in 171 CRC patients. To find novel SL pairs, all 169 combined pairs from the individual IHC were synergistically correlated to five clinicopathological features, e.g. overall survival. Of the 11 predicted SL pairs, MSH2-POLB and CSNK1E-MYC were consistent with literature, and we validated the top two pairs, CSNK1E-TP53 and CTNNB1-TP53 using RNAi knockdown and small molecule inhibitors of CSNK1E in isogenic HCT-116 and RKO cells. Furthermore, synthetic lethality of CSNK1E and TP53 was verified in mouse model. Importantly, multivariate analysis revealed that CSNK1E-P53, CTNNB1-P53, MSH2-RB1, and BRCA1-WNT5A were independent prognosis markers from stage, with CSNK1E-P53 applicable to early-stage and the remaining three throughout all stages. Our findings suggest that CSNK1E is a promising target for TP53-mutant CRC patients which constitute ~40% to 50% of patients, while to date safety regarding inhibition of TP53 is controversial. Thus the integrated approach is useful in finding novel SL pairs for cancer therapeutics, and it is readily accessible and applicable to other cancers.
Affiliation(s)
- Khong-Loon Tiong
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 115, Taiwan, R.O.C.; Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan, R.O.C.
- Kuo-Ching Chang
- Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, R.O.C.
- Kun-Tu Yeh
- Department of Pathology, Changhua Christian Hospital, Changhua 505, Taiwan, R.O.C.; Department of Pathology, School of Medicine, Chung Shan Medical University, Taichung 402, Taiwan, R.O.C.
- Ting-Yuan Liu
- Graduate Institute of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 807, Taiwan, R.O.C.
- Jia-Hong Wu
- Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, R.O.C.
- Ping-Heng Hsieh
- Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, R.O.C.
- Shu-Hui Lin
- Department of Pathology, Changhua Christian Hospital, Changhua 505, Taiwan, R.O.C.; Jen-Teh Junior College of Medicine, Nursing Management, Miaoli 356, Taiwan, R.O.C.
- Wei-Yun Lai
- Molecular Medicine Program, Taiwan International Graduate Program, Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, R.O.C.; Institute of Biochemistry and Molecular Biology, School of Life Sciences, National Yang-Ming University, Taipei 112, Taiwan, R.O.C.
- Yu-Chin Hsu
- Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, R.O.C.
- Jeou-Yuan Chen
- Institute of Biomedical Sciences, Academia Sinica, Taipei 115, Taiwan, R.O.C.
- Jan-Gowth Chang
- Department of Laboratory Medicine, and Center of RNA Biology and Clinical Application, China Medical University Hospital, China Medical University, Taichung 404, Taiwan, R.O.C.
- Grace S Shieh
- Bioinformatics Program, Taiwan International Graduate Program, Academia Sinica, Taipei 115, Taiwan, R.O.C.; Institute of Statistical Science, Academia Sinica, Taipei 115, Taiwan, R.O.C.
|
40
|
Zhang L, Baladandayuthapani V, Mallick BK, Manyam GC, Thompson PA, Bondy ML, Do KA. Bayesian hierarchical structured variable selection methods with application to MIP studies in breast cancer. J R Stat Soc Ser C Appl Stat 2014; 63:595-620. [PMID: 25705056 DOI: 10.1111/rssc.12053] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Indexed: 11/25/2022]
Abstract
The analysis of alterations that may occur in nature when segments of chromosomes are copied (known as copy number alterations) has been a focus of research to identify genetic markers of cancer. One high-throughput technique recently adopted is the use of molecular inversion probes (MIPs) to measure probe copy number changes. The resulting data consist of high-dimensional copy number profiles that can be used to ascertain probe-specific copy number alterations in correlative studies with patient outcomes to guide risk stratification and future treatment. We propose a novel Bayesian variable selection method, the hierarchical structured variable selection (HSVS) method, which accounts for the natural gene and probe-within-gene architecture to identify important genes and probes associated with clinically relevant outcomes. We propose the HSVS model for grouped variable selection, where simultaneous selection of both groups and within-group variables is of interest. The HSVS model utilizes a discrete mixture prior distribution for group selection and group-specific Bayesian lasso hierarchies for variable selection within groups. We provide methods for accounting for serial correlations within groups that incorporate Bayesian fused lasso methods for within-group selection. Through simulations we establish that our method results in lower model errors than other methods when a natural grouping structure exists. We apply our method to an MIP study of breast cancer and show that it identifies genes and probes that are significantly associated with clinically relevant subtypes of breast cancer.
Affiliation(s)
- Lin Zhang
- Department of Statistics, Texas A&M University, College Station, Texas, U.S.A
| | | | - Bani K Mallick
- Department of Statistics, Texas A&M University, College Station, Texas, U.S.A
| | - Ganiraju C Manyam
- Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
| | | | | | - Kim-Anh Do
- Department of Biostatistics, The University of Texas M.D. Anderson Cancer Center, Houston, Texas, U.S.A
|
41
|
Park C, Ahn J, Kim H, Park S. Integrative gene network construction to analyze cancer recurrence using semi-supervised learning. PLoS One 2014; 9:e86309. [PMID: 24497942 PMCID: PMC3908883 DOI: 10.1371/journal.pone.0086309] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Received: 07/03/2013] [Accepted: 12/09/2013] [Indexed: 12/17/2022] Open
Abstract
Background: The prognosis of cancer recurrence is an important research area in bioinformatics and is challenging due to the small sample sizes compared to the vast number of genes. There have been several attempts to predict cancer recurrence. Most studies employed a supervised approach, which uses only a few labeled samples. Semi-supervised learning can be a great alternative to solve this problem. There have been few attempts based on manifold assumptions to reveal the detailed roles of identified cancer genes in recurrence. Results: In order to predict cancer recurrence, we proposed a novel semi-supervised learning algorithm based on a graph regularization approach. We transformed the gene expression data into a graph structure for semi-supervised learning and integrated protein interaction data with the gene expression data to select functionally-related gene pairs. Then, we predicted the recurrence of cancer by applying a regularization approach to the constructed graph containing both labeled and unlabeled nodes. Conclusions: The average improvement rate of accuracy for three different cancer datasets was 24.9% compared to existing supervised and semi-supervised methods. We performed functional enrichment on the gene networks used for learning. We identified that those gene networks are significantly associated with cancer-recurrence-related biological functions. Our algorithm was developed with standard C++ and is available in Linux and MS Windows formats in the STL library. The executable program is freely available at: http://embio.yonsei.ac.kr/~Park/ssl.php.
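The idea of applying regularization to a graph that contains both labeled and unlabeled nodes can be illustrated with a generic label-spreading iteration. This is a minimal sketch of that general family of techniques, not the authors' C++ algorithm: the function name `propagate_labels`, the toy edge weights, and the parameter `alpha` are all assumptions made for illustration.

```python
import math


def propagate_labels(weights, labels, alpha=0.8, iterations=200):
    """Spread known labels over a weighted graph (generic sketch).

    weights: symmetric matrix of non-negative edge weights (list of lists).
    labels:  +1 / -1 for labeled nodes, 0 for unlabeled nodes.
    alpha:   trade-off between graph smoothness and fidelity to known labels.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    s = [
        [
            weights[i][j] / math.sqrt(degree[i] * degree[j]) if weights[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]
    scores = [float(y) for y in labels]
    # Fixed-point iteration F <- alpha * S * F + (1 - alpha) * Y.
    for _ in range(iterations):
        scores = [
            alpha * sum(s[i][j] * scores[j] for j in range(n)) + (1 - alpha) * labels[i]
            for i in range(n)
        ]
    return scores


# Hypothetical toy graph: two tightly connected triangles joined by one weak edge.
W = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
# Node 0 is labeled +1 (e.g. recurrence), node 5 is labeled -1, the rest are unlabeled.
Y = [1, 0, 0, 0, 0, -1]
scores = propagate_labels(W, Y)
predicted = [1 if f > 0 else -1 for f in scores]
```

In this toy setting the unlabeled nodes take the sign of the labeled node in their own cluster, which is the behavior graph regularization is meant to encourage.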
Affiliation(s)
- Chihyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
| | - Jaegyoon Ahn
- Department of Computer Science, Yonsei University, Seoul, South Korea
| | - Hyunjin Kim
- Department of Computer Science, Yonsei University, Seoul, South Korea
| | - Sanghyun Park
- Department of Computer Science, Yonsei University, Seoul, South Korea
|
42
|
Thamrin SA, McGree JM, Mengersen KL. Modelling survival data to account for model uncertainty: a single model or model averaging? SpringerPlus 2013; 2:665. [PMID: 24386617 PMCID: PMC3877415 DOI: 10.1186/2193-1801-2-665] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 07/02/2013] [Accepted: 11/18/2013] [Indexed: 11/10/2022]
Abstract
This study considered the problem of predicting survival based on three alternative models: a single Weibull, a mixture of Weibulls, and a cure model. Instead of the common procedure of choosing a single "best" model, where "best" is defined in terms of goodness of fit to the data, a Bayesian model averaging (BMA) approach was adopted to account for model uncertainty. This was illustrated using a case study in which the aim was to describe lymphoma cancer survival with covariates given by phenotypes and gene expression. The results of this study indicate that if the sample size is sufficiently large, one of the three models emerges as having the highest probability given the data, as indicated by the goodness-of-fit measure, the Bayesian information criterion (BIC). However, when the sample size was reduced, no single model was revealed as "best", suggesting that a BMA approach would be appropriate. Although a BMA approach can compromise on goodness of fit to the data (when compared to the true model), it can provide robust predictions and facilitate more detailed investigation of the relationships between gene expression and patient survival.
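One common way to turn BIC values into approximate posterior model probabilities, and then into a model-averaged prediction, is sketched below. It assumes equal prior probabilities for the candidate models and uses the standard exp(-BIC/2) approximation; the BIC values, the per-model survival predictions, and the function names are hypothetical and are not taken from the paper.

```python
import math


def bma_weights_from_bic(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes equal prior probability for every candidate model.
    """
    best = min(bics)
    # Subtract the smallest BIC before exponentiating for numerical stability.
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def bma_predict(predictions, weights):
    """Weighted average of per-model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical BIC values for the three candidate survival models.
bics = {"weibull": 412.3, "weibull_mixture": 410.1, "cure": 415.8}
weights = bma_weights_from_bic(list(bics.values()))

# Hypothetical predicted survival probabilities at a fixed time point.
predictions = [0.62, 0.58, 0.66]
averaged = bma_predict(predictions, weights)
```

When one BIC is much smaller than the others, its weight approaches 1 and the average collapses to that single model; when the BIC values are close, the weights are spread out, which is the situation in which the abstract suggests BMA is appropriate.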
Affiliation(s)
- Sri Astuti Thamrin
- Mathematics Department, Hasanuddin University, Jl. Perintis Kemerdekaan Km 10, 90245 Makassar, South Sulawesi Indonesia
| | - James M McGree
- Mathematics Department, Hasanuddin University, Jl. Perintis Kemerdekaan Km 10, 90245 Makassar, South Sulawesi Indonesia
| | - Kerrie L Mengersen
- Mathematics Department, Hasanuddin University, Jl. Perintis Kemerdekaan Km 10, 90245 Makassar, South Sulawesi Indonesia
|
43
|
Abstract
The modeling of gene networks from transcriptional expression data is an important tool in biomedical research to reveal signaling pathways and to identify treatment targets. Current gene network modeling is primarily based on the use of Gaussian graphical models applied to continuous data, which give a closed-form marginal likelihood. In this paper, we extend network modeling to discrete data, specifically data from serial analysis of gene expression and RNA-sequencing experiments, both of which generate counts of mRNA transcripts in cell samples. We propose a generalized linear model to fit the discrete gene expression data and assume that the log ratios of the mean expression levels follow a Gaussian distribution. We restrict the gene network structures to decomposable graphs and derive the graphs by selecting the covariance matrix of the Gaussian distribution with the hyper-inverse Wishart priors. Furthermore, we incorporate prior network models based on gene ontology information, which makes use of existing biological information on the genes of interest. We conduct simulation studies to examine the performance of our discrete graphical model and apply the method to two real datasets for gene network inference.
Affiliation(s)
- Lin Zhang
- Texas A&M University, College Station, TX 77843, USA
|
44
|
Lai Y, Hayashida M, Akutsu T. Survival analysis by penalized regression and matrix factorization. ScientificWorldJournal 2013; 2013:632030. [PMID: 23737722 PMCID: PMC3655687 DOI: 10.1155/2013/632030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/17/2013] [Accepted: 04/03/2013] [Indexed: 11/18/2022] Open
Abstract
Because every disease has its own survival pattern, it is necessary to find a suitable model to simulate follow-ups. DNA microarrays are a useful technique for measuring the expression of thousands of genes at one time and are usually employed to classify different types of cancer. We propose methods that combine penalized regression models with nonnegative matrix factorization (NMF) for predicting survival. We applied L1 (lasso), L2 (ridge), and combined L1-L2 (elastic net) penalized regression to microarray data from diffuse large B-cell lymphoma (DLBCL) patients and found that the combined L1-L2 method predicts survival best, with the smallest logrank P value. Furthermore, 80% of the selected genes have been reported to correlate with carcinogenesis or lymphoma. Through NMF we found that DLBCL patients can be clearly divided into 4 groups, which implies that DLBCL may have 4 subtypes with slightly different survival patterns. Next, we excluded patients whom NMF indicated as hard to classify and ran the three penalized regression models again. We found that survival prediction improved, with lower logrank P values. Therefore, we conclude that after preselection of patients by NMF, penalized regression models can predict DLBCL patients' survival successfully.
Affiliation(s)
- Yeuntyng Lai
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
| | - Morihiro Hayashida
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
| | - Tatsuya Akutsu
- Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan
|
45
|
Wang W, Baladandayuthapani V, Morris JS, Broom BM, Manyam G, Do KA. iBAG: integrative Bayesian analysis of high-dimensional multiplatform genomics data. Bioinformatics 2012; 29:149-59. [PMID: 23142963 PMCID: PMC3546799 DOI: 10.1093/bioinformatics/bts655] [Citation(s) in RCA: 102] [Impact Index Per Article: 7.8] [Indexed: 01/20/2023]
Abstract
Motivation: Analyzing data from multi-platform genomics experiments combined with patients’ clinical outcomes helps us understand the complex biological processes that characterize a disease, as well as how these processes relate to the development of the disease. Current data integration approaches are limited in that they do not consider the fundamental biological relationships that exist among the data obtained from different platforms. Statistical Model: We propose an integrative Bayesian analysis of genomics data (iBAG) framework for identifying important genes/biomarkers that are associated with clinical outcome. This framework uses hierarchical modeling to combine the data obtained from multiple platforms into one model. Results: We assess the performance of our methods using several synthetic and real examples. Simulations show our integrative methods to have higher power to detect disease-related genes than non-integrative methods. Using the Cancer Genome Atlas glioblastoma dataset, we apply the iBAG model to integrate gene expression and methylation data to study their associations with patient survival. Our proposed method discovers multiple methylation-regulated genes that are related to patient survival, most of which have important biological functions in other diseases but have not been previously studied in glioblastoma. Availability: http://odin.mdacc.tmc.edu/~vbaladan/. Contact: veera@mdanderson.org Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Wenting Wang
- Department of Biostatistics, The University of Texas, MD Anderson Cancer Center, Houston, TX 77030, USA
|