1
Huang P, Cai M, Lu X, McKennan C, Wang J. Accurate estimation of rare cell type fractions from tissue omics data via hierarchical deconvolution. bioRxiv 2023:2023.03.15.532820. PMID: 36993280; PMCID: PMC10055056; DOI: 10.1101/2023.03.15.532820.
Abstract
Bulk transcriptomics in tissue samples reflects the average expression levels across different cell types and is highly influenced by cellular fractions. As such, it is critical to estimate cellular fractions both to deconfound differential expression analyses and to infer cell type-specific differential expression. Since experimentally counting cells is infeasible in most tissues and studies, in silico cellular deconvolution methods have been developed as an alternative. However, existing methods are designed for tissues consisting of clearly distinguishable cell types and have difficulty estimating highly correlated or rare cell types. To address this challenge, we propose Hierarchical Deconvolution (HiDecon), which uses single-cell RNA sequencing references and a hierarchical cell type tree, modeling the similarities among cell types and cell differentiation relationships, to estimate cellular fractions in bulk data. By coordinating cell fractions across layers of the hierarchical tree, cellular fraction information is passed up and down the tree, which helps correct estimation biases by pooling information across related cell types. The flexible hierarchical tree structure also enables estimating rare cell fractions by splitting the tree to higher resolutions. Through simulations and real data applications with measured cellular fractions as ground truth, we demonstrate that HiDecon significantly outperforms existing methods and accurately estimates cellular fractions.
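As background to reference-based deconvolution (a generic baseline, not the HiDecon algorithm itself), cellular fractions can be estimated by regressing a bulk profile on a cell-type signature matrix under non-negativity and renormalizing. A minimal Python sketch with toy, illustrative numbers:

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

# Toy signature matrix: 50 marker genes x 3 cell types (illustrative values,
# not a real single-cell reference).
S = rng.gamma(2.0, 1.0, size=(50, 3))
true_frac = np.array([0.6, 0.3, 0.1])             # ground-truth fractions
bulk = S @ true_frac + rng.normal(0.0, 0.01, 50)  # simulated bulk profile

coef, _ = nnls(S, bulk)     # non-negative least squares fit
frac = coef / coef.sum()    # renormalise so fractions sum to one
print(np.round(frac, 2))
```

With clearly distinguishable (uncorrelated) signatures this recovers the fractions well; the abstract's point is precisely that correlated or rare cell types break such flat estimators, motivating the hierarchical tree.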
Affiliation(s)
- Penghui Huang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Manqi Cai
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- Xinghua Lu
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Chris McKennan
- Department of Statistics, University of Pittsburgh, Pittsburgh, PA, USA
- Jiebiao Wang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
2
Calle ML, Pujolassos M, Susin A. coda4microbiome: compositional data analysis for microbiome cross-sectional and longitudinal studies. BMC Bioinformatics 2023; 24:82. PMID: 36879227; PMCID: PMC9990256; DOI: 10.1186/s12859-023-05205-3.
Abstract
BACKGROUND One of the main challenges of microbiome analysis is its compositional nature, which, if ignored, can lead to spurious results. Addressing the compositional structure of microbiome data is particularly critical in longitudinal studies, where abundances measured at different times can correspond to different sub-compositions. RESULTS We developed coda4microbiome, a new R package for analyzing microbiome data within the Compositional Data Analysis (CoDA) framework in both cross-sectional and longitudinal studies. The aim of coda4microbiome is prediction: the method is designed to identify a model (microbial signature) containing the minimum number of features with the maximum predictive power. The algorithm relies on the analysis of log-ratios between pairs of components, and variable selection is addressed through penalized regression on the "all-pairs log-ratio model", the model containing all possible pairwise log-ratios. For longitudinal data, the algorithm infers dynamic microbial signatures by performing penalized regression over a summary of the log-ratio trajectories (the area under these trajectories). In both cross-sectional and longitudinal studies, the inferred microbial signature is expressed as the (weighted) balance between two groups of taxa: those that contribute positively to the microbial signature and those that contribute negatively. The package provides several graphical representations that facilitate interpretation of the analysis and of the identified microbial signatures. We illustrate the new method with data from a Crohn's disease study (cross-sectional) and from the developing microbiome of infants (longitudinal). CONCLUSIONS coda4microbiome is a new algorithm for the identification of microbial signatures in both cross-sectional and longitudinal studies. The algorithm is implemented as an R package, available on CRAN ( https://cran.r-project.org/web/packages/coda4microbiome/ ) and accompanied by a vignette with a detailed description of the functions. The project website contains several tutorials: https://malucalle.github.io/coda4microbiome/.
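The "all-pairs log-ratio model" can be sketched outside R as well. The Python fragment below (toy data; coda4microbiome itself is an R package) builds the pairwise log-ratio design matrix on which the penalized regression would then run:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
counts = rng.gamma(2.0, 1.0, size=(20, 5))         # toy abundances: 20 samples x 5 taxa
comp = counts / counts.sum(axis=1, keepdims=True)  # closure: each row sums to one

# All-pairs log-ratio design matrix: one column per taxon pair (i, j).
pairs = list(combinations(range(comp.shape[1]), 2))
Z = np.column_stack([np.log(comp[:, i] / comp[:, j]) for i, j in pairs])
print(Z.shape)  # C(5, 2) = 10 pairwise log-ratio columns
```

Log-ratios are invariant to the closure operation, which is what makes penalized regression on `Z` respect the compositional structure.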
Affiliation(s)
- M Luz Calle
- Biosciences Department, Faculty of Sciences, Technology and Engineering, University of Vic - Central University of Catalonia, Carrer de La Laura, 13, 08500, Vic, Spain
- Meritxell Pujolassos
- Biosciences Department, Faculty of Sciences, Technology and Engineering, University of Vic - Central University of Catalonia, Carrer de La Laura, 13, 08500, Vic, Spain
- Antoni Susin
- Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain
3
Xia X, Zhang Y, Wei Y, Wang MH. Statistical Methods for Disease Risk Prediction with Genotype Data. Methods Mol Biol 2023; 2629:331-347. PMID: 36929084; DOI: 10.1007/978-1-0716-2986-4_15.
Abstract
Single-nucleotide polymorphisms (SNPs) are the basic units for understanding the heritability of complex traits. One attractive application of susceptibility SNPs is to construct prediction models for assessing disease risk. Here, we introduce prediction methods for human traits using SNP data, including polygenic risk scores (PRS), linear mixed models (LMMs), penalized regressions, and methods for controlling population stratification.
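A polygenic risk score is, at its core, a weighted sum of effect-allele counts. A minimal sketch with simulated genotypes and arbitrary illustrative weights (real PRS weights come from GWAS summary statistics):

```python
import numpy as np

rng = np.random.default_rng(2)
genotypes = rng.integers(0, 3, size=(100, 8))  # 100 individuals x 8 SNPs, coded 0/1/2
weights = rng.normal(0.0, 0.2, size=8)         # illustrative per-SNP effect sizes
                                               # (real weights: GWAS log-odds ratios)

prs = genotypes @ weights                      # one risk score per individual
print(prs.shape)
```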
Affiliation(s)
- Xiaoxuan Xia
- JC School of Public Health and Primary Care, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Department of Statistics, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Yingying Wei
- Department of Statistics, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Maggie Haitian Wang
- JC School of Public Health and Primary Care, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- CUHK Shenzhen Institute, Shenzhen, China
4
Jardillier R, Koca D, Chatelain F, Guyon L. Prognosis of lasso-like penalized Cox models with tumor profiling improves prediction over clinical data alone and benefits from bi-dimensional pre-screening. BMC Cancer 2022; 22:1045. PMID: 36199072; PMCID: PMC9533541; DOI: 10.1186/s12885-022-10117-1.
Abstract
BACKGROUND Prediction of patient survival from tumor molecular '-omics' data is a key step toward personalized medicine. Cox models applied to RNA profiling datasets are popular for clinical outcome prediction. However, these models operate in a "high-dimensional" setting, as the number p of covariates (gene expressions) greatly exceeds the number n of patients and the number e of events. Thus, pre-screening together with penalization methods is widely used for dimension reduction. METHODS In the present paper, (i) we benchmark the performance of the lasso penalization and three variants (ridge, elastic net, adaptive elastic net) on 16 cancers from TCGA after pre-screening, (ii) we propose a bi-dimensional pre-screening procedure based on both gene variability and p-values from single-variable Cox models to predict survival, and (iii) we compare our results with iterative sure independence screening (ISIS). RESULTS First, we show that integrating mRNA-seq data with clinical data improves predictions over clinical data alone. Second, our bi-dimensional pre-screening procedure yields modest improvements in the C-index and/or the integrated Brier score while excluding genes irrelevant for prediction. We demonstrate that the different penalization methods reach comparable prediction performance, with slight differences among datasets. Finally, we provide advice for the case of multi-omics data integration. CONCLUSIONS Tumor profiles convey more prognostic information than clinical variables such as stage for many cancer subtypes. Lasso and ridge penalizations perform similarly to elastic net penalizations for Cox models in high dimensions. Pre-screening the top 200 genes in terms of single-variable Cox model p-values is a practical way to reduce dimension, which may be particularly useful when integrating multi-omics data.
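The bi-dimensional pre-screening idea can be illustrated with a toy stand-in: rank genes by variance and by marginal association with the outcome, and keep only genes high on both dimensions. Here absolute correlation with log survival time replaces single-variable Cox p-values, so this is a sketch of the mechanism, not the authors' procedure:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 80, 500
X = rng.normal(size=(n, p))                 # toy expression matrix
X[:, :3] *= 2.0                             # make 3 informative genes more variable
log_time = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=n)

# Dimension 1: gene variability.
var_rank = np.argsort(-X.var(axis=0))

# Dimension 2: marginal association with the outcome (|correlation| with log
# survival time, a crude stand-in for single-variable Cox p-values).
corr = np.abs([np.corrcoef(X[:, j], log_time)[0, 1] for j in range(p)])
assoc_rank = np.argsort(-corr)

# Keep genes ranked in the top k on BOTH dimensions.
k = 50
keep = np.intersect1d(var_rank[:k], assoc_rank[:k])
print(len(keep), sorted(keep.tolist())[:3])
```

The penalized Cox model would then be fitted on the retained columns only.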
Affiliation(s)
- Rémy Jardillier
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Dzenis Koca
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
- Florent Chatelain
- GIPSA-lab, Institute of Engineering University Grenoble Alpes, Univ. Grenoble Alpes, CNRS, Grenoble INP, Grenoble, France
- Laurent Guyon
- IRIG, Biosanté U1292, Univ. Grenoble Alpes, Inserm, CEA, Grenoble, France
5
Abstract
The graph fused lasso, which includes the one-dimensional fused lasso as a special case, is widely used to reconstruct signals that are piecewise constant on a graph, meaning that nodes connected by an edge tend to have identical values. We consider testing for a difference in the means of two connected components estimated using the graph fused lasso. A naive procedure, such as a z-test for a difference in means, will not control the selective Type I error, since the hypothesis being tested is itself a function of the data. In this work, we propose a new test for this task that controls the selective Type I error and conditions on less information than existing approaches, leading to substantially higher power. We illustrate our approach in simulation and on datasets of drug overdose death rates and teenage birth rates in the contiguous United States. Our approach yields more discoveries on both datasets. Supplementary materials for this article are available online.
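Why the naive z-test fails can be seen in a small simulation: on pure noise, selecting the split that maximizes the test statistic and then testing that same split with a standard z-test rejects far more often than the nominal 5% level (an illustrative sketch of the selective Type I error problem, not the authors' method):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps, alpha = 40, 500, 0.05
rejections = 0
for _ in range(reps):
    y = rng.normal(size=n)  # pure noise: there is no true difference anywhere
    # Data-driven selection: choose the split maximising the z-statistic itself.
    best = max(
        ((y[:c].mean() - y[c:].mean()) / np.sqrt(1 / c + 1 / (n - c))
         for c in range(1, n)),
        key=abs,
    )
    # A naive z-test on the selected split ignores how the split was chosen.
    if 2 * stats.norm.sf(abs(best)) < alpha:
        rejections += 1
rate = rejections / reps
print(rate)  # far above the nominal 0.05
```

Selective tests fix this by conditioning on the selection event rather than treating the split as fixed in advance.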
Affiliation(s)
- Yiqun Chen
- Department of Biostatistics, University of Washington, Seattle, WA
- Sean Jewell
- Department of Statistics, University of Washington, Seattle, WA
- Daniela Witten
- Department of Biostatistics, University of Washington, Seattle, WA
- Department of Statistics, University of Washington, Seattle, WA
6
Kammer M, Dunkler D, Michiels S, Heinze G. Evaluating methods for Lasso selective inference in biomedical research: a comparative simulation study. BMC Med Res Methodol 2022; 22:206. PMID: 35883041; PMCID: PMC9316707; DOI: 10.1186/s12874-022-01681-y.
Abstract
Background Variable selection for regression models plays a key role in the analysis of biomedical data. However, inference after selection is not covered by classical frequentist theory, which assumes a fixed set of covariates in the model. This leads to over-optimistic selection and replicability issues. Methods We compared proposals for selective inference targeting the submodel parameters of the Lasso and its extension, the adaptive Lasso: sample splitting, selective inference conditional on the Lasso selection (SI), and universally valid post-selection inference (PoSI). We studied the properties of the proposed selective confidence intervals available via R software packages using a neutral simulation study inspired by real data commonly seen in biomedical studies. Furthermore, we present an exemplary application of these methods to a publicly available dataset to discuss their practical usability. Results Frequentist properties of selective confidence intervals by the SI method were generally acceptable, but the claimed selective coverage levels were not attained in all scenarios, in particular with the adaptive Lasso. The actual coverage of the extremely conservative PoSI method exceeded the nominal levels, and this method also required the greatest computational effort. Sample splitting achieved acceptable actual selective coverage levels, but the method is inefficient and leads to less accurate point estimates. The choice of inference method had a large impact on the resulting interval estimates, so the user must be acutely aware of the goal of inference in order to interpret and communicate the results. Conclusions Despite violating nominal coverage levels in some scenarios, selective inference conditional on the Lasso selection is our recommended approach for most cases. If simplicity is strongly favoured over efficiency, then sample splitting is an alternative. If only a few predictors undergo variable selection (i.e., up to 5) or the avoidance of false-positive claims of significance is a concern, then the conservative PoSI approach may be useful. For the adaptive Lasso, SI should be avoided; only PoSI and sample splitting are recommended. In summary, we find selective inference useful for assessing the uncertainties in the importance of individual selected predictors for future applications. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01681-y.
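Of the compared approaches, sample splitting is the simplest to sketch: select with the Lasso on one half of the data, then compute classical confidence intervals on the untouched half. A self-contained toy version with a minimal coordinate-descent Lasso (illustrative only, not the benchmarked R packages):

```python
import numpy as np
from scipy import stats

def lasso_cd(X, y, lam, n_iter=200):
    """Minimal coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (X[:, j] @ X[:, j] / n)
    return beta

rng = np.random.default_rng(5)
n, p = 200, 30
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(size=n)

half = n // 2
sel = np.flatnonzero(lasso_cd(X[:half], y[:half], lam=0.3) != 0)  # selection half

Xs, ys = X[half:], y[half:]                    # inference on the untouched half
betahat, *_ = np.linalg.lstsq(Xs[:, sel], ys, rcond=None)
resid = ys - Xs[:, sel] @ betahat
dof = len(ys) - len(sel)
se = np.sqrt((resid @ resid / dof) * np.diag(np.linalg.inv(Xs[:, sel].T @ Xs[:, sel])))
tq = stats.t.ppf(0.975, dof)
ci = np.column_stack([betahat - tq * se, betahat + tq * se])
print(dict(zip(sel.tolist(), np.round(betahat, 2))))
```

Because the inference half never saw the selection step, the resulting intervals are classically valid for the selected submodel, at the cost of halving the sample for each task.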
Affiliation(s)
- Michael Kammer
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
- Division of Nephrology and Dialysis, Department for Internal Medicine III, Medical University of Vienna, Vienna, Austria
- Daniela Dunkler
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
- Stefan Michiels
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, Oncostat U1018, INSERM, University Paris-Saclay, labeled Ligue Contre le Cancer, Villejuif, France
- Georg Heinze
- Section for Clinical Biometrics, Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
7
Yoo JE, Rho M. Large-Scale Survey Data Analysis with Penalized Regression: A Monte Carlo Simulation on Missing Categorical Predictors. Multivariate Behav Res 2022; 57:642-657. PMID: 33703972; DOI: 10.1080/00273171.2021.1891856.
Abstract
With the advent of the big data era, machine learning methods have evolved and proliferated. This study focused on penalized regression, a machine learning procedure that builds interpretable prediction models. In particular, penalized regression coupled with large-scale data can explore hundreds or thousands of variables in one statistical model without convergence problems and identify as-yet uninvestigated important predictors. As one of the first Monte Carlo simulation studies to investigate predictive modeling with missing categorical predictors in the context of social science research, this study endeavored to emulate real large-scale social science data. Likert-scaled variables were simulated, as well as multiple-category and count variables. Because categorical predictors were included in the modeling, penalized regression methods that consider the grouping effect, such as group Mnet, were employed. We also examined the applicability of the simulation conditions with the real large-scale dataset that the simulation study referenced. In particular, the study presents selection counts of variables after multiple iterations of modeling, to account for the bias resulting from data-splitting in model validation. Selection counts turned out to be a necessary tool when variable selection is of research interest. Efforts to utilize large-scale data to the fullest appear to offer a valid approach to mitigating the effect of nonignorable missingness. Overall, penalized regression, which assumes linearity, is a viable method for analyzing large-scale social science survey data.
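Selection counts can be sketched directly: refit a penalized model on many random subsamples and tally how often each predictor survives. The sketch below uses a plain lasso from scikit-learn on toy continuous data rather than group Mnet on categorical survey data, so it illustrates only the counting idea:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 300, 40
X = rng.normal(size=(n, p))
y = 1.0 * X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)

# Selection counts: tally how often each predictor survives the penalty
# across repeated random subsamples, rather than trusting a single split.
n_rep = 50
counts = np.zeros(p)
for _ in range(n_rep):
    idx = rng.choice(n, size=n // 2, replace=False)
    counts += Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_ != 0
print(counts[:4])
```

Stable predictors are selected in nearly every subsample, while predictors picked up by one lucky split accumulate low counts.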
Affiliation(s)
- Jin Eun Yoo
- Department of Education, Korea National University of Education
- Minjeong Rho
- Department of Education, Korea National University of Education
8
Ni A, Song C. Variable Selection for Time-to-Event Data. Methods Mol Biol 2021; 2194:61-76. PMID: 32926362; DOI: 10.1007/978-1-0716-0849-4_5.
Abstract
With the increasing availability of large-scale biomedical and -omics data, researchers are offered unprecedented opportunities to discover novel biomarkers for clinical outcomes. At the same time, they face great challenges in accurately identifying important biomarkers from numerous candidates. Many novel statistical methodologies have been developed to tackle these challenges over the last couple of decades. When the clinical outcome is time-to-event data, special statistical methods are needed due to the presence of censoring. In this article, we review some of the most commonly used modern statistical methodologies for variable selection with time-to-event data. The reviewed methods fall into three broad categories: filter-test-based methods, penalized regression methods, and machine learning methods.
Affiliation(s)
- Ai Ni
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Chi Song
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
9
Nené NR, Barrett J, Jones A, Evans I, Reisel D, Timms JF, Paprotka T, Leimbach A, Franchi D, Colombo N, Bjørge L, Zikan M, Cibula D, Widschwendter M. DNA methylation signatures to predict the cervicovaginal microbiome status. Clin Epigenetics 2020; 12:180. PMID: 33228781; PMCID: PMC7686703; DOI: 10.1186/s13148-020-00966-7.
Abstract
Background The composition of the microbiome plays an important role in human health and disease. Whether there is a direct association between the cervicovaginal microbiome and the host’s epigenome is largely unexplored.
Results Here we analyzed a total of 448 cervicovaginal smear samples and studied both the DNA methylome of the host and the microbiome, using the Illumina EPIC array and next-generation sequencing, respectively. We found that CpGs hypo-methylated in samples with non-lactobacilli-dominated (O-type) communities are strongly associated with gastrointestinal differentiation, and that a signature of 819 CpGs discriminated lactobacilli-dominated (L-type) from O-type samples with an area under the receiver operating characteristic curve (AUC) of 0.84 (95% CI 0.77-0.90) in an independent validation set. Performance improved further in samples with more than 50% epithelial cells (AUC 0.87) and was even higher in women younger than 50 years of age (AUC 0.91). In a subset of 96 women, buccal but not blood cell DNA showed the same trend as the cervicovaginal samples in discriminating women with L-type from O-type cervicovaginal communities. Conclusions These findings strongly support the view that the epithelial epigenome plays an essential role in hosting specific microbial communities.
Affiliation(s)
- Nuno R Nené
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- Department of Mathematics, University College London, London, UK
- James Barrett
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- European Translational Oncology Prevention and Screening (EUTOPS) Institute, 6060 Hall in Tirol, Austria
- Research Institute for Biomedical Aging Research, Universität Innsbruck, 6020 Innsbruck, Austria
- Allison Jones
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- Iona Evans
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- Daniel Reisel
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- John F Timms
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- Nicoletta Colombo
- Istituto Europeo di Oncologia, IRCCS, Milan, Italy
- University of Milano-Bicocca, Milan, Italy
- Line Bjørge
- Department of Obstetrics and Gynecology, Haukeland University Hospital, Bergen, Norway
- Centre for Cancer Biomarkers (CCBIO), Department of Clinical Science, University of Bergen, Bergen, Norway
- Michal Zikan
- Hospital Na Bulovce, Prague, Czech Republic
- Department of Obstetrics and Gynecology, General University Hospital in Prague, First Faculty of Medicine, Charles University, Prague, Czech Republic
- David Cibula
- Department of Obstetrics and Gynecology, General University Hospital in Prague, First Faculty of Medicine, Charles University, Prague, Czech Republic
- Martin Widschwendter
- Department of Women's Cancer, EGA Institute for Women's Health, University College London, London, UK
- European Translational Oncology Prevention and Screening (EUTOPS) Institute, 6060 Hall in Tirol, Austria
- Research Institute for Biomedical Aging Research, Universität Innsbruck, 6020 Innsbruck, Austria
10
Barbosa S, Khalfallah O, Forhan A, Galera C, Heude B, Glaichenhaus N, Davidovic L. Serum cytokines associated with behavior: A cross-sectional study in 5-year-old children. Brain Behav Immun 2020; 87:377-387. PMID: 31923553; DOI: 10.1016/j.bbi.2020.01.005.
Abstract
Nearly 10% of 5-year-old children experience social, emotional or behavioral problems and are at increased risk of developing mental disorders later in life. While animal and human studies have demonstrated that cytokines can regulate brain function, it is unclear whether individual cytokines are associated with specific behavioral dimensions in population-based pediatric samples. Here, we used data and biological samples from 786 mother-child pairs participating in the French national mother-child cohort EDEN. At the age of 5, children were assessed for behavioral difficulties using the Strengths and Difficulties Questionnaire (SDQ) and had their serum collected. Serum samples were analyzed for levels of well-characterized effector and regulatory cytokines. We then used a penalized logistic regression method (elastic net) to investigate associations between serum cytokine levels and each of the five SDQ-assessed behavioral dimensions, after adjustment for relevant covariates and confounders, including psychosocial variables. We found that interleukin (IL)-6, IL-7, and IL-15 were associated with increased odds of problems in prosocial behavior, emotions, and peer relationships, respectively. In contrast, eight cytokines were associated with decreased odds of problems in one dimension: IL-8, IL-10, and IL-17A with emotional problems, tumor necrosis factor (TNF)-α with conduct problems, C-C motif chemokine ligand (CCL)2 with hyperactivity/inattention, C-X-C motif chemokine ligand (CXCL)10 with peer problems, and CCL3 and IL-16 with abnormal prosocial behavior. Without implying causation, these associations support the notion that cytokines regulate brain function and behavior, and they provide a rationale for launching longitudinal studies.
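The modeling step described, elastic-net-penalized logistic regression of a behavioral outcome on cytokine levels plus covariates, can be sketched as follows (toy simulated data, not the EDEN cohort):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 400
cytokines = np.log(rng.lognormal(size=(n, 10)))  # toy log serum cytokine levels
covars = rng.normal(size=(n, 3))                 # toy covariates / confounders
X = np.column_stack([cytokines, covars])

# Simulate one behavioral dimension driven by two cytokines only.
logit = 0.8 * X[:, 0] - 0.8 * X[:, 1]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Elastic net = mix of l1 (selection) and l2 (shrinkage) penalties.
fit = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=0.5, max_iter=5000).fit(X, y)
print(np.round(fit.coef_[0, :2], 2))
```

The l1 component zeroes out uninformative cytokines while the l2 component stabilizes coefficients among correlated ones.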
Affiliation(s)
- Susana Barbosa
- Université Côte d'Azur, Centre National de la Recherche Scientifique, Institut de Pharmacologie Moléculaire et Cellulaire, Valbonne, France
- Olfa Khalfallah
- Université Côte d'Azur, Centre National de la Recherche Scientifique, Institut de Pharmacologie Moléculaire et Cellulaire, Valbonne, France
- Anne Forhan
- Université de Paris, Institut National de la Santé et de la Recherche Médicale, Institut National de la Recherche Agronomique, Centre de Recherche en Épidémiologie et Statistiques, Paris, France
- Cédric Galera
- University Bordeaux Segalen, Charles Perrens Hospital, Child and Adolescent Psychiatry Department, Bordeaux, France
- Barbara Heude
- Université de Paris, Institut National de la Santé et de la Recherche Médicale, Institut National de la Recherche Agronomique, Centre de Recherche en Épidémiologie et Statistiques, Paris, France
- Nicolas Glaichenhaus
- Université Côte d'Azur, Centre National de la Recherche Scientifique, Institut de Pharmacologie Moléculaire et Cellulaire, Valbonne, France
- Laetitia Davidovic
- Université Côte d'Azur, Centre National de la Recherche Scientifique, Institut de Pharmacologie Moléculaire et Cellulaire, Valbonne, France
11
Abstract
We consider high-dimensional regression over subgroups of observations. Our work is motivated by biomedical problems, where subsets of samples, representing for example disease subtypes, may differ with respect to underlying regression models. In the high-dimensional setting, estimating a different model for each subgroup is challenging due to limited sample sizes. Focusing on the case in which subgroup-specific models may be expected to be similar but not necessarily identical, we treat subgroups as related problem instances and jointly estimate subgroup-specific regression coefficients. This is done in a penalized framework, combining an $\ell_1$ term with an additional term that penalizes differences between subgroup-specific coefficients. This gives solutions that are globally sparse but that allow information-sharing between the subgroups. We present algorithms for estimation and empirical results on simulated data and using Alzheimer's disease, amyotrophic lateral sclerosis, and cancer datasets. These examples demonstrate the gains joint estimation can offer in prediction as well as in providing subgroup-specific sparsity patterns.
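One common way to write such a joint objective, consistent with the description above but in illustrative notation that need not match the authors' exact formulation, is:

```latex
\min_{\beta_1,\dots,\beta_G} \;
\sum_{g=1}^{G} \bigl\lVert y_g - X_g \beta_g \bigr\rVert_2^2
\;+\; \lambda_1 \sum_{g=1}^{G} \lVert \beta_g \rVert_1
\;+\; \lambda_2 \sum_{g < g'} \lVert \beta_g - \beta_{g'} \rVert_1
```

Here $g$ indexes subgroups, the $\lambda_1$ term gives global sparsity, and the $\lambda_2$ fusion term shrinks subgroup-specific coefficients toward each other, sharing information when subgroups are similar.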
Affiliation(s)
- Frank Dondelinger
- Lancaster Medical School, Lancaster University, Furness College, Bailrigg, Lancaster, UK
- Sach Mukherjee
- Statistics and Machine Learning, German Center for Neurodegenerative Diseases (DZNE), Sigmund-Freud-Straße 27, Bonn, Germany
12
Sheng A, Ghosh SK. Effects of Proportional Hazard Assumption on Variable Selection Methods for Censored Data. Stat Biopharm Res 2020; 12:199-209. PMID: 34040695; DOI: 10.1080/19466315.2019.1694578.
Abstract
The Cox proportional hazards (PH) model is widely used to determine the effects of risk factors and treatments (covariates) on the survival time of subjects, which may be right-censored. The selection of covariates depends crucially on the specific form of the conditional hazard model, which is often assumed to be PH, accelerated failure time (AFT), or proportional odds (PO). However, we show that none of these semi-parametric models allows for crossing survival functions, and hence such strong assumptions may adversely affect the selection of variables. Moreover, the commonly used PH assumption may also be violated when there is a delayed effect of the risk factors. Taking these modeling assumptions into account, this study examines the effect of the PH assumption on covariate selection when the data-generating model may be non-PH. In particular, variable selection under two alternative models is explored: (i) the penalized PH model (using the elastic-net penalty) and (ii) a linear-spline-based hazard regression model. We apply these models to the ACTG-175 dataset and to simulated datasets with survival times generated from the Weibull and log-normal distributions. We also examine the effect on covariate selection of stratifying the analysis on the off-treatment indicator.
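For reference, standard forms of the three semi-parametric assumptions discussed, with baseline hazard $h_0$, baseline survival $S_0$, and covariate vector $x$:

```latex
\text{PH:}\quad h(t \mid x) = h_0(t)\, e^{x^\top \beta}
\qquad
\text{AFT:}\quad \log T = x^\top \beta + \epsilon
\qquad
\text{PO:}\quad \frac{1 - S(t \mid x)}{S(t \mid x)}
             = \frac{1 - S_0(t)}{S_0(t)}\, e^{x^\top \beta}
```

In each case the covariate effect enters through a single monotone transformation of a baseline function, which is why none of these families can produce survival curves that cross.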
Affiliation(s)
- Alvin Sheng
- Department of Statistics, North Carolina State University
- Sujit K Ghosh
- Department of Statistics, North Carolina State University
13
Wang F, Mukherjee S, Richardson S, Hill SM. High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking. Stat Comput 2019; 30:697-719. PMID: 32132772; PMCID: PMC7026376; DOI: 10.1007/s11222-019-09914-9.
Abstract
Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper, we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2300 data-generating scenarios, including both synthetic and semisynthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a "no panacea" view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.
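As a reminder of the contrast between the penalization families compared here: ridge has a closed-form solution that shrinks all coefficients smoothly, whereas the lasso-type methods additionally set coefficients exactly to zero. A toy sketch of the ridge shrinkage effect (illustrative values throughout):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# Ridge has a closed form; a larger lam shrinks every coefficient toward zero.
def ridge(lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

b_ols, b_ridge = ridge(0.0), ridge(50.0)
print(round(b_ols[0], 2), round(b_ridge[0], 2))
```

How much such shrinkage helps or hurts prediction versus selection is exactly the kind of finite-sample question the benchmark addresses.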
Affiliation(s)
- Fan Wang
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Sach Mukherjee
- German Centre for Neurodegenerative Diseases (DZNE), Bonn, Germany
- Steven M. Hill
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
14
Velten B, Huber W. Adaptive penalization in high-dimensional regression and classification with external covariates using variational Bayes. Biostatistics 2019; 22:348-364. [PMID: 31596468 PMCID: PMC8036004 DOI: 10.1093/biostatistics/kxz034] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2018] [Revised: 06/27/2019] [Accepted: 08/14/2019] [Indexed: 12/18/2022] Open
Abstract
Penalization schemes like Lasso or ridge regression are routinely used to regress a response of interest on a high-dimensional set of potential predictors. Despite being decisive, the question of the relative strength of penalization is often glossed over and only implicitly determined by the scale of individual predictors. At the same time, additional information on the predictors is available in many applications but left unused. Here, we propose to make use of such external covariates to adapt the penalization in a data-driven manner. We present a method that differentially penalizes feature groups defined by the covariates and adapts the relative strength of penalization to the information content of each group. Using techniques from the Bayesian tool-set our procedure combines shrinkage with feature selection and provides a scalable optimization scheme. We demonstrate in simulations that the method accurately recovers the true effect sizes and sparsity patterns per feature group. Furthermore, it leads to an improved prediction performance in situations where the groups have strong differences in dynamic range. In applications to data from high-throughput biology, the method enables re-weighting the importance of feature groups from different assays. Overall, using available covariates extends the range of applications of penalized regression, improves model interpretability and can improve prediction performance.
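The paper above adapts group-wise penalty strength via variational Bayes. As a much simpler hedged sketch of the underlying idea (differentially penalizing feature groups), one can emulate a weighted lasso by rescaling columns, since penalizing coefficient j with weight w_j is equivalent to an ordinary lasso on X_j / w_j. The two groups and their weights below are hypothetical, not the paper's data-driven estimates:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 120, 20
X = rng.standard_normal((n, p))
beta = np.concatenate([np.full(10, 1.5), np.zeros(10)])  # group A informative, group B noise
y = X @ beta + rng.standard_normal(n)

# Hypothetical group-specific penalty weights: a feature with weight w is
# penalized by lambda * w, implemented by rescaling its column by 1 / w.
weights = np.concatenate([np.full(10, 0.5), np.full(10, 2.0)])
X_scaled = X / weights           # column j scaled by 1 / weights[j]

fit = Lasso(alpha=0.1).fit(X_scaled, y)
beta_hat = fit.coef_ / weights   # map coefficients back to the original scale
n_selected_A = int(np.count_nonzero(beta_hat[:10]))
n_selected_B = int(np.count_nonzero(beta_hat[10:]))
```

The lightly penalized informative group retains more features than the heavily penalized noise group, which is the qualitative behavior the adaptive method learns from the external covariates automatically.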
Affiliation(s)
- Britta Velten
- Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
- Wolfgang Huber
- Genome Biology Unit, European Molecular Biology Laboratory, Meyerhofstr. 1, 69117 Heidelberg, Germany
15
Abstract
Penalized regression methods are an attractive tool for high-dimensional data analysis, but their widespread adoption has been hampered by the difficulty of applying inferential tools. In particular, the question "How reliable is the selection of those features?" has proved difficult to address. In part, this difficulty arises from defining false discoveries in the classical, fully conditional sense, which is possible in low dimensions but does not scale well to high-dimensional settings. Here, we consider the analysis of marginal false discovery rates (mFDRs) for penalized regression methods. Restricting attention to the mFDR permits straightforward estimation of the number of selections that would likely have occurred by chance alone, and therefore provides a useful summary of selection reliability. Theoretical analysis and simulation studies demonstrate that this approach is quite accurate when the correlation among predictors is mild, and only slightly conservative when the correlation is stronger. Finally, the practical utility of the proposed method and its considerable advantages over other approaches are illustrated using gene expression data from The Cancer Genome Atlas and genome-wide association study data from the Myocardial Applied Genomics Network.
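The paper's mFDR estimator itself is not reproduced here; as a loose, hedged analogue of "the number of selections that would likely have occurred by chance alone", the sketch below permutes the response to destroy all real signal and counts how many features a lasso still selects. This permutation device is a simpler stand-in for the paper's analytical estimator, shown only to make the idea concrete, and all simulation settings are assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 40
X = rng.standard_normal((n, p))
y = X[:, :4] @ np.full(4, 1.0) + rng.standard_normal(n)  # 4 true predictors

alpha = 0.5
observed = int(np.count_nonzero(Lasso(alpha=alpha).fit(X, y).coef_))

# Permuting y breaks every real X-y association, so any feature selected on
# permuted data was selected "by chance alone".
chance = [
    int(np.count_nonzero(Lasso(alpha=alpha).fit(X, rng.permutation(y)).coef_))
    for _ in range(20)
]
expected_by_chance = float(np.mean(chance))
```

Comparing the observed selection count against the chance baseline gives a crude summary of selection reliability at a given penalty level.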
16
Garcia-Carretero R, Barquero-Perez O, Mora-Jimenez I, Soguero-Ruiz C, Goya-Esteban R, Ramos-Lopez J. Identification of clinically relevant features in hypertensive patients using penalized regression: a case study of cardiovascular events. Med Biol Eng Comput 2019; 57:2011-2026. [PMID: 31346948 DOI: 10.1007/s11517-019-02007-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2019] [Accepted: 06/24/2019] [Indexed: 12/18/2022]
Abstract
Appropriate management of hypertensive patients relies on the accurate identification of clinically relevant features. However, traditional statistical methods may ignore important information in datasets or overlook possible interactions among features. Machine learning may improve the prediction accuracy and interpretability of regression models by identifying the most relevant features in hypertensive patients. We sought the most relevant features for prediction of cardiovascular (CV) events in a hypertensive population. We used the penalized regression models least absolute shrinkage and selection operator (LASSO) and elastic net (EN) to obtain the most parsimonious and accurate models. The clinical parameters and laboratory biomarkers were collected from the clinical records of 1,471 patients receiving care at Mostoles University Hospital. The outcome was the development of major adverse CV events. Cox proportional hazards regression was performed alone and with penalized regression analyses (LASSO and EN), producing three models. The modeling was performed using 10-fold cross-validation to fit the penalized models. The three predictive models were compared and statistically analyzed to assess their classification accuracy, sensitivity, specificity, discriminative power, and calibration accuracy. The standard Cox model identified five relevant features, while LASSO and EN identified only three (age, LDL cholesterol, and kidney function). The accuracies of the models (prediction vs. observation) were 0.767 (Cox model), 0.754 (LASSO), and 0.764 (EN), and the areas under the curve were 0.694, 0.670, and 0.673, respectively. However, pairwise comparison of performance yielded no statistically significant differences. All three calibration curves showed close agreement between the predicted and observed probabilities of the development of a CV event. 
Although the performance was similar for all three models, both penalized regression analyses produced models with good fit and fewer features than the Cox predictive model at the same accuracy. This case study shows that penalized regularization techniques can provide predictive models for CV risk assessment that are parsimonious, highly interpretable, generalizable, and well calibrated. For clinicians, a parsimonious model can be useful where available data are limited, as it offers a simple but efficient way to model the impact of the different features on the prediction of CV events. Management of these features may lower the risk for a CV event.
Graphical Abstract: In a clinical setting, with numerous biological and laboratory features and incomplete datasets, traditional statistical methods may ignore important information and overlook possible interactions among features. Our aim was to identify the most relevant features for predicting cardiovascular events in a hypertensive population, using three different regression approaches for feature selection, thereby improving the prediction accuracy and interpretability of regression models.
Affiliation(s)
- Rafael Garcia-Carretero
- Internal Medicine Department, Mostoles University Hospital, Calle Rio Jucar, s/n, 28935, Mostoles, Madrid, Spain; Rey Juan Carlos University, Móstoles, Spain
- Oscar Barquero-Perez
- Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, Móstoles, Spain
- Inmaculada Mora-Jimenez
- Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, Móstoles, Spain
- Cristina Soguero-Ruiz
- Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, Móstoles, Spain
- Rebeca Goya-Esteban
- Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, Móstoles, Spain
- Javier Ramos-Lopez
- Department of Signal Theory and Communications and Telematics Systems and Computing, Rey Juan Carlos University, Móstoles, Spain
17
Abstract
Biomarkers play important roles in early diagnosis and treatment planning for cancer patients, and their importance is growing. With advances in high-throughput molecular profiling technology for various types of molecules, such as DNA, RNA, proteins, and metabolites, it is now possible to perform massive profiling analyses that accelerate the discovery of novel biomolecules. Because no single marker is sufficiently accurate for clinical use, cancer biomarkers are developed in the form of multiple-biomarker panels. Among the various types of molecular biomarkers, microRNA (miRNA) has recently emerged as a promising class of cancer biomarker. MiRNAs are small noncoding RNAs that regulate gene expression. This chapter overviews the process of identifying biomarker panels from miRNA profiles, focusing on statistical methods. Molecular cancer biomarkers are introduced first, and the methods section reviews the workflow from sample design to miRNA profiling. Statistical methods for biomarker development are then presented according to three typical purposes of molecular biomarkers: tumor subtype classification, early detection, and prediction of treatment response or patient prognosis. Example R code is provided for selected methods.
18
Abstract
BACKGROUND Accurate gene regulatory networks can be used to explain the emergence of different phenotypes, disease mechanisms, and other biological functions. Many methods have been proposed to infer networks from gene expression data but have been hampered by problems such as low sample size, inaccurate constraints, and incomplete characterizations of regulatory dynamics. Since expression regulation is dynamic, time-course data can be used to infer causality, but these datasets tend to be short or sparsely sampled. In addition, temporal methods typically assume that the expression of a gene at a time point depends on the expression of other genes at only the immediately preceding time point, while other methods include additional time points without any constraints to account for their temporal distance. These limitations can contribute to inaccurate networks with many missing and anomalous links. RESULTS We adapted the time-lagged Ordered Lasso, a regularized regression method with temporal monotonicity constraints, for de novo reconstruction. We also developed a semi-supervised method that embeds prior network information into the Ordered Lasso to discover novel regulatory dependencies in existing pathways. R code is available at https://github.com/pn51/laggedOrderedLassoNetwork . CONCLUSIONS We evaluated these approaches on simulated data for a repressilator, time-course data from past DREAM challenges, and a HeLa cell cycle dataset to show that they can produce accurate networks subject to the dynamics and assumptions of the time-lagged Ordered Lasso regression.
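The Ordered Lasso's lag-monotonicity constraint is the paper's contribution and is not reproduced here; the hedged sketch below shows only the shared preprocessing step — building a time-lagged design matrix from an expression time course — and fits a plain lasso to recover a planted lag-1 regulatory link. The sizes and the planted coefficient are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
T, genes, max_lag = 200, 3, 2
expr = rng.standard_normal((T, genes))
# Planted regulatory link: gene 0 at time t depends on gene 1 at time t-1.
expr[1:, 0] += 0.9 * expr[:-1, 1]

# Stack lagged copies of every gene as predictors for gene 0.
rows = []
for t in range(max_lag, T):
    rows.append(expr[t - max_lag:t, :][::-1].ravel())  # lag 1 first, then lag 2
X = np.asarray(rows)
y = expr[max_lag:, 0]

fit = Lasso(alpha=0.05).fit(X, y)
coefs = fit.coef_.reshape(max_lag, genes)  # coefs[l-1, g] = effect of gene g at lag l
```

A nonzero entry at (lag 1, gene 1) recovers the planted edge; the Ordered Lasso would additionally force each gene's coefficient magnitudes to be non-increasing in the lag.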
Affiliation(s)
- Phan Nguyen
- Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL, USA
- Rosemary Braun
- Department of Engineering Sciences and Applied Mathematics, Northwestern University, Evanston, IL, USA
- Biostatistics Division, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
19
Klau S, Jurinovic V, Hornung R, Herold T, Boulesteix AL. Priority-Lasso: a simple hierarchical approach to the prediction of clinical outcome using multi-omics data. BMC Bioinformatics 2018; 19:322. [PMID: 30208855 DOI: 10.1186/s12859-018-2344-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/19/2018] [Accepted: 08/29/2018] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND The inclusion of high-dimensional omics data in prediction models has become a well-studied topic in recent decades. However, most existing methods do not account for the possibility that the covariates available in a dataset comprise different types of variables, even though in many scenarios the variables can be structured in blocks of different types, e.g., clinical, transcriptomic, and methylation data. To date, only a few computationally intensive approaches make use of block structures of this kind. RESULTS In this paper we present priority-Lasso, an intuitive and practical analysis strategy for building Lasso-based prediction models that takes such block structures into account. It requires the definition of a priority order for the blocks of data. Lasso models are then fitted successively, one block at a time, and the fitted values from each step are included as an offset in the fit for the next block. We apply priority-Lasso in different settings to an acute myeloid leukemia (AML) dataset consisting of clinical variables, cytogenetics, gene mutations, and expression variables, and compare its performance on an independent validation dataset to that of standard Lasso models. CONCLUSION The results show that priority-Lasso keeps pace with Lasso in terms of prediction accuracy. Variables in higher-priority blocks are favored over variables in lower-priority blocks, which yields easily usable and transportable models for clinical practice.
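The successive-offset mechanism described above can be sketched for a Gaussian outcome, where including a fixed offset is equivalent to regressing the residuals of the previous step. This is a simplified two-block illustration on assumed synthetic data, not the priorityLasso R package:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n = 150
clinical = rng.standard_normal((n, 5))    # high-priority block
omics = rng.standard_normal((n, 100))     # low-priority block
y = clinical[:, 0] * 2.0 + omics[:, 0] * 1.0 + rng.standard_normal(n)

# Step 1: fit the high-priority block alone.
fit1 = LassoCV(cv=5).fit(clinical, y)
offset = fit1.predict(clinical)

# Step 2: fit the next block with the previous fit as a fixed offset.
# For a Gaussian outcome, an offset is equivalent to modeling the residuals.
fit2 = LassoCV(cv=5).fit(omics, y - offset)
```

Variables in earlier blocks are favored by construction, because later blocks can only explain whatever signal the earlier fit left over.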
20
Abstract
We compare alternative computing strategies for solving the constrained lasso problem. As its name suggests, the constrained lasso extends the widely-used lasso to handle linear constraints, which allow the user to incorporate prior information into the model. In addition to quadratic programming, we employ the alternating direction method of multipliers (ADMM) and also derive an efficient solution path algorithm. Through both simulations and benchmark data examples, we compare the different algorithms and provide practical recommendations in terms of efficiency and accuracy for various sizes of data. We also show that, for an arbitrary penalty matrix, the generalized lasso can be transformed to a constrained lasso, while the converse is not true. Thus, our methods can also be used for estimating a generalized lasso, which has wide-ranging applications. Code for implementing the algorithms is freely available in both the Matlab toolbox SparseReg and the Julia package ConstrainedLasso. Supplementary materials for this article are available online.
Affiliation(s)
- Brian R Gaines
- Department of Statistics, North Carolina State University
- Juhyun Kim
- Department of Biostatistics, University of California, Los Angeles (UCLA)
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles (UCLA)
21
Martínez-Ávila JC, García Bartolomé A, García I, Dapía I, Tong HY, Díaz L, Guerra P, Frías J, Carcás Sansuan AJ, Borobia AM. Pharmacometabolomics applied to zonisamide pharmacokinetic parameter prediction. Metabolomics 2018; 14:70. [PMID: 30830352 DOI: 10.1007/s11306-018-1365-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/21/2018] [Accepted: 04/25/2018] [Indexed: 10/16/2022]
Abstract
INTRODUCTION Zonisamide is a new-generation anticonvulsant antiepileptic drug metabolized primarily in the liver, with subsequent elimination via the renal route. OBJECTIVES Our objective was to evaluate the utility of pharmacometabolomics in the detection of zonisamide metabolites that could be related to its disposition and therefore to its efficacy and toxicity. METHODS This study was nested within a bioequivalence clinical trial with 28 healthy volunteers. Each participant received a single dose of zonisamide on two separate occasions (period 1 and period 2), with a washout period between them. Blood samples were obtained from all volunteers at baseline in each period, before any medication was administered, for metabolomics analysis. RESULTS After Lasso regression was applied, age, height, branched-chain amino acids, steroids, triacylglycerols, diacyl glycerophosphoethanolamine, glycerophospholipids susceptible to methylation, phosphatidylcholines with 20:4 FA (arachidonic acid) and cholesterol ester, and lysophosphatidylcholine were retained in both periods. CONCLUSION To our knowledge, this is the only research study to date that has attempted to link basal metabolomic status with the pharmacokinetic parameters of zonisamide.
Affiliation(s)
- J C Martínez-Ávila
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A García Bartolomé
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- I García
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- I Dapía
- Medical and Molecular Genetics Institute (INGEMM), La Paz University Hospital, Rare Diseases Networking Biomedical Research Center (CIBERER), ISCIII, Madrid, Spain
- Hoi Y Tong
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- L Díaz
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- P Guerra
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- J Frías
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A J Carcás Sansuan
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
- A M Borobia
- Clinical Pharmacology Department, La Paz University Hospital, School of Medicine, IdiPAZ, Autonomous University of Madrid, Madrid, Spain
22
Abstract
Despite overwhelming data on predictors of inpatient mortality, it is unclear which variables are most instructive for predicting the mortality of patients in departments of internal medicine. This study aims to identify the most informative predictors of inpatient mortality and to build an individual-level prediction model given a constellation of patient characteristics. We developed the prediction model using a penalized method, least absolute shrinkage and selection operator (LASSO) regression, in a cohort of adult patients admitted to any of 5 departments of internal medicine over 3.5 years. We integrated data from electronic health records that included clinical, epidemiological, administrative, and laboratory variables, and evaluated the prediction model on a validation sample. Of 10,788 patients hospitalized during the study period, 874 (8.1%) died during admission. The strongest predictors of inpatient mortality were prior admission within 3 months, malignant morbidity, serum creatinine level, hypoalbuminemia at hospital admission, and an admitting diagnosis of sepsis, pneumonia, malignant neoplastic disease, or cerebrovascular disease. The C-statistic of the risk prediction model was 89.4% (95% CI 88.4-90.4%), a better predictive performance than that of a multivariate stepwise logistic regression model. On the independent (validation) dataset, the AUC was 85.7% (95% CI 84.1-87.3%). Using penalized regression, this prediction model identifies the most informative predictors of inpatient mortality and illustrates the potential value and feasibility of a tool that can aid physicians in decision-making.
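As a hedged sketch of the modeling strategy described above (an L1-penalized logistic model for a binary in-hospital outcome, evaluated by discrimination), the code below fits a lasso-type logistic regression on synthetic data with a rare-ish outcome. The event rate, signal structure, and penalty strength are all assumptions, not values from the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
n, p = 600, 30
X = rng.standard_normal((n, p))
# Low intercept gives a rare-ish outcome, loosely mimicking an 8% event rate.
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1] - 2.5
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# L1-penalized logistic regression: C is the inverse penalty strength.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
n_kept = int(np.count_nonzero(clf.coef_))
```

The L1 penalty both shrinks coefficients and drops uninformative predictors, yielding a sparser model than unpenalized stepwise selection.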
Affiliation(s)
- Naama Schwartz
- Research Authority, Emek Medical Center, Clalit Health Services, Afula, Israel
- Ali Sakhnini
- Department of Medicine D, Emek Medical Center, Clalit Health Services, 21 Rabin Avenue, 18341, Afula, Israel
- Naiel Bisharat
- Department of Medicine D, Emek Medical Center, Clalit Health Services, 21 Rabin Avenue, 18341, Afula, Israel
- Ruth and Bruce Rappaport Faculty of Medicine, Technion-Israel Institute of Technology, Haifa, Israel
23
Wu Y, Cook RJ. Variable selection and prediction in biased samples with censored outcomes. Lifetime Data Anal 2018; 24:72-93. [PMID: 28215038 DOI: 10.1007/s10985-017-9392-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/29/2016] [Accepted: 02/07/2017] [Indexed: 06/06/2023]
Abstract
With the increasing availability of large prospective disease registries, scientists studying the course of chronic conditions often have access to multiple data sources, with each source generated based on its own entry conditions. The different entry conditions of the various registries may be explicitly based on the response process of interest, in which case the statistical analysis must recognize the unique truncation schemes. Moreover, intermittent assessment of individuals in the registries can lead to interval-censored times of interest. We consider the problem of selecting important prognostic biomarkers from a large set of candidates when the event times of interest are truncated and right- or interval-censored. Methods for penalized regression are adapted to handle truncation via a Turnbull-type complete data likelihood. An expectation-maximization algorithm is described which is empirically shown to perform well. Inverse probability weights are used to adjust for the selection bias when assessing predictive accuracy based on individuals whose event status is known at a time of interest. Application to the motivating study of the development of psoriatic arthritis in patients with psoriasis in both the psoriasis cohort and the psoriatic arthritis cohort illustrates the procedure.
Affiliation(s)
- Ying Wu
- Institute of Statistics, Nankai University, Tianjin, China
- Richard J Cook
- Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
24
Lee JW, Punshon T, Moen EL, Karagas MR, Gui J. Penalized estimation of sparse concentration matrices based on prior knowledge with applications to placenta elemental data. Comput Biol Chem 2017; 71:219-223. [PMID: 29153892 DOI: 10.1016/j.compbiolchem.2017.10.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2017] [Revised: 10/29/2017] [Accepted: 10/30/2017] [Indexed: 10/18/2022]
Abstract
Identifying patterns of association or dependency among high-dimensional biological datasets with sparse precision matrices remains a challenge. In this paper, we introduce a weighted sparse Gaussian graphical model that can incorporate prior knowledge to infer the network structure of trace element concentrations, including essential elements as well as toxic metals and metalloids, measured in human placentas. We present a weighted L1-penalized regularization procedure for estimating the sparse precision matrix in the setting of Gaussian graphical models. First, we use simulation models to demonstrate that the proposed method yields a better estimate of the precision matrix than procedures that fail to account for prior knowledge of the network structure. We then apply the method to estimate sparse element concentration matrices of placental biopsies from the New Hampshire Birth Cohort Study. Because the chemical architecture of the elements is complex, the method was used to infer their dependency structures using prior knowledge of their biological roles.
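The paper's estimator adds prior-knowledge weights to the graphical lasso penalty; scikit-learn ships only the unweighted graphical lasso, so the sketch below shows that baseline on a simulated chain-structured network (a weighted variant would replace the scalar alpha with an entry-specific penalty matrix, which `GraphicalLasso` does not expose). The chain structure and all numbers are illustrative assumptions:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(6)
p = 6
# Ground-truth sparse precision matrix: a chain graph 0-1-2-3-4-5.
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
cov = np.linalg.inv(prec)
X = rng.multivariate_normal(np.zeros(p), cov, size=500)

model = GraphicalLasso(alpha=0.05).fit(X)
est = model.precision_
# Off-diagonal entries shrunk to (near) zero indicate conditional independence.
edges = np.abs(est[np.triu_indices(p, k=1)]) > 1e-3
```

Nonzero off-diagonal entries of the estimated precision matrix correspond to edges of the dependency network, which is the object the placental-element analysis reports.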
Affiliation(s)
- Jai Woo Lee
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, United States
- Tracy Punshon
- Department of Biological Sciences, Dartmouth College, Hanover, NH, United States
- Erika L Moen
- Department of Epidemiology, Geisel School of Medicine, Lebanon, NH, United States; The Dartmouth Institute for Health Policy and Clinical Practice, Geisel School of Medicine, Lebanon, NH, United States
- Margaret R Karagas
- Department of Epidemiology, Geisel School of Medicine, Lebanon, NH, United States
- Jiang Gui
- Institute for Quantitative Biomedical Sciences, Dartmouth College, Hanover, NH, United States; Department of Biomedical Data Science, Geisel School of Medicine, Lebanon, NH, United States
25
Shen R, Luo L, Jiang H. Identification of gene pairs through penalized regression subject to constraints. BMC Bioinformatics 2017; 18:466. [PMID: 29100492 DOI: 10.1186/s12859-017-1872-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2017] [Accepted: 10/17/2017] [Indexed: 02/07/2023] Open
Abstract
Background This article concerns the identification of gene pairs or combinations of gene pairs associated with a biological phenotype or clinical outcome, allowing predictive models to be built that are not only robust to normalization but also easily validated and measured by qPCR techniques. However, with a small number of biological samples and a large number of genes, this problem entails high computational complexity and challenges the statistical accuracy of identification. Results In this paper, we propose a parsimonious model representation and develop efficient algorithms for identification. In particular, we derive an equivalent model subject to a sum-to-zero constraint in penalized linear regression and establish the correspondence between nonzero coefficients in the two models. Most importantly, this reduces the model complexity of the traditional approach from quadratic to linear order in the number of candidate genes, while overcoming the difficulty of model nonidentifiability. Computationally, we develop an algorithm using the alternating direction method of multipliers (ADMM) to handle the constraint. Numerically, we demonstrate that the proposed method outperforms the traditional method in statistical accuracy, and that our ADMM algorithm is more computationally efficient than a coordinate descent algorithm with a local search. Finally, we illustrate the proposed method on a prostate cancer dataset, identifying gene pairs associated with pre-operative prostate-specific antigen. Conclusion Our findings demonstrate the feasibility and utility of using gene pairs as biomarkers.
26
Sedighi Maman Z, Alamdar Yazdi MA, Cavuoto LA, Megahed FM. A data-driven approach to modeling physical fatigue in the workplace using wearable sensors. Appl Ergon 2017; 65:515-529. [PMID: 28259238 DOI: 10.1016/j.apergo.2017.02.001] [Citation(s) in RCA: 57] [Impact Index Per Article: 8.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/04/2016] [Revised: 01/28/2017] [Accepted: 02/01/2017] [Indexed: 05/14/2023]
Abstract
Wearable sensors are currently being used to manage fatigue in professional athletics and in the transportation and mining industries. In manufacturing, physical fatigue is a challenging ergonomic and safety issue, since it lowers productivity and increases the incidence of accidents; it therefore must be managed. This study has two main goals. First, we examine the use of wearable sensors to detect the occurrence of physical fatigue in simulated manufacturing tasks. Second, we estimate the level of physical fatigue over time. To achieve these goals, sensory data were recorded for eight healthy participants. Penalized logistic and multiple linear regression models were used for physical fatigue detection and level estimation, respectively. Important features from the five sensor locations were selected using the least absolute shrinkage and selection operator (LASSO), a popular variable selection methodology. The results show that the LASSO model performed well for both detecting physical fatigue and modeling its level. The modeling approach is not specific to a participant or workload regime and thus can be adopted for other applications.
Affiliation(s)
- Zahra Sedighi Maman
- Department of Industrial and Systems Engineering, Auburn University, AL 36849, USA
- Lora A Cavuoto
- Department of Industrial and Systems Engineering, University at Buffalo, NY 14260, USA
27
Lee HS, Krischer JP. A new framework for prediction and variable selection for uncommon events in a large prospective cohort study. Model Assist Stat Appl 2017; 12:227-237. [PMID: 29075164 PMCID: PMC5654558 DOI: 10.3233/mas-170397] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/07/2023]
Abstract
When prediction is a goal, validation using data outside of the prediction effort is desirable. Typically, data are split into two parts: one for development and one for validation. This approach becomes less attractive when predicting uncommon events, however, as it substantially reduces power. For predicting uncommon events within a large prospective cohort study, we propose the use of a nested case-control design as an alternative to a full cohort analysis. By including all cases but only a subset of the non-cases, this design is expected to produce results similar to the full cohort analysis. In our framework, variable selection is conducted and a prediction model is fit on the selected variables in the case-control cohort. The fraction of true negative predictions (specificity) of the fitted model in the case-control cohort is then compared to that in the rest of the cohort (the non-cases) for validation. In addition, we propose an iterative variable selection procedure that uses random forests for missing data imputation, as well as a strategy for valid classification. Our framework is illustrated with an application featuring high-dimensional variable selection in a large prospective cohort study.
Affiliation(s)
- Hye-Seung Lee
- Health Informatics Institute, 3650 Spectrum Blvd., Suite 100, University of South Florida, Tampa, Florida 33612
- Jeffrey P Krischer
- Health Informatics Institute, 3650 Spectrum Blvd., Suite 100, University of South Florida, Tampa, Florida 33612
28
Ternès N, Rotolo F, Michiels S. Robust estimation of the expected survival probabilities from high-dimensional Cox models with biomarker-by-treatment interactions in randomized clinical trials. BMC Med Res Methodol 2017; 17:83. [PMID: 28532387; PMCID: PMC5441049; DOI: 10.1186/s12874-017-0354-0]
Abstract
Background: Thanks to advances in genomics and targeted treatments, more and more biomarker-based prediction models are being developed to predict the potential benefit from treatments in randomized clinical trials. Although the methodological framework for the development and validation of prediction models in a high-dimensional setting is becoming well established, no clear guidance yet exists on how to estimate expected survival probabilities in a penalized model with biomarker-by-treatment interactions. Methods: Based on a parsimonious biomarker selection in a penalized high-dimensional Cox model (lasso or adaptive lasso), we propose a unified framework to: estimate internally the predictive accuracy metrics of the developed model (using double cross-validation); estimate the individual survival probabilities at a given timepoint; construct confidence intervals thereof (analytical or bootstrap); and visualize them graphically (pointwise or smoothed with splines). We compared these strategies through a simulation study covering scenarios with and without biomarker effects. We applied the strategies to a large randomized phase III clinical trial that evaluated the effect of adding trastuzumab to chemotherapy in 1574 early breast cancer patients, for whom the expression of 462 genes was measured. Results: In our simulations, penalized regression models using the adaptive lasso estimated the survival probability of new patients with low bias and standard error, and bootstrapped confidence intervals had empirical coverage probability close to the nominal level across very different scenarios. The double cross-validation performed on the training data set closely mimicked the predictive accuracy of the selected models in external validation data. We also propose a useful visual representation of the expected survival probabilities using splines. In the breast cancer trial, the adaptive lasso penalty selected a prediction model with 4 clinical covariates, the main effects of 98 biomarkers, and 24 biomarker-by-treatment interactions, but there was high variability of the expected survival probabilities, with very large confidence intervals. Conclusion: Based on our simulations, we propose a unified framework for: developing a prediction model with biomarker-by-treatment interactions in a high-dimensional setting and validating it in the absence of external data; accurately estimating the expected survival probability of future patients with associated confidence intervals; and graphically visualizing the developed prediction model. All the methods are implemented in the R package biospear, publicly available on CRAN.
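Since the full survival machinery is beyond a short example, the sketch below shows only the adaptive lasso building block named in the abstract, on a linear-model stand-in: ridge estimates supply the adaptive weights, and a plain lasso on rescaled columns is equivalent to the weighted penalty. All data and tuning values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 150, 30
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[[0, 3, 7]] = [2.0, -1.5, 1.0]      # a sparse set of true effects
y = X @ beta + rng.normal(size=n)

# stage 1: ridge gives initial estimates used as adaptive weights
init = Ridge(alpha=1.0).fit(X, y).coef_
w = np.abs(init) + 1e-8                 # weight w_j; penalty becomes sum |b_j| / w_j

# stage 2: a plain lasso on rescaled columns is equivalent to the weighted penalty
Xw = X * w
fit = Lasso(alpha=0.1).fit(Xw, y)
beta_alasso = fit.coef_ * w             # map coefficients back to the original scale
support = np.flatnonzero(beta_alasso)
```

Covariates with small initial estimates get heavily penalized and drop out, while strong effects are only lightly shrunk, which is the parsimony property the paper exploits.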
Affiliation(s)
- Nils Ternès
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC, 114 rue Edouard-Vaillant, 94805 Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, 94805 Villejuif, France
- Federico Rotolo
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC, 114 rue Edouard-Vaillant, 94805 Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, 94805 Villejuif, France
- Stefan Michiels
- Service de Biostatistique et d'Epidémiologie, Gustave Roussy, B2M, RdC, 114 rue Edouard-Vaillant, 94805 Villejuif, France; CESP, Fac. de médecine - Univ. Paris-Sud, Fac. de médecine - UVSQ, INSERM, Université Paris-Saclay, 94805 Villejuif, France
29
Zhai J, Hsu CH, Daye ZJ. Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer. BMC Med Res Methodol 2017; 17:12. [PMID: 28122498; PMCID: PMC5267467; DOI: 10.1186/s12874-017-0291-y]
Abstract
Background: Many questions in statistical genomics can be formulated as variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, additional covariates describing clinical, demographic, or experimental effects must be included a priori as mandatory covariates, while allowing selection among a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled variable selection methods that can incorporate them. Methods: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigate the effects of multicollinearity among the mandatory covariates and possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method, and we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software. Results: The ridle is useful when mandatory predictors are known to be significant from prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown the ridle to be advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer identified 24 genes, of which 2 were selected only by the ridle among current methods and were found to be associated with tumor grade. Conclusions: We proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves.
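The ridle objective (ridge penalty on the mandatory block, L1 penalty on the optional block) can be sketched with a small proximal-gradient solver. This is a minimal reimplementation of the idea, not the authors' Fortran code, and all tuning constants are illustrative.

```python
import numpy as np

def ridle(X_m, X_o, y, lam1=0.1, lam2=0.1, n_iter=1000):
    """Sketch of a ridge-lasso hybrid: ridge penalty on the mandatory block X_m,
    L1 (selection) penalty on the optional block X_o, fit by proximal gradient."""
    n = len(y)
    X = np.hstack([X_m, X_o])
    pm = X_m.shape[1]
    b = np.zeros(X.shape[1])
    # Lipschitz constant of the smooth part (squared loss + ridge term)
    L = np.linalg.eigvalsh(X.T @ X / n)[-1] + lam2
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        grad[:pm] += lam2 * b[:pm]               # ridge gradient, mandatory block only
        b = b - grad / L
        # soft-thresholding (prox of the L1 norm) on the optional block only
        b[pm:] = np.sign(b[pm:]) * np.maximum(np.abs(b[pm:]) - lam1 / L, 0.0)
    return b[:pm], b[pm:]

# toy data: 2 mandatory covariates (always kept) and 20 optional candidates
rng = np.random.default_rng(3)
n = 200
X_m = rng.normal(size=(n, 2))
X_o = rng.normal(size=(n, 20))
y = X_m @ np.array([1.0, -1.0]) + 1.5 * X_o[:, 0] + rng.normal(size=n)
b_m, b_o = ridle(X_m, X_o, y)
```

The ridge part shrinks but never zeroes the mandatory coefficients, while soft-thresholding drives most irrelevant optional coefficients exactly to zero.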
Affiliation(s)
- Jing Zhai
- Epidemiology and Biostatistics Department, University of Arizona, Tucson, USA
- Chiu-Hsieh Hsu
- Epidemiology and Biostatistics Department, University of Arizona, Tucson, USA
- Z John Daye
- Epidemiology and Biostatistics Department, University of Arizona, Tucson, USA
30
Moradi E, Hallikainen I, Hänninen T, Tohka J. Rey's Auditory Verbal Learning Test scores can be predicted from whole brain MRI in Alzheimer's disease. Neuroimage Clin 2016; 13:415-427. [PMID: 28116234; PMCID: PMC5233798; DOI: 10.1016/j.nicl.2016.12.011]
Abstract
Rey's Auditory Verbal Learning Test (RAVLT) is a powerful neuropsychological tool for testing episodic memory and is widely used for cognitive assessment in dementia and pre-dementia conditions. Several studies have shown that impairment in RAVLT scores reflects well the underlying pathology caused by Alzheimer's disease (AD), making RAVLT an effective early marker for detecting AD in persons with memory complaints. We investigated the association between RAVLT scores (RAVLT Immediate and RAVLT Percent Forgetting) and the structural brain atrophy caused by AD. The aim was to study comprehensively to what extent the RAVLT scores are predictable from structural magnetic resonance imaging (MRI) data using machine learning approaches, and to find the most important brain regions for the estimation of RAVLT scores. To this end, we built a predictive model to estimate RAVLT scores from gray matter density via an elastic net penalized linear regression model. The proposed approach provided a highly significant cross-validated correlation between the estimated and observed RAVLT Immediate (R = 0.50) and RAVLT Percent Forgetting (R = 0.43) scores in a dataset of 806 AD, mild cognitive impairment (MCI), or healthy subjects. In addition, the selected machine learning method provided more accurate estimates of RAVLT scores than the relevance vector regression used earlier for estimating RAVLT from MRI data. The top predictors were medial temporal lobe structures and the amygdala for RAVLT Immediate, and the angular gyrus, hippocampus, and amygdala for RAVLT Percent Forgetting. Further, the conversion of MCI subjects to AD within 3 years could be predicted from either observed or estimated RAVLT scores with an accuracy comparable to MRI-based biomarkers.
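A minimal analogue of this prediction pipeline, using scikit-learn's elastic net with cross-validated correlation as the performance measure, might look like the following; the synthetic "voxel" features stand in for gray matter density maps and all sizes are illustrative.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
n, p = 120, 300                         # more "voxels" than subjects, as in imaging data
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0                         # a sparse set of informative regions
score = X @ beta + rng.normal(size=n)   # synthetic stand-in for an RAVLT score

# elastic net with cross-validated penalties; performance summarized by the
# cross-validated correlation between estimated and observed scores
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0)
pred = cross_val_predict(enet, X, score, cv=5)
r = float(np.corrcoef(pred, score)[0, 1])
```

Reporting the correlation between out-of-fold predictions and observed scores mirrors the cross-validated R values quoted in the abstract.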
Affiliation(s)
- Elaheh Moradi
- Institute of Biosciences and Medical Technology, University of Tampere, Tampere, Finland
- Ilona Hallikainen
- University of Eastern Finland, Institute of Clinical Medicine, Department of Neurology, Kuopio, Finland
- Tuomo Hänninen
- Neurocenter, Neurology, Kuopio University Hospital, Kuopio, Finland
- Jussi Tohka
- Department of Bioengineering and Aerospace Engineering, Universidad Carlos III de Madrid, Leganes, Spain
- Instituto de Investigación Sanitaria Gregorio Marañon, Madrid, Spain
- University of Eastern Finland, AI Virtanen Institute for Molecular Sciences, Kuopio, Finland
31
Abstract
BACKGROUND The case-crossover design is an attractive alternative to the classical case-control design that can be used to study the onset of acute events when the risk factors of interest vary in time. By comparing exposures within cases at different time periods, the case-crossover design does not rely on control subjects, who can be difficult to acquire. However, with the standard method of maximum likelihood, the resulting risk estimates can be heavily biased when the prevalence of risk factors is very low (or very high). METHODS To overcome the problem of low risk factor prevalence, penalized conditional logistic regression via the lasso (least absolute shrinkage and selection operator) has been proposed in the literature, along with related methods such as the Firth correction. We apply and compare several penalized regression approaches in the context of a case-crossover analysis of the European Study of Severe Cutaneous Adverse Reactions (EuroSCAR; 1997-2001). RESULTS Out of 30 drugs, standard methods correctly classified only 17 drugs (including some highly implausible risk estimates), while penalized methods correctly classified 22 drugs. CONCLUSION Penalized methods generally yield better risk classifications and much more plausible risk estimates for the EuroSCAR study than standard methods. As these techniques can be easily implemented using available R packages, we encourage routine use of penalized conditional logistic regression for case-crossover data.
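For 1:1 matched data (one hazard period and one referent period per case), the conditional likelihood coincides with an intercept-free logistic model on within-pair exposure differences, so a lasso-penalized fit can be sketched with standard tools. The data below are synthetic, not EuroSCAR; the random sign-flip only creates the two classes the software expects and leaves the likelihood unchanged.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n_pairs, n_drugs = 400, 12
# binary drug exposures in the hazard period (case) and a referent period
x_case = rng.binomial(1, 0.15, size=(n_pairs, n_drugs))
x_case[:, 0] |= rng.binomial(1, 0.25, size=n_pairs)   # drug 0 raises risk
x_ctrl = rng.binomial(1, 0.15, size=(n_pairs, n_drugs))

# for 1:1 matching, the conditional likelihood equals an intercept-free
# logistic model on within-pair differences; flipping a random half of the
# pairs creates the two classes without changing that likelihood
d = x_case - x_ctrl
lab = np.ones(n_pairs)
flip = rng.random(n_pairs) < 0.5
d[flip] *= -1
lab[flip] = 0.0

clf = LogisticRegression(penalty="l1", C=0.5, fit_intercept=False, solver="liblinear")
clf.fit(d, lab)
coef = clf.coef_.ravel()
```

The L1 penalty shrinks the coefficients of rarely discordant drugs toward (and often exactly to) zero, which is how penalization stabilizes estimates for low-prevalence exposures.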
Affiliation(s)
- Sam Doerken
- Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Maja Mockenhaupt
- Dokumentationszentrum schwerer Hautreaktionen (dZh), Medical Center, University of Freiburg, Freiburg, Germany
- Luigi Naldi
- USC di Dermatologia, Azienda Ospedaliero Papa Giovanni XXIII, Bergamo, Italy
- Martin Schumacher
- Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
- Peggy Sekula
- Institute for Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Freiburg, Germany
32
Ojeda FM, Müller C, Börnigen D, Trégouët DA, Schillert A, Heinig M, Zeller T, Schnabel RB. Comparison of Cox Model Methods in A Low-dimensional Setting with Few Events. Genomics Proteomics Bioinformatics 2016; 14:235-43. [PMID: 27224515; PMCID: PMC4996851; DOI: 10.1016/j.gpb.2016.03.006]
Abstract
Prognostic models based on survival data frequently make use of the Cox proportional hazards model. Developing reliable Cox models with few events relative to the number of predictors can be challenging, even in low-dimensional datasets with many more observations than variables. In such a setting, we examined the performance of several methods for estimating a Cox model: (i) the full model using all available predictors, estimated by standard techniques; (ii) backward elimination (BE); (iii) ridge regression; (iv) the least absolute shrinkage and selection operator (lasso); and (v) the elastic net. Based on a prospective cohort of patients with manifest coronary artery disease (CAD), we performed a simulation study to compare the predictive accuracy, calibration, and discrimination of these approaches. The candidate predictors for incident cardiovascular events included clinical variables, biomarkers, and a selection of genetic variants associated with CAD. The penalized methods, i.e., ridge, lasso, and elastic net, showed comparable performance in terms of predictive accuracy, calibration, and discrimination, and outperformed BE and the full model. Excessive shrinkage was observed in some cases for the penalized methods, mostly in the simulation scenarios with the lowest ratio of the number of events to the number of variables. We conclude that in similar settings these three penalized methods can be used interchangeably. The full model and backward elimination are not recommended in rare event scenarios.
Affiliation(s)
- Francisco M Ojeda
- Department of General and Interventional Cardiology, University Heart Center Hamburg-Eppendorf, 20246 Hamburg, Germany.
- Christian Müller
- Department of General and Interventional Cardiology, University Heart Center Hamburg-Eppendorf, 20246 Hamburg, Germany; German Center for Cardiovascular Research (DZHK), Hamburg/Kiel/Luebeck, Germany
- Daniela Börnigen
- Department of General and Interventional Cardiology, University Heart Center Hamburg-Eppendorf, 20246 Hamburg, Germany; German Center for Cardiovascular Research (DZHK), Hamburg/Kiel/Luebeck, Germany
- David-Alexandre Trégouët
- Sorbonne Universités, Université Pierre et Marie Curie Paris 06, Institut National pour la Santé et la Recherche Médicale (INSERM), Unité Mixte de Recherche en Santé (UMR_S) 1166, F-75013 Paris, France; Institute for Cardiometabolism and Nutrition (ICAN), F-75013 Paris, France
- Arne Schillert
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, 23562 Lübeck, Germany; German Center for Cardiovascular Research (DZHK), Hamburg/Kiel/Luebeck, Germany
- Matthias Heinig
- Institute of Computational Biology, German Research Center for Environmental Health, Helmholtz Zentrum München, 85764 Neuherberg, Germany
- Tanja Zeller
- Department of General and Interventional Cardiology, University Heart Center Hamburg-Eppendorf, 20246 Hamburg, Germany; German Center for Cardiovascular Research (DZHK), Hamburg/Kiel/Luebeck, Germany
- Renate B Schnabel
- Department of General and Interventional Cardiology, University Heart Center Hamburg-Eppendorf, 20246 Hamburg, Germany; German Center for Cardiovascular Research (DZHK), Hamburg/Kiel/Luebeck, Germany
33
Zhao LP, Bolouri H. Object-oriented regression for building predictive models with high dimensional omics data from translational studies. J Biomed Inform 2016; 60:431-45. [PMID: 26972839; PMCID: PMC5097461; DOI: 10.1016/j.jbi.2016.03.001]
Abstract
Maturing omics technologies enable researchers to routinely generate high-dimensional omics data (HDOD) in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) has funded researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is building predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with a prognostic outcome. By computing patients' similarities to these exemplars, the OOR-based predictive model produces a risk estimate from a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining interpretability for clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identifying these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building the predictive model in the training set, we compute risk scores from it and validate the association of the risk scores with the prognostic outcome in the validation data (P-value = 0.015).
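A toy sketch of the similarity-to-exemplar idea follows. How OOR actually chooses exemplars is specific to the paper; here k-means centroids serve as an assumed stand-in, with correlation as the similarity measure and an ordinary logistic model on the resulting low-dimensional representation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n, p, k = 150, 100, 4
X = rng.normal(size=(n, p))            # stand-in for high-dimensional omics profiles
risk = (X[:, 0] + rng.normal(size=n) > 0).astype(int)

# step 1 (assumed stand-in): pick k exemplar profiles, here k-means centroids
exemplars = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).cluster_centers_

# step 2: represent each patient by similarity (correlation) to each exemplar
sim = np.corrcoef(X, exemplars)[:n, n:]          # n x k similarity matrix

# step 3: an ordinary low-dimensional model on similarities gives the risk score
clf = LogisticRegression().fit(sim, risk)
```

Regardless of how large p is, the regression sees only k similarity features, which is the dimension-reduction benefit the abstract describes.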
Affiliation(s)
- Lue Ping Zhao
- Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, Seattle, WA, United States; Department of Biostatistics and Epidemiology, University of Washington School of Public Health, Seattle, WA, United States.
- Hamid Bolouri
- Division of Human Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, United States
34
Abstract
Penalized regression methods, such as L1 regularization, are routinely used in high-dimensional applications, and there is a rich literature on optimality properties under sparsity assumptions. In the Bayesian paradigm, sparsity is routinely induced through two-component mixture priors having a probability mass at zero, but such priors encounter daunting computational problems in high dimensions. This has motivated continuous shrinkage priors, which can be expressed as global-local scale mixtures of Gaussians, facilitating computation. In contrast to the frequentist literature, little is known about the properties of such priors and the convergence and concentration of the corresponding posterior distribution. In this article, we propose a new class of Dirichlet-Laplace priors, which possess optimal posterior concentration and lead to efficient posterior computation. Finite sample performance of Dirichlet-Laplace priors relative to alternatives is assessed in simulated and real data examples.
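The global-local mixture representation makes the prior easy to simulate. Below is one draw from a Dirichlet-Laplace prior using the hierarchy described in the abstract (psi_j ~ Exp(1/2), phi ~ Dirichlet(a, ..., a), tau ~ Gamma(pa, 1/2)); the choice a = 1/p is one setting discussed in this literature, and the code is a prior-simulation sketch, not a posterior sampler.

```python
import numpy as np

def draw_dirichlet_laplace(p, a, rng):
    """One draw from the DL_a prior via its global-local Gaussian scale mixture:
    psi_j ~ Exp(1/2), phi ~ Dirichlet(a, ..., a), tau ~ Gamma(p*a, rate 1/2),
    theta_j | psi, phi, tau ~ N(0, psi_j * phi_j**2 * tau**2)."""
    psi = rng.exponential(scale=2.0, size=p)     # Exp(rate=1/2) has mean 2
    phi = rng.dirichlet(np.full(p, a))
    tau = rng.gamma(shape=p * a, scale=2.0)      # Gamma(p*a, rate=1/2)
    return rng.normal(size=p) * np.sqrt(psi) * phi * tau

rng = np.random.default_rng(7)
p = 1000
theta = draw_dirichlet_laplace(p, a=1.0 / p, rng=rng)
# small a concentrates the Dirichlet weights: most coordinates land near zero
# while a few can be large, mimicking a sparse coefficient vector
```

Unlike a two-component point-mass mixture, every conditional here is a standard continuous distribution, which is what makes posterior computation with such priors tractable.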
Affiliation(s)
- Anirban Bhattacharya
- Department of Statistics, Texas A&M University, Department of Statistics, Florida State University, Department of Statistics, Harvard University, Department of Statistical Science, Duke University
- Debdeep Pati
- Department of Statistics, Texas A&M University, Department of Statistics, Florida State University, Department of Statistics, Harvard University, Department of Statistical Science, Duke University
- Natesh S. Pillai
- Department of Statistics, Texas A&M University, Department of Statistics, Florida State University, Department of Statistics, Harvard University, Department of Statistical Science, Duke University
- David B. Dunson
- Department of Statistics, Texas A&M University, Department of Statistics, Florida State University, Department of Statistics, Harvard University, Department of Statistical Science, Duke University
35
Ha MJ, Sun W, Xie J. PenPC: A two-step approach to estimate the skeletons of high-dimensional directed acyclic graphs. Biometrics 2015; 72:146-55. [PMID: 26406114; DOI: 10.1111/biom.12415]
Abstract
Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel two-step method named PenPC to estimate the skeleton of a high-dimensional DAG. We first estimate the nonzero entries of a concentration matrix using penalized regression, and then correct the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high-dimensional problems where the number of vertices p is polynomial or exponential in the sample size n, we study the asymptotic properties of PenPC on two types of graphs: traditional random graphs, where all vertices have the same expected number of neighbors, and scale-free graphs, where a few vertices may have a large number of neighbors. As illustrated by extensive simulations and applications to gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art PC-stable algorithm.
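A much-simplified sketch of the two steps on a toy chain graph: node-wise lasso for the nonzero pattern of the concentration matrix (the moral graph), followed by pruning with order-1 partial-correlation tests. The real PenPC searches larger conditioning sets and uses a different penalty; everything below is illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
n, p = 300, 6
# chain DAG 0 -> 1 -> ... -> 5; its skeleton is the set of consecutive pairs
X = np.zeros((n, p))
X[:, 0] = rng.normal(size=n)
for j in range(1, p):
    X[:, j] = 0.8 * X[:, j - 1] + rng.normal(size=n)

# step 1: node-wise lasso recovers the nonzero pattern of the concentration
# matrix (the moral graph), which may contain extra edges
moral = set()
for j in range(p):
    others = [k for k in range(p) if k != j]
    coef = LassoCV(cv=5).fit(X[:, others], X[:, j]).coef_
    moral |= {frozenset((j, k)) for k, c in zip(others, coef) if abs(c) > 1e-3}

# step 2: drop an edge if the pair is independent given some single node
# (order-1 partial correlation with a Fisher z test)
def keeps_edge(i, j, z_crit=2.58):
    for k in set(range(p)) - {i, j}:
        r = np.corrcoef(X[:, [i, j, k]].T)
        pc = (r[0, 1] - r[0, 2] * r[1, 2]) / np.sqrt((1 - r[0, 2] ** 2) * (1 - r[1, 2] ** 2))
        z = 0.5 * np.log((1 + pc) / (1 - pc)) * np.sqrt(n - 4)
        if abs(z) < z_crit:
            return False        # conditionally independent: not a skeleton edge
    return True

skeleton = {e for e in moral if keeps_edge(min(e), max(e))}
```

On a chain, conditioning on the middle node separates non-adjacent pairs, so step 2 removes exactly the kind of spurious moral-graph edges step 1 can introduce.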
Affiliation(s)
- Min Jin Ha
- Department of Biostatistics, MD Anderson Cancer Center, Houston, Texas, 77030, U.S.A
- Wei Sun
- Department of Biostatistics, Department of Genetics, UNC Chapel Hill, North Carolina, 27514, U.S.A.; Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Jichun Xie
- Department of Biostatistics & Bioinformatics, Duke University, Durham, North Carolina, 27708, U.S.A
36
Sabourin JA, Valdar W, Nobel AB. A permutation approach for selecting the penalty parameter in penalized model selection. Biometrics 2015; 71:1185-94. [PMID: 26243050; DOI: 10.1111/biom.12359]
Abstract
We describe a simple, computationally efficient, permutation-based procedure for selecting the penalty parameter in LASSO-penalized regression. The procedure, permutation selection, is intended for applications where variable selection is the primary focus and can be applied in a variety of structural settings, including generalized linear models. We briefly discuss connections between permutation selection and existing theory for the LASSO. In addition, we present a simulation study and an analysis of real biomedical data sets in which permutation selection is compared with selection based on cross-validation (CV), the Bayesian information criterion (BIC), scaled sparse linear regression, and a method based on recently developed testing procedures for the LASSO.
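For centered, standardized data the smallest penalty that zeroes every lasso coefficient has the closed form max_j |x_j'y|/n, so the permutation idea can be sketched directly: compute this null quantity over many permutations of the response and use a summary of it as the selected penalty. The median and all problem sizes below are illustrative choices, not the paper's recommendations.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(9)
n, p = 100, 50
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize columns
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def lambda_zero(X, y):
    """Smallest penalty at which the lasso (in sklearn's 1/(2n) scaling)
    sets every coefficient to zero."""
    yc = y - y.mean()
    return np.max(np.abs(X.T @ yc)) / len(y)

# permutation selection sketch: the null distribution of lambda_zero under
# permuted responses, summarized (here by the median) as the chosen penalty
lams = [lambda_zero(X, rng.permutation(y)) for _ in range(200)]
lam_perm = float(np.median(lams))

fit = Lasso(alpha=lam_perm).fit(X, y)
selected = np.flatnonzero(fit.coef_)
```

Because the penalty is calibrated to what pure noise can produce, variables surviving it carry signal beyond chance association.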
Affiliation(s)
- Jeremy A Sabourin
- Department of Genetics, University of North Carolina at Chapel Hill, North Carolina, U.S.A.; Genometrics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, U.S.A
- William Valdar
- Department of Genetics, University of North Carolina at Chapel Hill, North Carolina, U.S.A.; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, North Carolina, U.S.A
- Andrew B Nobel
- Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, North Carolina, U.S.A.; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, North Carolina, U.S.A.; Department of Biostatistics, University of North Carolina at Chapel Hill, North Carolina, U.S.A
37
Neely ML, Bondell HD, Tzeng JY. A penalized likelihood approach for investigating gene-drug interactions in pharmacogenetic studies. Biometrics 2015; 71:529-37. [PMID: 25604216; DOI: 10.1111/biom.12259]
Abstract
Pharmacogenetics investigates the relationship between heritable genetic variation and variation in how individuals respond to drug therapies. Often, gene-drug interactions play a primary role in this response, and identifying these effects can aid the development of individualized treatment regimes. Haplotypes can hold key information for understanding the association between genetic variation and drug response. However, the standard approach for haplotype-based association analysis does not directly address the research questions dictated by individualized medicine: a complementary post-hoc analysis is required, which is usually underpowered after adjusting for multiple comparisons and may lead to seemingly contradictory conclusions. In this work, we propose a penalized likelihood approach that overcomes the drawbacks of the standard approach and yields the desired personalized output. We demonstrate the utility of our method by applying it to the Scottish Randomized Trial in Ovarian Cancer. We also conducted simulation studies showing that the proposed penalized method has comparable or greater power than the standard approach and maintains low Type I error rates for both binary and quantitative drug responses. The largest performance gains are seen when the haplotype frequency is low, the differences in effect sizes are small, or the true relationship among the drugs is more complex.
Affiliation(s)
- Megan L Neely
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, 27705, U.S.A
- Howard D Bondell
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, U.S.A
- Jung-Ying Tzeng
- Department of Statistics, North Carolina State University, Raleigh, North Carolina, 27695, U.S.A.; Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, 27695, U.S.A
38
Chaturvedi N, de Menezes RX, Goeman JJ. Fused lasso algorithm for Cox' proportional hazards and binomial logit models with application to copy number profiles. Biom J 2014; 56:477-92. [PMID: 24496763; DOI: 10.1002/bimj.201200241]
Abstract
This paper presents an efficient algorithm, based on a combination of Newton-Raphson and gradient ascent, for using the fused lasso regression method to construct a genome-based classifier. The characteristic structure of copy number data suggests that feature selection should take genomic location into account to produce more interpretable genome-based classifiers. The fused lasso penalty, an extension of the lasso penalty, encourages sparsity of the coefficients and of their differences by penalizing the L1-norm of both at the same time, thereby using genomic location. The major advantage of the algorithm over other existing fused lasso optimization techniques is its ability to handle binomial as well as survival responses efficiently. We apply our algorithm to two publicly available datasets to predict survival and binary outcomes.
Affiliation(s)
- Nimisha Chaturvedi
- Epidemiology and Biostatistics, Vrije Universiteit Medical Center, Amsterdam, The Netherlands; Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA Nijmegen, The Netherlands
39
Song R, Yi F, Zou H. On Varying-coefficient Independence Screening for High-dimensional Varying-coefficient Models. Stat Sin 2014; 24:1735-1752. [PMID: 25484548; PMCID: PMC4251601]
Abstract
Varying-coefficient models have been widely used in longitudinal data analysis, nonlinear time series, survival analysis, and other settings. They are natural nonparametric extensions of the classical linear models in many contexts, retaining good interpretability while allowing exploration of the dynamic nature of the model. Recently, penalized estimators have been used to fit varying-coefficient models to high-dimensional data. In this paper, we propose a new computationally attractive algorithm called IVIS for fitting varying-coefficient models in ultra-high dimensions. The algorithm first fits a gSCAD-penalized varying-coefficient model using a subset of covariates selected by a new varying-coefficient independence screening (VIS) technique, for which the sure screening property is established. The proposed algorithm then iterates between a greedy conditional VIS step and a gSCAD-penalized fitting step. Simulations and a real data analysis demonstrate that IVIS has very competitive performance for moderate sample sizes and high dimensions.
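A bare-bones version of the marginal screening step: for each covariate, fit the response on a basis expansion in the index variable multiplied by that covariate, and rank covariates by residual sum of squares. A cubic polynomial stands in for the B-spline basis, and the gSCAD fitting and iteration of IVIS are omitted; all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 400, 50
u = rng.uniform(size=n)                   # index variable (e.g., time or exposure)
X = rng.normal(size=(n, p))
# true model: two varying coefficients, the remaining covariates inactive
y = np.sin(2 * np.pi * u) * X[:, 0] + (1 + u) * X[:, 1] + rng.normal(size=n)

B = np.vander(u, 4)                       # cubic polynomial basis in u

def marginal_rss(j):
    """Least-squares fit of y on b(u) * x_j; smaller RSS = stronger marginal signal."""
    Z = B * X[:, [j]]                     # basis columns multiplied by covariate j
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.sum((y - Z @ coef) ** 2))

rank = np.argsort([marginal_rss(j) for j in range(p)])
screened = set(rank[:10].tolist())        # keep the 10 strongest covariates
```

Each marginal fit is a tiny least-squares problem, which is why screening of this kind scales to ultra-high dimensions before any penalized fitting is attempted.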
Affiliation(s)
- Rui Song
- North Carolina State University and University of Minnesota
- Feng Yi
- North Carolina State University and University of Minnesota
- Hui Zou
- North Carolina State University and University of Minnesota

40
Pan W, Shen X, Liu B. Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty. J Mach Learn Res 2013;14:1865. [PMID: 24358018; PMCID: PMC3866036]
Abstract
Cluster analysis is widely used in many fields. Traditionally, clustering is regarded as unsupervised learning because it lacks the class label or quantitative response variable present in supervised learning such as classification and regression. Here we formulate clustering as penalized regression with grouping pursuit. Beyond the novel use of a non-convex group penalty and its unique operating characteristics in the proposed clustering method, a main advantage of this formulation is that it allows borrowing well-established results from classification and regression, such as model selection criteria for choosing the number of clusters, a difficult problem in cluster analysis. In particular, we propose using generalized cross-validation (GCV) based on generalized degrees of freedom (GDF) to select the number of clusters. We use a few simple numerical examples to compare the proposed method with some existing approaches, demonstrating its promising performance.
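The "clustering as penalized regression" formulation can be sketched in a few lines of Python. This brute-force toy gives every observation its own mean and charges a flat (non-convex) cost per pair of distinct cluster centers; the paper instead uses a truncated-L1 grouping penalty with an efficient algorithm, so treat this only as an illustration of the objective.

```python
def partitions(items):
    """Enumerate all set partitions of a list (feasible only for tiny n)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [part[i] + [first]] + part[i + 1:]
        yield [[first]] + part

def objective(x, part, lam):
    """Penalized-regression view of clustering: each point i gets a fitted
    mean mu_i (the mean of its block), the loss is sum (x_i - mu_i)^2, and
    a flat cost is charged for every pair of distinct centers."""
    loss = sum((x[i] - sum(x[j] for j in block) / len(block)) ** 2
               for block in part for i in block)
    k = len(part)
    return loss + lam * k * (k - 1) / 2

def cluster(x, lam):
    """Return the partition minimizing the penalized objective."""
    return min(partitions(list(range(len(x)))), key=lambda p: objective(x, p, lam))
```

Because the number of distinct fitted means equals the number of clusters, model selection criteria from regression (here, the penalty level `lam`; in the paper, GCV/GDF) directly control the cluster count.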
Affiliation(s)
- Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
- Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, MN 55455
- Binghui Liu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN 55455
- School of Statistics, University of Minnesota, Minneapolis, MN 55455

41
Liu J, Wang K, Ma S, Huang J. Accounting for linkage disequilibrium in genome-wide association studies: A penalized regression method. Stat Interface 2013;6:99-115. [PMID: 25258655; PMCID: PMC4172344; DOI: 10.4310/sii.2013.v6.n1.a10]
Abstract
Penalized regression methods are becoming increasingly popular in genome-wide association studies (GWAS) for identifying genetic markers associated with disease. However, standard penalized methods such as the LASSO do not take into account the possible linkage disequilibrium between adjacent markers. We propose a novel penalized approach for GWAS using a dense set of single nucleotide polymorphisms (SNPs). The proposed method uses the minimax concave penalty (MCP) for marker selection and incorporates linkage disequilibrium (LD) information by penalizing the difference of the genetic effects at adjacent SNPs with high correlation. A coordinate descent algorithm is derived to implement the proposed method; it is efficient for a large number of SNPs. A multi-split method is used to calculate the p-values of the selected SNPs for assessing their significance. We refer to the proposed penalty function as the smoothed MCP and the proposed approach as the SMCP method. The performance of the SMCP method is evaluated and compared with the LASSO and MCP approaches through simulation studies, which demonstrate that the proposed method is more accurate in selecting associated SNPs. Its applicability to real data is illustrated using heterogeneous stock mice data and a rheumatoid arthritis data set.
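A small Python sketch of the penalty components may help. The MCP formula below is standard; the smoothing term is one plausible reading of the abstract (correlation-weighted squared differences of absolute adjacent effects) rather than the paper's exact definition, and the weighting by `r[j]` is an illustrative assumption.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty (MCP) at t: lam*|t| - t^2/(2*gamma) while
    |t| <= gamma*lam, then constant at gamma*lam^2/2, so large effects
    are not over-shrunk."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return gamma * lam * lam / 2.0

def smcp(beta, r, lam1, lam2, gamma=3.0):
    """Illustrative smoothed MCP: MCP sparsity on each coefficient plus a
    correlation-weighted smoothing term on adjacent absolute effects.
    r[j] is the correlation between SNP j and SNP j+1 (assumed weighting)."""
    sparsity = sum(mcp(b, lam1, gamma) for b in beta)
    smooth = lam2 * sum(abs(r[j]) * (abs(beta[j]) - abs(beta[j + 1])) ** 2
                        for j in range(len(beta) - 1))
    return sparsity + smooth
```

Under this form, two adjacent SNPs in strong LD with equal-magnitude effects incur no smoothing cost, which is the sense in which LD information is incorporated.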
Affiliation(s)
- Jin Liu
- School of Public Health, Yale University, New Haven, CT 06520, USA
- Kai Wang
- Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT 06520, USA
- Jian Huang
- Department of Statistics and Actuarial Science, Department of Biostatistics, University of Iowa, Iowa City, IA 52242, USA

42
Abstract
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most reviews in the literature cover conventional methods, we focus on recently developed methods that have proven successful for high dimensional variable selection. First, we give a brief overview of the traditional model selection criteria (viz. Mallows' Cp, AIC, BIC, DIC), followed by a discussion of some recently developed methods (viz. EBIC, regularization) that have occupied the minds of many statisticians. Then, we review high dimensional Bayesian methods with particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviors of Bayesian variable selection methods for high dimensional linear models under different regularity conditions.
Affiliation(s)
- Himel Mallick
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA
- Nengjun Yi
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA

43
Abstract
The semiparametric partially linear model allows flexible modeling of covariate effects on the response variable in regression. It combines the flexibility of nonparametric regression with the parsimony of linear regression. Existing estimation methods for this model crucially assume that it is known a priori which covariates have a linear effect and which do not; in applied work, this is rarely known in advance. We consider estimation in partially linear models without this assumption. We propose a semiparametric regression pursuit method for identifying the covariates with a linear effect. Our proposed method is a penalized regression approach using a group minimax concave penalty. Under suitable conditions we show that the proposed approach is model-pursuit consistent, meaning that with high probability it correctly determines which covariates have a linear effect and which do not. The performance of the proposed method is evaluated using simulation studies, which support our theoretical results. A real data example is used to illustrate the application of the proposed method.
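The group minimax concave penalty used for this structure pursuit can be sketched as follows. This is an illustrative Python rendering, not the authors' implementation: MCP is applied to the Euclidean norm of each coefficient group, where a group holds the basis coefficients of one covariate's nonlinear part, so a group driven exactly to zero classifies that covariate as linear.

```python
import math

def mcp(t, lam, gamma):
    """Minimax concave penalty at t >= 0."""
    t = abs(t)
    if t <= gamma * lam:
        return lam * t - t * t / (2.0 * gamma)
    return gamma * lam * lam / 2.0

def group_mcp(groups, lam, gamma=3.0):
    """Group MCP: MCP applied to the Euclidean norm of each group of
    coefficients. A zero group norm means the covariate's nonlinear part
    vanishes, i.e. the covariate is assigned a linear effect."""
    return sum(mcp(math.sqrt(sum(b * b for b in g)), lam, gamma) for g in groups)
```

Because the penalty acts on whole groups, selection happens at the level of "linear vs. nonlinear effect" rather than individual basis coefficients.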
44
Abstract
We develop an approach to tuning penalized regression variable selection methods by calculating the sparsest estimator contained in a confidence region of a specified level. Because confidence intervals/regions are widely understood, tuning penalized regression methods in this way is intuitive and accessible to scientists and practitioners. More importantly, our work shows that tuning to a fixed confidence level often performs better than tuning via the common methods based on AIC, BIC, or cross-validation (CV) over a wide range of sample sizes and levels of sparsity. Additionally, we prove that tuning with a sequence of confidence levels converging to one yields asymptotic selection consistency, and that a simple two-stage procedure achieves an oracle property. The confidence region based tuning parameter is easily calculated using output from existing penalized regression computer packages. Our work also shows how to map any penalty parameter to a corresponding confidence coefficient. This mapping facilitates comparisons of tuning parameter selection methods such as AIC, BIC and CV, and reveals that the resulting tuning parameters correspond to confidence levels that are extremely low and can vary greatly across data sets. Supplemental materials for the article are available online.
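The core idea, picking the sparsest estimator still inside a confidence region, can be sketched in Python. This is a simplification of the paper's procedure: soft-thresholding the OLS fit stands in for a lasso path (exact only for orthonormal designs), and `radius` is a stand-in for the chi-square quantile that would define the region at a chosen confidence level.

```python
import numpy as np

def sparsest_in_region(X, y, lams, radius):
    """Walk a soft-thresholding path from sparsest (largest lambda) to
    densest, and return the first estimate whose residual sum of squares
    lies inside the confidence region {beta : RSS(beta) <= RSS(OLS) + radius}."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_ols = float(np.sum((y - X @ beta_ols) ** 2))
    for lam in sorted(lams, reverse=True):  # largest lambda = sparsest first
        beta = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - lam, 0.0)
        if np.sum((y - X @ beta) ** 2) <= rss_ols + radius:
            return beta
    return beta_ols
```

Raising the confidence level enlarges `radius` and admits sparser estimates, which is the mapping between penalty parameters and confidence coefficients the abstract describes.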
Affiliation(s)
- Funda Gunes
- Department of Statistics, North Carolina State University