1
Yue M, Li J, Ma S. Sparse boosting for high-dimensional survival data with varying coefficients. Stat Med 2018; 37:789-800. [PMID: 29152776 DOI: 10.1002/sim.7544] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 05/25/2017] [Revised: 09/04/2017] [Accepted: 10/06/2017] [Indexed: 12/26/2022]
Abstract
Motivated by high-throughput profiling studies in biomedical research, variable selection methods have been a focus for biostatisticians. In this paper, we consider semiparametric varying-coefficient accelerated failure time models for right censored survival data with high-dimensional covariates. Instead of adopting the traditional regularization approaches, we offer a novel sparse boosting (SparseL2 Boosting) algorithm to conduct model-based prediction and variable selection. One main advantage of this new method is that we do not need to perform the time-consuming selection of tuning parameters. Extensive simulations are conducted to examine the performance of our sparse boosting feature selection techniques. We further illustrate our methods using a lung cancer data analysis.
Affiliation(s)
- Mu Yue
- Department of Statistics and Applied Probability, National University of Singapore, Singapore
- Jialiang Li
- Department of Statistics and Applied Probability, National University of Singapore, Singapore; Duke-NUS Graduate Medical School, Singapore; Singapore Eye Research Institute, Singapore
- Shuangge Ma
- School of Public Health, Yale University, 60 College ST, LEPH 206, New Haven, 06520, CT, USA
2
Yue M, Li J. Improvement Screening for Ultra-High Dimensional Data with Censored Survival Outcomes and Varying Coefficients. Int J Biostat 2017; 13(1). [PMID: 28541925 DOI: 10.1515/ijb-2017-0024] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 02/05/2023]
Abstract
Motivated by risk prediction studies with ultra-high dimensional biomarkers, we propose a novel improvement screening methodology. Accurate risk prediction can be quite useful for patient treatment selection, prevention strategy, or disease management in evidence-based medicine. The question of how to choose new markers in addition to the conventional ones is especially important. In the past decade, a number of new measures for quantifying the added value of new markers have been proposed, among which the integrated discrimination improvement (IDI) and net reclassification improvement (NRI) stand out. Meanwhile, C-statistics are routinely used to quantify the capacity of the estimated risk score to discriminate among subjects with different event times. In this paper, we examine these improvement statistics as well as the norm-based approach for evaluating the incremental value of new markers, and we compare these four measures by analyzing ultra-high dimensional censored survival data. In particular, we consider Cox proportional hazards models with varying coefficients. All measures perform very well in simulations, and we illustrate our methods in an application to a lung cancer study.
3
Sa J, Liu X, He T, Liu G, Cui Y. A Nonlinear Model for Gene-Based Gene-Environment Interaction. Int J Mol Sci 2016; 17:E882. [PMID: 27271617 DOI: 10.3390/ijms17060882] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/08/2016] [Revised: 05/07/2016] [Accepted: 05/21/2016] [Indexed: 11/16/2022] Open
Abstract
A vast amount of literature has confirmed the role of gene-environment (G×E) interaction in the etiology of complex human diseases. Traditional methods are predominantly focused on the analysis of interaction between a single nucleotide polymorphism (SNP) and an environmental variable. Given that genes are the functional units, it is crucial to understand how gene effects (rather than single SNP effects) are influenced by an environmental variable to affect disease risk. Motivated by the increasing awareness of the power of gene-based association analysis over single-variant-based approaches, in this work we proposed a sparse principal component regression (sPCR) model to understand the gene-based G×E interaction effect on complex disease. We first extracted the sparse principal components for SNPs in a gene; the effect of each principal component was then modeled by a varying-coefficient (VC) model. The model can jointly model variants in a gene whose effects are nonlinearly influenced by an environmental variable. In addition, the varying-coefficient sPCR (VC-sPCR) model is readily interpretable, since the sparsity of the principal component loadings indicates the relative importance of the corresponding SNPs in each component. We applied our method to a human birth weight dataset from a Thai population. We analyzed 12,005 genes across 22 chromosomes and found one significant interaction effect using the Bonferroni correction method and one suggestive interaction. The model performance was further evaluated through simulation studies. Our model provides a system approach to evaluate gene-based G×E interaction.
4
Chen T, Ma Y, Wang Y. Predicting cumulative risk of disease onset by redistributing weights. Stat Med 2015; 34:2427-43. [PMID: 25847392 PMCID: PMC4457675 DOI: 10.1002/sim.6499] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/17/2013] [Revised: 02/23/2015] [Accepted: 03/12/2015] [Indexed: 11/09/2022]
Abstract
We propose a simple approach to predicting the cumulative risk of disease that accommodates predictors with time-varying effects and outcomes subject to censoring. We use a nonparametric function for the coefficient of the time-varying effect and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is extremely convenient and can be implemented by standard software. We prove large sample properties of the proposed estimator and evaluate its finite sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing Huntington's disease (HD) in subjects with a huntingtin gene mutation, using data from a large collaborative HD study, and illustrate an inverse relationship between the cumulative risk of HD and the length of cytosine-adenine-guanine repeats in the huntingtin gene.
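The idea of redistributing the probability mass of censored outcomes to the right can be illustrated with the classical redistribute-to-the-right scheme for right-censored data without covariates. The sketch below is not the estimator proposed in the article: it omits the time-varying coefficient entirely, and the function name `redistribute_to_the_right` and the toy data are assumptions made for illustration only.

```python
def redistribute_to_the_right(times, events):
    """Classical redistribute-to-the-right weights for right-censored data.

    times  : observed times (event or censoring time)
    events : 1 if the event was observed, 0 if the observation was censored
    Returns a list of (time, mass) pairs for the observed events.
    """
    n = len(times)
    # Sort by time; at tied times, events are placed before censored observations.
    order = sorted(range(n), key=lambda i: (times[i], -events[i]))
    # Every observation starts with equal probability mass 1/n.
    mass = [1.0 / n] * n
    for pos, i in enumerate(order):
        if events[i] == 0:
            right = order[pos + 1:]
            # A censored observation hands its mass, in equal shares,
            # to all observations to its right. If nothing lies to the
            # right, the mass is left where it is.
            if right:
                share = mass[i] / len(right)
                for j in right:
                    mass[j] += share
                mass[i] = 0.0
    return [(times[i], mass[i]) for i in order if events[i] == 1]


# Toy data: the second subject is censored at time 2.
weights = redistribute_to_the_right([1, 2, 3, 4], [1, 0, 1, 1])
print(weights)  # [(1, 0.25), (3, 0.375), (4, 0.375)]
```

In this toy example the resulting masses match the jumps of the Kaplan-Meier estimator, which is the usual way to sanity-check a redistribution scheme of this kind.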
Affiliation(s)
- Yanyuan Ma
- Department of Statistics, Texas A&M University
- Yuanjia Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University
5
Moore CM, MaWhinney S, Forster JE, Carlson NE, Allshouse A, Wang X, Routy JP, Conway B, Connick E. Accounting for dropout reason in longitudinal studies with nonignorable dropout. Stat Methods Med Res 2015; 26:1854-1866. [PMID: 26078357 DOI: 10.1177/0962280215590432] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 01/08/2023]
Abstract
Dropout is a common problem in longitudinal cohort studies and clinical trials, often raising concerns of nonignorable dropout. Selection, frailty, and mixture models have been proposed to account for potentially nonignorable missingness by relating the longitudinal outcome to time of dropout. In addition, many longitudinal studies encounter multiple types of missing data or reasons for dropout, such as loss to follow-up, disease progression, treatment modifications and death. When clinically distinct dropout reasons are present, it may be preferable to control for both dropout reason and time to gain additional clinical insights. This may be especially interesting when the dropout reason and dropout times differ by the primary exposure variable. We extend a semi-parametric varying-coefficient method for nonignorable dropout to accommodate dropout reason. We apply our method to untreated HIV-infected subjects recruited to the Acute Infection and Early Disease Research Program HIV cohort and compare longitudinal CD4+ T cell count in injection drug users to nonusers with two dropout reasons: anti-retroviral treatment initiation and loss to follow-up.
Affiliation(s)
- Camille M Moore
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Samantha MaWhinney
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Jeri E Forster
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA; 2 Veterans Integrated Service Network 19, Mental Illness Research Education and Clinical Center, Denver VA Medical Center, Denver, CO, USA
- Nichole E Carlson
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Amanda Allshouse
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA
- Xinshuo Wang
- 1 Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Aurora, CO, USA; 3 Department of Epidemiology and Biostatistics, College of Public Health, University of Georgia, Athens, GA, USA
- Jean-Pierre Routy
- 4 Division of Hematology and Chronic Viral Illness Service, McGill University, Montreal, Quebec, Canada
- Brian Conway
- 5 Vancouver Infectious Diseases Centre, Vancouver, British Columbia, Canada
- Elizabeth Connick
- 6 Division of Infectious Diseases, University of Colorado Denver, Aurora, CO, USA
6
Ma Y, Wang Y. Nonparametric modeling and analysis of association between Huntington's disease onset and CAG repeats. Stat Med 2013; 33:1369-82. [PMID: 24027120 DOI: 10.1002/sim.5971] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/16/2012] [Accepted: 08/21/2013] [Indexed: 11/09/2022]
Abstract
Huntington's disease (HD) is a neurodegenerative disorder with a dominant genetic mode of inheritance caused by an expansion of CAG repeats on chromosome 4. Typically, a longer CAG repeat length is associated with an increased risk of earlier onset of HD. Previous studies of the association between HD onset age and CAG length have favored a logistic model, where the CAG repeat length enters the mean and variance components of the logistic model in a complex exponential-linear form. To relax the parametric assumption of the exponential-linear association to the true HD onset distribution, we propose to leave both the mean and variance functions of the CAG repeat length unspecified and perform semiparametric estimation in this context through a local kernel and backfitting procedure. Motivated by the HD family history information available on the family members of participants in the Cooperative Huntington's Observational Research Trial (COHORT), we develop the methodology in the context of mixture data, where some subjects have a positive probability of being risk free. We also allow censoring on the age at onset of disease and accommodate covariates other than the CAG length. We study the theoretical properties of the proposed estimator and derive its asymptotic distribution. Finally, we apply the proposed methods to the COHORT data to estimate the HD onset distribution using a group of study participants and the disease family history information available on their family members.
Affiliation(s)
- Yanyuan Ma
- Department of Statistics, Texas A&M University, College Station, TX, U.S.A
7
Xie M, Simpson DG, Carroll RJ. Semiparametric Analysis of Heterogeneous Data Using Varying-Scale Generalized Linear Models. J Am Stat Assoc 2008; 103:650-660. [PMID: 19444331 PMCID: PMC2681270 DOI: 10.1198/016214508000000210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
This article describes a class of heteroscedastic generalized linear regression models in which a subset of the regression parameters are rescaled nonparametrically, and develops efficient semiparametric inferences for the parametric components of the models. Such models provide a means to adapt for heterogeneity in the data due to varying exposures, varying levels of aggregation, and so on. The class of models considered includes generalized partially linear models and nonparametrically scaled link function models as special cases. We present an algorithm to estimate the scale function nonparametrically, and obtain asymptotic distribution theory for regression parameter estimates. In particular, we establish that the asymptotic covariance of the semiparametric estimator for the parametric part of the model achieves the semiparametric lower bound. We also describe a bootstrap-based goodness-of-scale test. We illustrate the methodology with simulations, published data, and data from collaborative research on ultrasound safety.
Affiliation(s)
- Minge Xie
- Associate Professor and Director of Office of Statistical Consulting, Department of Statistics, Rutgers University, Piscataway, NJ 08854
- Douglas G. Simpson
- Professor and Chair, Department of Statistics, University of Illinois, Champaign, IL 61820
- Raymond J. Carroll
- Distinguished Professor, Department of Statistics, Texas A&M University, College Station, TX 77843