1. Predicting pediatric severe asthma exacerbations: an administrative claims-based predictive model. J Asthma 2024; 61:203-211. [PMID: 37725084] [DOI: 10.1080/02770903.2023.2260881]
Abstract
OBJECTIVE Previous machine learning approaches have failed to consider race, ethnicity, and social determinants of health (SDOH) when predicting childhood asthma exacerbations. We developed a predictive model for asthma exacerbations in children to explore the importance of race and ethnicity, rural-urban commuting area (RUCA) codes, the Child Opportunity Index (COI), and other ICD-10 SDOH codes in predicting asthma outcomes. METHODS Insurance and coverage claims data from the Arkansas All-Payer Claims Database were used to capture risk factors. We identified a cohort of 22,631 children with asthma aged 5-18 years with 2 years of continuous Medicaid enrollment and at least one asthma diagnosis in 2018. The goal was to predict asthma-related hospitalizations and asthma-related emergency department (ED) visits in 2019. The analytic sample was 59% age 5-11 years, 39% White, 33% Black, and 6% Hispanic. Conditional random forest models were used for training. RESULTS The model yielded an area under the curve (AUC) of 72%, sensitivity of 55%, and specificity of 78% in the out-of-bag (OOB) samples, and an AUC of 73%, sensitivity of 58%, and specificity of 77% in the training samples. Consistent with previous literature, asthma-related hospitalizations and ED visits in the previous year (2018) were the two most important variables in predicting hospital or ED use in the following year (2019), followed by the total number of reliever and controller medications. CONCLUSIONS Predictive models for asthma-related exacerbation achieved moderate accuracy, but race and ethnicity, ICD-10 SDOH codes, RUCA codes, and COI measures did not improve model accuracy.
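The reported classification metrics follow directly from their definitions. A minimal stand-alone sketch (with made-up labels and scores, not the study's data) of sensitivity, specificity, and AUC:

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity (recall on positives) and specificity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative (ties count one half): the rescaled Mann-Whitney U statistic."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores, thresholded at 0.5:
y = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
preds = [1 if s >= 0.5 else 0 for s in scores]
sens, spec = confusion_metrics(y, preds)   # 0.667, 0.75
area = auc(y, scores)                      # 0.833
```

In an OOB evaluation, `preds` and `scores` would come from trees that did not see each observation during training.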
2. Flexible variable selection in the presence of missing data. Int J Biostat 2024; ijb-2023-0059. [PMID: 38348882] [DOI: 10.1515/ijb-2023-0059]
Abstract
In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.
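The multiple-imputation-plus-selection workflow can be sketched in a few lines. This toy version (hot-deck imputation and a correlation-threshold selector standing in for the paper's nonparametric selector; all names and data are illustrative) keeps variables selected in a majority of imputed data sets:

```python
import random

def impute_once(rows, rng):
    """Fill each missing (None) entry by drawing from the observed values of
    the same column: a crude stand-in for a proper imputation model."""
    filled_cols = []
    for col in zip(*rows):
        observed = [v for v in col if v is not None]
        filled_cols.append([v if v is not None else rng.choice(observed) for v in col])
    return [list(r) for r in zip(*filled_cols)]

def select_variables(rows, y, threshold=0.5):
    """Toy selector: keep variables whose absolute correlation with y
    exceeds the threshold."""
    selected, n, my = set(), len(y), sum(y) / len(y)
    for j, col in enumerate(zip(*rows)):
        mx = sum(col) / n
        cov = sum((x - mx) * (t - my) for x, t in zip(col, y))
        vx = sum((x - mx) ** 2 for x in col) ** 0.5
        vy = sum((t - my) ** 2 for t in y) ** 0.5
        if vx > 0 and vy > 0 and abs(cov / (vx * vy)) > threshold:
            selected.add(j)
    return selected

def stable_panel(rows, y, m=20, keep_frac=0.5, seed=0):
    """Run the selector on m imputed data sets and keep variables chosen in
    at least keep_frac of them."""
    rng, counts = random.Random(seed), {}
    for _ in range(m):
        for j in select_variables(impute_once(rows, rng), y):
            counts[j] = counts.get(j, 0) + 1
    return {j for j, c in counts.items() if c >= m * keep_frac}

# Column 0 tracks y (with one missing value); column 1 is noise:
rows = [[0, 5], [1, 1], [None, 4], [3, 0], [4, 3], [5, 2]]
panel = stable_panel(rows, [0, 1, 2, 3, 4, 5])   # selects the signal column: {0}
```

Aggregating selections across imputations is what makes the panel robust to any single imputation draw.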
3. A new method for clustered survival data: Estimation of treatment effect heterogeneity and variable selection. Biom J 2024; 66:e2200178. [PMID: 38072661] [PMCID: PMC10953775] [DOI: 10.1002/bimj.202200178]
Abstract
We recently developed a new method, the random-intercept accelerated failure time model with Bayesian additive regression trees (riAFT-BART), to draw causal inferences about the population treatment effect on patient survival from clustered and censored survival data while accounting for the multilevel data structure. The practical utility of this method goes beyond estimation of the population average treatment effect. In this work, we show how riAFT-BART can be used to address two important statistical questions with clustered survival data: estimating treatment effect heterogeneity and performing variable selection. Leveraging likelihood-based machine learning, we describe how posterior samples of the individual survival treatment effect can be drawn from riAFT-BART model runs and used in an exploratory treatment effect heterogeneity analysis to identify subpopulations whose treatment effects may differ from the population average. The literature on variable selection methods for clustered and censored survival data, particularly methods using flexible modeling techniques, is sparse. We propose a permutation-based approach using each predictor's variable inclusion proportion supplied by the riAFT-BART model for variable selection. To address the missing data issue frequently encountered in health databases, we propose a strategy combining bootstrap imputation and riAFT-BART for variable selection among incomplete clustered survival data. We conduct an expansive simulation study to examine the practical operating characteristics of our proposed methods and provide empirical evidence that they perform better than several existing methods across a wide range of data scenarios. Finally, we demonstrate the methods via a case study of predictors of in-hospital mortality among severe COVID-19 patients and by estimating the heterogeneous treatment effects of three COVID-specific medications.
The methods developed in this work are readily available in the R package riAFTBART.
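The permutation-based selection idea can be sketched generically: compare a variable's observed importance against a null distribution obtained by recomputing it on permuted outcomes. In this illustrative version, absolute correlation stands in for the variable inclusion proportion supplied by riAFT-BART:

```python
import random

def score(x, y):
    """Stand-in importance score (absolute correlation); in the paper this
    role is played by the variable inclusion proportion from riAFT-BART."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (vx * vy)) if vx and vy else 0.0

def permutation_select(X_cols, y, n_perm=200, seed=1):
    """Select variables whose observed score exceeds the maximum score the
    same variable attains across permuted copies of the outcome."""
    rng, selected = random.Random(seed), []
    for j, col in enumerate(X_cols):
        observed = score(col, y)
        null = []
        for _ in range(n_perm):
            y_perm = y[:]
            rng.shuffle(y_perm)
            null.append(score(col, y_perm))
        if observed > max(null):
            selected.append(j)
    return selected
```

Using the maximum of the null distribution as the cutoff is one conservative choice; a high quantile is another common option.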
4. Change talk subtypes as predictors of alcohol use following brief motivational intervention. Psychol Addict Behav 2023; 37:875-885. [PMID: 36442021] [PMCID: PMC10225014] [DOI: 10.1037/adb0000898]
Abstract
OBJECTIVE To examine the relative importance of client change language subtypes as predictors of alcohol use following motivational interviewing (MI). METHOD Participants were 164 heavy drinkers (57.3% female, mean age 28.5 years, 13.4% Hispanic/Latinx, 82.9% White) recruited during an emergency department visit who received MI for alcohol and human immunodeficiency virus/sexual risk in a randomized controlled trial. MI sessions were coded with the motivational interviewing skill code (MISC) and the generalized behavioral intervention analysis system (GBIAS). Variable importance analyses used targeted maximum likelihood estimation to rank order change language subtypes defined by these systems as predictors of alcohol use over 9 months of follow-up. RESULTS Among GBIAS change language subtypes, higher sustain talk (ST) around change planning was ranked the most important predictor of drinks per week (b = -5.57, 95% CI [-8.11, -3.02]) and heavy drinking days (b = -2.07, 95% CI [-3.17, -0.98]); this talk reflected (a) rejection of alcohol abstinence as a desired change goal, (b) rejection of specific change strategies, or (c) discussion of anticipated challenges in changing drinking. Among MISC change language subtypes, higher ST around taking steps, reflecting recent escalations in drinking described by a small minority of participants, was ranked the most important predictor of drinks per week (b = 22.71, 95% CI [20.29, 25.13]) and heavy drinking days (b = 2.45, 95% CI [1.68, 3.21]). CONCLUSIONS Results challenge the assumption that all ST during MI is a negative prognostic indicator and highlight the importance of the context in which change language emerges.
5. Multi-Model Prediction of West Nile Virus Neuroinvasive Disease With Machine Learning for Identification of Important Regional Climatic Drivers. GeoHealth 2023; 7:e2023GH000906. [PMID: 38023388] [PMCID: PMC10654557] [DOI: 10.1029/2023gh000906]
Abstract
West Nile virus (WNV) is the leading cause of mosquito-borne illness in the continental United States (CONUS). Spatial heterogeneity in historical incidence, environmental factors, and complex ecology make prediction of spatiotemporal variation in WNV transmission challenging. Machine learning provides promising tools for identifying important variables in such situations. To predict annual WNV neuroinvasive disease (WNND) cases in CONUS (2015-2021), we fitted 10 probabilistic models ranging in complexity from a naïve model to machine learning algorithms and an ensemble. We made predictions in each of nine climate regions on a hexagonal grid and evaluated each model's predictive accuracy. Using the machine learning models (random forest and neural network), we identified the relative importance and variation in ranking of predictors (historical WNND cases, climate anomalies, human demographics, and land use) across regions. We found that historical WNND cases and population density were among the most important factors, while anomalies in temperature and precipitation often had relatively low importance. While the relative performance of each model varied across climatic regions, the magnitude of difference between models was small. All models except the naïve model had non-significant differences in performance relative to the baseline model (a negative binomial model fit per hexagon). No model, including the ensemble or the more complex machine learning models, outperformed models based on historical case counts at the hexagon or region level; these models are good forecasting benchmarks. Further work is needed to assess whether predictive capacity can be improved beyond these historical baselines.
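The historical baselines the study emphasizes are simple to implement. A sketch with hypothetical annual case counts (not the study's data) comparing a last-value forecast with a historical-mean forecast on one-step-ahead error:

```python
def naive_last_year(history):
    """Baseline forecast: next year's count equals the last observed year."""
    return history[-1]

def historical_mean(history):
    """Baseline forecast: mean of all observed years."""
    return sum(history) / len(history)

def mae(forecast_fn, series):
    """Mean absolute error of one-step-ahead forecasts over a case series."""
    errors = [abs(forecast_fn(series[:t]) - series[t]) for t in range(1, len(series))]
    return sum(errors) / len(errors)

# Hypothetical annual WNND counts for one hexagon:
cases = [12, 15, 9, 14, 13, 11, 16]
mae_naive = mae(naive_last_year, cases)   # about 3.67
mae_mean = mae(historical_mean, cases)    # about 2.54
```

Any candidate model would need to beat both of these numbers on held-out years to justify its extra complexity.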
6. Commentary: Modeling mortality risk in patients with severe COVID-19 from Mexico. Front Med (Lausanne) 2023; 10:1247741. [PMID: 37840999] [PMCID: PMC10568472] [DOI: 10.3389/fmed.2023.1247741]
7. Using Geospatial Data and Random Forest To Predict PFAS Contamination in Fish Tissue in the Columbia River Basin, United States. Environ Sci Technol 2023; 57:14024-14035. [PMID: 37669088] [PMCID: PMC10515492] [DOI: 10.1021/acs.est.3c03670]
Abstract
Decision makers in the Columbia River Basin (CRB) are currently challenged with identifying and characterizing the extent of per- and polyfluoroalkyl substances (PFAS) contamination and human exposure to PFAS. This work aims to develop and pilot a methodology to help decision makers target and prioritize sampling investigations and identify contaminated natural resources. Here we use random forest models to predict ∑PFAS in fish tissue; understanding PFAS levels in fish is particularly important in the CRB because fish can be a major component of tribal and Indigenous peoples' diets. Geospatial data, including land cover and distances to known or potential PFAS sources and industries, were leveraged as predictors for modeling. Models were developed and evaluated for Washington state and Oregon using the limited available empirical data. Mapped predictions show several areas where detectable concentrations of PFAS in fish tissue are predicted to occur but prior sampling has not yet confirmed them. Variable importance is analyzed to identify potentially important sources of PFAS in fish in this region. The cost-effective methodologies demonstrated here can help address the sparsity of existing PFAS occurrence data in environmental media in this and other regions while also giving insight into potentially important drivers and sources of PFAS in fish.
8. A Molecular Dynamics Simulation Study on Enhancement of Mechanical and Tribological Properties of Nitrile-Butadiene Rubber with Varied Contents of Acrylonitrile. Polymers (Basel) 2023; 15:3799. [PMID: 37765653] [PMCID: PMC10535401] [DOI: 10.3390/polym15183799]
Abstract
Molecular models of nitrile-butadiene rubber (NBR) with varied contents of acrylonitrile (ACN) were developed and investigated to provide an understanding of the enhancement mechanisms of ACN. The investigation used molecular dynamics (MD) simulations to calculate and predict the mechanical and tribological properties of NBR through the constant strain method and a shearing model. The MD simulation results showed that the mechanical properties of NBR increased until the ACN content reached 40%. The mechanism by which ACN enhances the strength of the rubber was investigated by assessing the binding energy, radius of gyration, mean square displacement, and free volume. The abrasion rate (AR) of NBR was calculated using Fe-NBR-Fe models during the friction processes. The wear results of the atomistic simulations indicated that NBR with 40% ACN content had the best tribological properties due to the synergy among appropriate polarity, rigidity, and chain length of the NBR molecules. In addition, a random forest regression model predicting AR was developed from the feature parameters extracted from the MD models, and its variable importances were used to identify the parameters most highly correlated with AR. The torsion-bend-bend energy was thus identified and used to successfully predict the AR trend for new NBR models with other acrylonitrile contents.
9. Impacts of COVID-19 on Stress in Middle School Teachers and Staff in Minnesota: An Exploratory Study Using Random Forest Analysis. Int J Environ Res Public Health 2023; 20:6698. [PMID: 37681838] [PMCID: PMC10487626] [DOI: 10.3390/ijerph20176698]
Abstract
While the COVID-19 pandemic has negatively impacted many occupations, teachers and school staff have faced unique challenges related to remote and hybrid teaching, less contact with students, and general uncertainty. This study aimed to measure the associations between specific impacts of the COVID-19 pandemic and stress levels in Minnesota educators. A total of 296 teachers and staff members from eight middle schools completed online surveys between May and July of 2020. The Epidemic Pandemic Impacts Inventory (EPII) measured the effects of the COVID-19 pandemic across nine domains (e.g., Economic, Home Life). The Kessler-6 scale measured non-specific stress (range: 0-24), with higher scores indicating greater stress. Random forest analysis determined which items of the EPII were predictive of stress. The average Kessler-6 score was 6.8, indicating moderate stress. Three EPII items explained the largest amount of variation in the Kessler-6 score: an increase in mental health problems or symptoms, a hard time making the transition to working from home, and an increase in sleep problems or poor sleep quality. These findings indicate potential areas for intervention to reduce employee stress in the event of future disruptions to in-person teaching or other major transitions during dynamic times.
10. Predicting Norovirus in England Using Existing and Emerging Syndromic Data: Infodemiology Study. J Med Internet Res 2023; 25:e37540. [PMID: 37155231] [DOI: 10.2196/37540]
Abstract
BACKGROUND Norovirus is associated with approximately 18% of the global burden of gastroenteritis and affects all age groups. There is currently no licensed vaccine or available antiviral treatment. However, well-designed early warning systems and forecasting can guide nonpharmaceutical approaches to norovirus infection prevention and control. OBJECTIVE This study evaluates the predictive power of existing syndromic surveillance data and emerging data sources, such as internet searches and Wikipedia page views, to predict norovirus activity across a range of age groups and regions of England. METHODS We used existing and emerging syndromic data to predict laboratory data indicating norovirus activity. Two methods were used to evaluate the predictive potential of syndromic variables. First, the Granger causality framework was used to assess whether individual variables precede changes in norovirus laboratory reports in a given region or age group. Then, we used random forest modeling to estimate the importance of each variable in the context of the others with two metrics: (1) change in mean square error and (2) node purity. Finally, these results were combined into a visualization indicating the most influential predictors of norovirus laboratory reports in a specific age group and region. RESULTS Our results suggest that syndromic surveillance data include valuable predictors of norovirus laboratory reports in England. However, Wikipedia page views are less likely to provide prediction improvements on top of Google Trends and existing syndromic data. Predictors displayed varying relevance across age groups and regions. For example, random forest modeling based on selected existing and emerging syndromic variables explained 60% of the variance in the ≥65 years age group, 42% in the East of England, but only 13% in the South West region. Emerging data sets highlighted relative search volumes, including "flu symptoms," "norovirus in pregnancy," and norovirus activity in specific years, such as "norovirus 2016." Symptoms of vomiting and gastroenteritis in multiple age groups were identified as important predictors within existing data sources. CONCLUSIONS Existing and emerging data sources can help predict norovirus activity in England in some age groups and geographic regions, particularly through predictors concerning vomiting, gastroenteritis, and norovirus in vulnerable populations and historical terms such as "stomach flu." However, syndromic predictors were less relevant in some age groups and regions, likely due to contrasting public health practices between regions and health information-seeking behavior between age groups. Additionally, predictors relevant to one norovirus season may not contribute to other seasons. Data biases, such as low spatial granularity in Google Trends and especially in Wikipedia data, also play a role in the results. Moreover, internet searches can provide insight into mental models, that is, individuals' conceptual understanding of norovirus infection and transmission, which could be used in public health communication strategies.
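A crude one-lag sketch of the Granger-style screen (illustrative, not the study's implementation): ask whether adding the lag of a candidate series x reduces the in-sample residual error of an autoregression of the laboratory series y. All series below are made up.

```python
def centered(v):
    """Subtract the mean (a simple substitute for fitting an intercept)."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def sse_ar1(y):
    """Residual sum of squares of y_t regressed on y_{t-1} (centered)."""
    yt, yl = centered(y[1:]), centered(y[:-1])
    b = sum(a * c for a, c in zip(yt, yl)) / sum(c * c for c in yl)
    return sum((a - b * c) ** 2 for a, c in zip(yt, yl))

def sse_ar1_plus_x(y, x):
    """Residual sum of squares when the lag of x is added as a regressor,
    solving the 2x2 normal equations directly."""
    yt, yl, xl = centered(y[1:]), centered(y[:-1]), centered(x[:-1])
    s11 = sum(c * c for c in yl)
    s22 = sum(d * d for d in xl)
    s12 = sum(c * d for c, d in zip(yl, xl))
    r1 = sum(a * c for a, c in zip(yt, yl))
    r2 = sum(a * d for a, d in zip(yt, xl))
    det = s11 * s22 - s12 * s12
    b1 = (r1 * s22 - r2 * s12) / det
    b2 = (r2 * s11 - r1 * s12) / det
    return sum((a - b1 * c - b2 * d) ** 2 for a, c, d in zip(yt, yl, xl))

def granger_gain(y, x):
    """Fractional drop in residual error from adding lagged x: a crude
    screen for whether x 'precedes' changes in y."""
    return 1 - sse_ar1_plus_x(y, x) / sse_ar1(y)

x = [1, 3, 2, 5, 4, 6, 5, 7, 6, 8]   # hypothetical search-volume series
y = [0] + x[:-1]                     # y lags x exactly, so the gain is near 1
gain = granger_gain(y, x)
```

A formal test would compare the two residual sums of squares with an F statistic rather than a raw gain.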
11. Artificial intelligence to investigate predictors and prognostic impact of time to surgery in colon cancer. J Surg Oncol 2023; 127:966-974. [PMID: 36840925] [DOI: 10.1002/jso.27224]
Abstract
BACKGROUND AND OBJECTIVES The role of time to surgery (TTS) in long-term outcomes of colon cancer (CC) remains ill-defined. We sought to utilize artificial intelligence (AI) to characterize the drivers of TTS and its prognostic impact. METHODS The National Cancer Database was utilized to identify patients diagnosed with non-metastatic CC between 2004 and 2018. AI models were employed to rank the importance of several sociodemographic, facility, and tumor characteristics in determining TTS and postoperative survival. RESULTS Among 518,983 patients, 137,902 (26.6%) received an intraoperative diagnosis of CC (TTS = 0), while 381,081 (73.4%) underwent elective surgery (TTS > 0) with a median TTS of 19.0 days (interquartile range [IQR]: 7.0-33.0). An AI model identified tumor stage, receipt of adequate lymphadenectomy, histologic grade, lymphovascular invasion, and insurance status as the most important variables associated with TTS = 0. Conversely, the type and location of the treating facility and receipt of adjuvant therapy were among the most important variables for TTS > 0. Notably, TTS was among the most important variables associated with survival, and TTS > 3 weeks was associated with an incremental increase in mortality risk. CONCLUSIONS The identification of factors associated with TTS can help stratify patients most likely to suffer poor outcomes due to prolonged TTS, as well as guide quality improvement initiatives related to timely surgical care.
12. A Versatile Method for Quantitative Analysis of Total Iron Content in Iron Ore Using Laser-Induced Breakdown Spectroscopy. Appl Spectrosc 2023; 77:140-150. [PMID: 36348501] [DOI: 10.1177/00037028221141102]
Abstract
The focus of quality assessment of iron ore is the total iron (TFe) content. Laser-induced breakdown spectroscopy (LIBS) offers rapid, in situ, real-time multielement analysis of iron ore, but its application to quantifying TFe content is hindered by the iron matrix effect and the lack of suitable data mining tools. Here, a new LIBS-based method, a variable importance back propagation artificial neural network (VI-BP-ANN), for quantifying TFe content in iron ore is proposed for the first time. After LIBS spectra of 80 representative iron samples were obtained, a random forest (RF) was optimized by out-of-bag (OOB) error and then used to measure and rank variable importance. The variable importance thresholds and the number of neurons were optimized with five-fold cross-validation (CV) using the correlation coefficient (R2) and root mean square error (RMSE). Using only 1.40% of the full spectral variables to construct the BP-ANN model, the resulting R2, root mean squared error of prediction (RMSEP), and modeling time of the final VI-BP-ANN model were 0.9450, 0.3174 wt%, and 24 s, respectively. Compared with full-spectrum-based models (e.g., BP-ANN, RF, support vector machine (SVM), and PLS) and the VI-RF model, the VI-BP-ANN model reduced overfitting and obtained the highest R2 and the lowest RMSE for both calibration and prediction. The characteristics of the variables selected by VI were also analyzed. In addition to the elemental emission lines of Ca, Al, Na, K, Mn, Si, Mg, Ti, Zr, and Li, parts of the spectral baseline at 540-610 nm and 820-970 nm were selected as characteristic variables, indicating that VI can take full account of elemental interactions and spectral baselines. Our approach shows that LIBS combined with VI-BP-ANN can quantify TFe content in iron ore rapidly and accurately.
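The thresholding step, keeping only the top-ranked fraction of spectral variables, is straightforward. A sketch mirroring the paper's use of 1.40% of the full spectrum (the importance values below are hypothetical):

```python
def select_top_fraction(importances, fraction=0.014):
    """Return the indices of the highest-importance variables; 0.014
    mirrors the paper's use of 1.40% of the full spectrum."""
    k = max(1, round(len(importances) * fraction))
    ranked = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
    return sorted(ranked[:k])

# 500 hypothetical spectral variables, seven of which carry signal:
importances = [0.01] * 500
for j, v in zip([3, 10, 42, 99, 200, 350, 480], [1, 2, 3, 4, 5, 6, 7]):
    importances[j] = v
selected = select_top_fraction(importances)   # the seven spiked indices
```

In the paper the importance values come from the OOB-optimized random forest, and the fraction itself is tuned by cross-validation rather than fixed.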
13. Determinants and prediction of Chlamydia trachomatis re-testing and re-infection within 1 year among heterosexuals with chlamydia attending a sexual health clinic. Front Public Health 2023; 10:1031372. [PMID: 36711362] [PMCID: PMC9880158] [DOI: 10.3389/fpubh.2022.1031372]
Abstract
Background Chlamydia trachomatis (chlamydia) is one of the most common sexually transmitted infections (STI) globally, and re-infections are common. Current Australian guidelines recommend re-testing for chlamydia 3 months after treatment to identify possible re-infection. Patient-delivered partner therapy (PDPT) has been proposed to control chlamydia re-infection among heterosexuals. We aimed to identify determinants and prediction of chlamydia re-testing and re-infection within 1 year among heterosexuals with chlamydia, to identify potential PDPT candidates. Methods Our baseline data included 5,806 heterosexuals with chlamydia aged ≥18 years, of whom 2,070 were re-tested for chlamydia within 1 year of their chlamydia diagnosis at the Melbourne Sexual Health Centre from January 2, 2015, to May 15, 2020. We used routinely collected electronic health record (EHR) variables and machine learning models to predict chlamydia re-testing and re-infection events. We also used logistic regression to investigate factors associated with chlamydia re-testing and re-infection. Results In total, 2,070 (36%) of 5,806 heterosexuals with chlamydia were re-tested for chlamydia within 1 year. Among those re-tested, 307 (15%) were re-infected. Multivariable logistic regression analysis showed that older age (≥35 years), female gender, living with HIV, being a current sex worker, use of patient-delivered partner therapy, and a higher number of sex partners were associated with increased odds of chlamydia re-testing within 1 year. Multivariable logistic regression analysis also showed that younger age (18-24 years), male gender, and living with HIV were associated with increased odds of chlamydia re-infection within 1 year. The XGBoost model was the best model for predicting chlamydia re-testing and re-infection within 1 year among heterosexuals with chlamydia; however, machine learning approaches using these routinely collected and self-reported variables did not achieve good predictive performance (AUC < 60.0%). 
Conclusion The low rate of chlamydia re-testing and the high rate of chlamydia re-infection among heterosexuals with chlamydia highlight the need for further interventions. Better targeting of individuals most likely to be re-infected is needed to optimize the provision of PDPT and to encourage re-testing for re-infection at 3 months.
14. Risk factors affecting patients survival with colorectal cancer in Morocco: Survival Analysis using an Interpretable Machine Learning Approach. Research Square 2023; rs.3.rs-2435106 (preprint). [PMID: 36711858] [PMCID: PMC9882696] [DOI: 10.21203/rs.3.rs-2435106/v1]
Abstract
The aim of our study was to assess overall survival rates for colorectal cancer patients in Morocco and to identify strong prognostic factors using a novel approach combining random survival forests (RSF) and the Cox model. Covariate selection was performed using permutation-based variable importance, and partial dependence plots were displayed to explore in depth the relationship between the estimated partial effect of a given predictor and survival rates. Predictive performance was measured by two metrics: the concordance index (C-index) and the Brier score (BS). Overall survival rates at 1, 2, and 3 years were, respectively, 87% (SE = 0.02; 95% CI = 0.84-0.91), 77% (SE = 0.02; 95% CI = 0.73-0.82), and 60% (SE = 0.03; 95% CI = 0.54-0.66). In the Cox model, after adjustment for all covariates, sex and tumor differentiation had no significant effect on prognosis, whereas tumor site did. The variable importance obtained from the RSF confirmed that surgery, stage, insurance, residency, and age were the most important prognostic factors. The discriminative capacity of the Cox PH model and the RSF was, respectively, 0.771 and 0.798 by the C-index, while their predictive accuracy was, respectively, 0.257 and 0.207 by the Brier score. This shows that the RSF had both better discriminative capacity and better predictive accuracy. Our results show that patients who are older than 70, living in rural areas, without health insurance, diagnosed at a distant stage, and who have not had surgery constitute a subgroup with poor prognosis.
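The C-index used to compare the Cox model and the RSF can be computed from first principles. A minimal sketch of Harrell's C for right-censored data (toy inputs, not the study's data):

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index: among comparable pairs (the earlier time is an
    observed event), count pairs where the higher-risk subject failed first."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # A pair is comparable if subject i has an observed event
            # strictly before subject j's (event or censoring) time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: the third subject is censored (event = 0).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.5, 0.1]
c = concordance_index(times, events, risk)   # 1.0: risk order matches failure order
```

A C-index of 0.5 corresponds to random ranking, which is why 0.771 vs 0.798 is a meaningful, if modest, difference.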
15. Variable importance evaluation with personalized odds ratio for machine learning model interpretability with applications to electronic health records-based mortality prediction. Stat Med 2023; 42:761-780. [PMID: 36601725] [DOI: 10.1002/sim.9642]
Abstract
The interpretability of machine learning models, even those with excellent prediction performance, remains a challenge in practical applications. This study investigates model interpretability and variable importance for well-performing supervised machine learning models. Based on the commonly accepted concept of the odds ratio (OR), we propose a novel and computationally efficient Variable Importance evaluation framework based on the Personalized Odds Ratio (VIPOR). It is a model-agnostic interpretation method that can be used to evaluate variable importance both locally and globally. Locally, variable importance is quantified by the personalized odds ratio (POR), which can account for subject heterogeneity in machine learning. Globally, we utilize a hierarchical tree to group the predictors into five groups: completely positive, completely negative, positive dominated, negative dominated, and neutral. The relative importance of predictors within each group is ranked based on different statistics of the PORs across subjects, depending on the application. For illustration, we apply the proposed VIPOR method to interpreting a multilayer perceptron (MLP) model that predicts the mortality of subarachnoid hemorrhage (SAH) patients using real-world electronic health record (EHR) data. We compare the important variables derived from the MLP with those from other machine learning models, including tree-based models and the L1-regularized logistic regression model. The top important variables are consistently identified by VIPOR across the different prediction models. Comparisons with existing interpretation methods are also conducted and discussed using publicly available data sets.
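The personalized odds ratio can be sketched for any model with probability output: perturb one predictor for one subject and take the ratio of predicted odds. For a plain logistic model this reduces to exp(w_j * delta), a useful sanity check. The weights below are made up, and the logistic model merely stands in for an arbitrary black box:

```python
import math

def predict_prob(weights, bias, x):
    """Probability from a logistic-type model (stand-in for any black-box
    classifier with probability output)."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    return 1 / (1 + math.exp(-z))

def personalized_or(model_prob, x, j, delta=1.0):
    """Odds ratio for subject x when predictor j is increased by delta,
    holding the subject's other features fixed."""
    p0 = model_prob(x)
    x1 = list(x)
    x1[j] += delta
    p1 = model_prob(x1)
    return (p1 / (1 - p1)) / (p0 / (1 - p0))

model = lambda x: predict_prob([0.5, -1.2], 0.3, x)
por = personalized_or(model, [1.0, 2.0], 0)   # equals exp(0.5) for a logistic model
```

For genuinely nonlinear models the POR varies across subjects, which is exactly the heterogeneity VIPOR summarizes globally.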
|
16
|
Ensemble classification based feature selection: a case of identification on plant pentatricopeptide repeat proteins. Brief Bioinform 2022; 23:6760138. [PMID: 36239380 DOI: 10.1093/bib/bbac369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Revised: 07/20/2022] [Accepted: 08/05/2022] [Indexed: 12/14/2022] Open
Abstract
A framework of variable selection has previously been proposed to identify plant pentatricopeptide repeat (PPR) proteins; in effect, it is a feature selection strategy that focuses on classification performance. Random forest was used as the classifier, with certain variables automatically selected for discriminating between PPR functional and non-functional proteins. However, samples regarded as PPR functional proteins were misclassified at a high rate. In this paper, we improve the framework to achieve better classification results, making modifications that better identify PPR functional proteins. Instead of random forest, a hybrid ensemble classifier is built whose base classifiers derive from six different classification methods. In addition, an incremental strategy and clustering by search in descending order are used alternately for feature selection, which can effectively select the most representative variables for identifying PPR proteins. We also find that different base classifiers alternately play an important role in the ensemble classifier as the feature dimension increases. The experimental results demonstrate the effectiveness of our improvements.
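The incremental selection loop can be sketched generically; the scoring function below is a hypothetical stand-in for the hybrid ensemble's cross-validated accuracy:

```python
def incremental_select(features, score, tol=0.0):
    """Greedy incremental feature selection: repeatedly add the candidate
    variable that most improves score(subset); stop once no candidate helps
    by more than tol. A sketch of the strategy, not the paper's code."""
    selected, best = [], float("-inf")
    remaining = list(features)
    while remaining:
        s, f = max((score(selected + [f]), f) for f in remaining)
        if s <= best + tol:
            break
        selected.append(f)
        remaining.remove(f)
        best = s
    return selected, best

# Hypothetical scorer: variables 'a' and 'c' are informative, 'b' is not,
# and each added variable costs a small complexity penalty.
gains = {"a": 0.4, "b": 0.01, "c": 0.3}
score = lambda sub: sum(gains[f] for f in sub) - 0.02 * len(sub)
sel, best = incremental_select(["a", "b", "c"], score)
print(sel, round(best, 2))  # ['a', 'c'] 0.66
```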
|
17
|
A general framework for inference on algorithm-agnostic variable importance. J Am Stat Assoc 2022; 118:1645-1658. [PMID: 37982008 PMCID: PMC10652709 DOI: 10.1080/01621459.2021.2003200] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2020] [Accepted: 10/31/2021] [Indexed: 10/19/2022]
Abstract
In many applications, it is of interest to assess the relative contribution of features (or subsets of features) toward the goal of predicting a response - in other words, to gauge the variable importance of features. Most recent work on variable importance assessment has focused on describing the importance of features within the confines of a given prediction algorithm. However, such assessment does not necessarily characterize the prediction potential of features, and may provide a misleading reflection of the intrinsic value of these features. To address this limitation, we propose a general framework for nonparametric inference on interpretable algorithm-agnostic variable importance. We define variable importance as a population-level contrast between the oracle predictiveness of all available features versus all features except those under consideration. We propose a nonparametric efficient estimation procedure that allows the construction of valid confidence intervals, even when machine learning techniques are used. We also outline a valid strategy for testing the null importance hypothesis. Through simulations, we show that our proposal has good operating characteristics, and we illustrate its use with data from a study of an antibody against HIV-1 infection.
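The population-level contrast can be sketched as: importance of a subset equals predictiveness using all features minus predictiveness using all features except that subset. The oracle values below are hypothetical placeholders for cross-validated R² from any fitted learner:

```python
def vim(fit_score, features, subset):
    """Algorithm-agnostic importance of `subset`: predictiveness using all
    features minus predictiveness using all features except `subset`.
    `fit_score(feats)` should return cross-validated predictiveness (e.g. R^2)."""
    full = fit_score(features)
    reduced = fit_score([f for f in features if f not in subset])
    return full - reduced

# Hypothetical cross-validated R^2 for each candidate feature set.
oracle = {frozenset(["x1", "x2"]): 0.62, frozenset(["x1"]): 0.58,
          frozenset(["x2"]): 0.10, frozenset(): 0.0}
fit_score = lambda feats: oracle[frozenset(feats)]
print(round(vim(fit_score, ["x1", "x2"], {"x1"}), 2))  # 0.62 - 0.10 = 0.52
```

The paper's contribution is the efficient estimator and confidence intervals for this contrast; the sketch only shows the quantity being estimated.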
|
18
|
Risk controlled decision trees and random forests for precision Medicine. Stat Med 2021; 41:719-735. [PMID: 34786731 PMCID: PMC8863134 DOI: 10.1002/sim.9253] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2021] [Revised: 10/15/2021] [Accepted: 10/15/2021] [Indexed: 11/08/2022]
Abstract
Statistical methods generating individualized treatment rules (ITRs) often focus on maximizing expected benefit, but these rules may expose patients to excess risk. For instance, aggressive treatment of type 2 diabetes (T2D) with insulin therapies may result in an ITR which controls blood glucose levels but increases rates of hypoglycemia, diminishing the appeal of the ITR. This work proposes two methods to identify risk-controlled ITRs (rcITR), a class of ITR which maximizes a benefit while controlling risk at a prespecified threshold. A novel penalized recursive partitioning algorithm is developed which optimizes an unconstrained, penalized value function. The final rule is a risk-controlled decision tree (rcDT) that is easily interpretable. A natural extension of the rcDT model, risk controlled random forests (rcRF), is also proposed. Simulation studies demonstrate the robustness of rcRF modeling. Three variable importance measures are proposed to further guide clinical decision-making. Both rcDT and rcRF procedures can be applied to data from randomized controlled trials or observational studies. An extensive simulation study interrogates the performance of the proposed methods. A data analysis of the DURABLE diabetes trial in which two therapeutics were compared is additionally presented. An R package implements the proposed methods ( https://github.com/kdoub5ha/rcITR).
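The unconstrained penalized objective can be sketched as mean benefit minus a penalty on risk above the prespecified threshold; the penalty form and the numbers below are illustrative, not the paper's exact specification:

```python
def penalized_value(benefit, risk, lam=10.0, tau=0.15):
    """Penalized value of a candidate treatment rule: mean benefit under the
    rule, minus a penalty whenever mean risk exceeds the threshold tau.
    A sketch of the rcITR-style objective, with an assumed hinge penalty."""
    b = sum(benefit) / len(benefit)
    r = sum(risk) / len(risk)
    return b - lam * max(0.0, r - tau)

# Two hypothetical rules: rule B gains benefit but violates the risk cap.
rule_a = penalized_value([1.0, 0.8, 0.9], [0.10, 0.12, 0.11])
rule_b = penalized_value([1.4, 1.3, 1.5], [0.30, 0.28, 0.35])
print(rule_a > rule_b)  # True: the risk penalty overturns rule B's benefit edge
```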
|
19
|
Variable selection with missing data in both covariates and outcomes: Imputation and machine learning. Stat Methods Med Res 2021; 30:2651-2671. [PMID: 34696650 DOI: 10.1177/09622802211046385] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Variable selection in the presence of both missing covariates and outcomes is an important statistical research topic. Parametric regression models are susceptible to misspecification and, as a result, are suboptimal for variable selection. Flexible machine learning methods mitigate the reliance on parametric assumptions, but do not provide a variable importance measure as naturally defined as the covariate effects native to parametric models. We investigate a general variable selection approach when both the covariates and outcomes can be missing at random and have general missing data patterns. This approach exploits the flexibility of machine learning models and bootstrap imputation, and is amenable to nonparametric methods in which the covariate effects are not directly available. We conduct extensive simulations investigating the practical operating characteristics of the proposed variable selection approach when combined with four tree-based machine learning methods (extreme gradient boosting, random forests, Bayesian additive regression trees, and conditional random forests) and two commonly used parametric methods (lasso and backward stepwise selection). Numeric results suggest that extreme gradient boosting and Bayesian additive regression trees have the overall best variable selection performance with respect to the F1 score and Type I error, while the lasso and backward stepwise selection have subpar performance across various settings. There is no significant difference in variable selection performance attributable to the imputation method. We further demonstrate the methods via a case study of risk factors for 3-year incidence of metabolic syndrome with data from the Study of Women's Health Across the Nation.
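The bootstrap-imputation selection loop might look like the following sketch, with mean imputation and a trivial "learner" standing in for the richer imputation models and tree-based importance steps used in the paper:

```python
import random

def mean_impute(rows):
    # Replace None with the observed column mean (a simple stand-in for the
    # paper's imputation models); an all-missing column falls back to 0.
    cols = list(zip(*rows))
    means = []
    for c in cols:
        obs = [v for v in c if v is not None]
        means.append(sum(obs) / len(obs) if obs else 0.0)
    return [[m if v is None else v for v, m in zip(r, means)] for r in rows]

def bootstrap_select(rows, pick_vars, n_boot=200, thresh=0.5, seed=1):
    """Keep variables chosen by `pick_vars` (a fitted-learner importance step)
    in at least `thresh` of the bootstrap-imputed datasets."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        boot = [rng.choice(rows) for _ in rows]
        for j in pick_vars(mean_impute(boot)):
            counts[j] = counts.get(j, 0) + 1
    return sorted(j for j, c in counts.items() if c / n_boot >= thresh)

# Toy data with missing covariates; a hypothetical "learner" that flags
# columns whose imputed mean exceeds 1.
rows = [[1.0, None], [None, 0.2], [3.0, 0.4], [2.0, None]]
picker = lambda data: [j for j, c in enumerate(zip(*data)) if sum(c) / len(c) > 1]
print(bootstrap_select(rows, picker))
```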
|
20
|
Predicting 1-Year Mortality after Hip Fracture Surgery: An Evaluation of Multiple Machine Learning Approaches. J Pers Med 2021; 11:jpm11080727. [PMID: 34442370 PMCID: PMC8401745 DOI: 10.3390/jpm11080727] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2021] [Revised: 07/24/2021] [Accepted: 07/24/2021] [Indexed: 12/15/2022] Open
Abstract
Postoperative death within 1 year following hip fracture surgery is reported to be up to 27%. In the current study, we benchmarked the predictive precision and accuracy of the support vector machine (SVM), naïve Bayes classifier (NB), and random forest classifier (RF) algorithms against logistic regression (LR) in predicting 1-year postoperative mortality in hip fracture patients, and assessed the relative importance of the variables included in the LR model. All adult patients who underwent primary emergency hip fracture surgery in Sweden between 1 January 2008 and 31 December 2017 were included in the study. Patients with pathological fractures and non-operatively managed hip fractures, as well as those who died within 30 days after surgery, were excluded from the analysis. An LR model with elastic net regularization was fitted and compared to NB, SVM, and RF. The relative importance of the variables in the LR model was then evaluated using permutation importance. The LR model including all the variables demonstrated an acceptable predictive ability on both the training and test datasets for predicting one-year postoperative mortality (area under the curve (AUC) = 0.74 and 0.74, respectively). NB, SVM, and RF tended to over-predict mortality, particularly the NB and SVM algorithms. In contrast, LR only over-predicted mortality when the predicted probability of mortality was larger than 0.7. The LR algorithm outperformed the other three algorithms in predicting 1-year postoperative mortality in hip fracture patients. The most important predictors of 1-year mortality were the presence of a metastatic carcinoma, American Society of Anesthesiologists (ASA) classification, sex, Charlson Comorbidity Index (CCI) ≤ 4, age, dementia, congestive heart failure, hypertension, surgery using pins/screws, and chronic kidney disease.
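Permutation importance, as used for the LR model above, can be sketched in a few lines; the toy model and data are illustrative, not the study's:

```python
import random

def permutation_importance(score, X, y, j, n_rep=100, seed=0):
    """Average drop in model score after shuffling column j: large drops mark
    important variables. `score(X, y)` wraps the fitted model's metric."""
    rng = random.Random(seed)
    base = score(X, y)
    drops = []
    for _ in range(n_rep):
        col = [row[j] for row in X]
        rng.shuffle(col)
        Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
        drops.append(base - score(Xp, y))
    return sum(drops) / n_rep

# Toy "model": accuracy of predicting y = 1 whenever feature 0 is positive.
X = [[1, 5], [-1, 5], [2, 5], [-2, 5]]
y = [1, 0, 1, 0]
acc = lambda X, y: sum((x[0] > 0) == bool(t) for x, t in zip(X, y)) / len(y)
print(permutation_importance(acc, X, y, j=0) > 0)   # informative feature 0
print(permutation_importance(acc, X, y, j=1) == 0)  # constant feature 1
```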
|
21
|
Predictive Values of Preoperative Characteristics for 30-Day Mortality in Traumatic Hip Fracture Patients. J Pers Med 2021; 11:353. [PMID: 33924993 PMCID: PMC8146802 DOI: 10.3390/jpm11050353] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/02/2021] [Revised: 03/21/2021] [Accepted: 04/23/2021] [Indexed: 12/23/2022] Open
Abstract
Hip fracture patients have a high risk of mortality after surgery, with 30-day postoperative rates as high as 10%. This study aimed to explore the predictive ability of preoperative characteristics in traumatic hip fracture patients as they relate to 30-day postoperative mortality, using variables readily available in clinical practice. All adult patients who underwent primary emergency hip fracture surgery in Sweden between 2008 and 2017 were included in the analysis. Associations between the possible predictors and 30-day mortality were assessed using a multivariate logistic regression (LR) model; the bidirectional stepwise method was used for variable selection. An LR model and a convolutional neural network (CNN) were then fitted for prediction. The relative importance of individual predictors was evaluated using permutation importance and Gini importance. A total of 134,915 traumatic hip fracture patients were included in the study. The CNN and LR models displayed an acceptable predictive ability for 30-day postoperative mortality on a test dataset, with an area under the ROC curve (AUC) as high as 0.76. The variables with the highest importance in prediction were age, sex, hypertension, dementia, American Society of Anesthesiologists (ASA) classification, and the Revised Cardiac Risk Index (RCRI). Both the CNN and LR models achieved an acceptable performance in identifying patients at risk of mortality 30 days after hip fracture surgery. The most important variables for prediction, based on the variables used in the current study, are age, hypertension, dementia, sex, ASA classification, and RCRI.
|
22
|
Model and variable selection using machine learning methods with applications to childhood stunting in Bangladesh. Inform Health Soc Care 2021; 46:425-442. [PMID: 33851897 DOI: 10.1080/17538157.2021.1904938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/21/2022]
Abstract
Childhood stunting is a serious public health concern in Bangladesh. Earlier research used conventional statistical methods to identify the risk factors of stunting, and very little is known about the applications and usefulness of machine learning (ML) methods, which can identify the risk factors of various health conditions from complex data. This research evaluates the performance of ML methods in predicting stunting among under-5 children using 2014 Bangladesh Demographic and Health Survey data. It also identifies the variables that are most important for predicting stunting in Bangladesh. Among the selected ML methods, gradient boosting provides the smallest misclassification error in predicting stunting, followed by random forests, support vector machines, classification trees, and logistic regression with forward-stepwise selection. The top 10 important variables (in order of importance) for predicting childhood stunting in Bangladesh are child age, wealth index, maternal education, preceding birth interval, paternal education, division, household size, maternal age at first birth, maternal nutritional status, and parental age. Our study shows that ML can support the building of prediction models and highlights the demographic, socioeconomic, nutritional, and environmental factors relevant to understanding stunting in Bangladesh.
|
23
|
Nonparametric variable importance assessment using machine learning techniques. Biometrics 2021; 77:9-22. [PMID: 33043428 PMCID: PMC7946807 DOI: 10.1111/biom.13392] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2018] [Revised: 03/20/2019] [Accepted: 03/22/2019] [Indexed: 01/04/2023]
Abstract
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data-generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
|
24
|
Predictors of Survival among Male and Female Patients with Malignant Pleural Mesothelioma: A Random Survival Forest Analysis of Data from the 2000-2017 Surveillance, Epidemiology, and End Results Program. JOURNAL OF REGISTRY MANAGEMENT 2021; 48:118-125. [PMID: 35413729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
BACKGROUND Malignant pleural mesothelioma (MPM) is a rare and aggressive malignancy with a dismal prognosis. We aimed to identify predictors of survival among male and female MPM patients in the United States. METHODS We identified MPM cases reported by 18 cancer registries in the Surveillance, Epidemiology, and End Results Program (2000-2017). We applied a random survival forest (RSF) algorithm to identify and rank the importance of 10 variables at patient, cancer, and area level in predicting all-cause survival overall and by female and male subgroups. RESULTS Approximately 91.4% (n = 11,160) of the MPM patients had died, with better survival among females than males (11.7% vs 7.8%). The median follow-up time was 7 months (interquartile range, 2-17 months). A majority of the patients were male (78.6%), non-Hispanic White (81.8%), and residing in metropolitan counties with a population greater than 1 million (63.7%). The top 3 factors for predicting overall MPM survival were age, histological type, and cancer-directed surgery status. Except for age, the relative ranking of covariates varied by the 3 sample groups. Stage ranked fifth in predicting female survival, while it was replaced by metastasis status for male and overall patients. Race/ethnicity was not a good predictor for survival among MPM patients overall or the male subgroup, but ranked sixth for predicting survival among females. Median household income was not a good predictor for survival among females. CONCLUSION We demonstrated that RSF successfully identified predictors of MPM survival. RSF is a viable complement to the commonly used Cox proportional hazard model and a viable alternative, particularly when the proportional hazard assumption is unmet. RSF also identified differences between the sexes, which may help explain the sex differences in MPM survival rates.
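Random survival forests are commonly evaluated with Harrell's concordance index, which needs no proportional hazards assumption; here is a small sketch of that metric with illustrative data (not SEER values):

```python
def concordance_index(times, events, risk):
    """Harrell's C: among comparable pairs (the shorter follow-up ends in an
    observed event), the fraction where the shorter survival carries the
    higher predicted risk; ties in risk count half."""
    conc = ties = usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[i] < times[j] and events[i]:  # pair is comparable
                usable += 1
                if risk[i] > risk[j]:
                    conc += 1
                elif risk[i] == risk[j]:
                    ties += 1
    return (conc + 0.5 * ties) / usable

# Hypothetical follow-up months, death indicators, and model risk scores.
c = concordance_index([7, 2, 17, 30], [1, 1, 1, 0], [0.4, 0.9, 0.2, 0.1])
print(c)  # 1.0: predicted risk orders every comparable pair correctly
```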
|
25
|
Identifying Plant Pentatricopeptide Repeat Proteins Using a Variable Selection Method. FRONTIERS IN PLANT SCIENCE 2021; 12:506681. [PMID: 33732270 PMCID: PMC7957076 DOI: 10.3389/fpls.2021.506681] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/22/2019] [Accepted: 02/08/2021] [Indexed: 05/05/2023]
Abstract
Motivation: Pentatricopeptide repeat (PPR), which is a triangular pentapeptide repeat domain, plays an important role in plant growth. Features extracted from sequences are applicable to PPR protein identification using certain classification methods. However, which components of a multidimensional feature (namely, variables) are more effective for protein discrimination has never been discussed. We therefore seek to select variables from a multidimensional feature for identifying PPR proteins. Method: A framework of variable selection for identifying PPR proteins is proposed. Samples representing PPR positive and negative proteins are split equally into a training and a testing set. Variable importance is scored through iterations of resampling, training, and scoring on the training set. A model selection method based on a Gaussian mixture model is applied to automatically choose variables that are effective for identifying PPR proteins. Measurements on the testing set show the effectiveness of the selected variables. Results: Certain variables, rather than the full multidimensional feature to which they belong, suffice to discriminate PPR positive proteins from negative ones. In addition, the content of methionine may play an important role in predicting PPR proteins.
|
26
|
Inference and Prediction Diverge in Biomedicine. PATTERNS 2020; 1:100119. [PMID: 33294865 PMCID: PMC7691397 DOI: 10.1016/j.patter.2020.100119] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/16/2020] [Revised: 08/17/2020] [Accepted: 09/14/2020] [Indexed: 02/04/2023]
Abstract
In the 20th century, many advances in biological knowledge and evidence-based medicine were supported by p values and accompanying methods. In the early 21st century, ambitions toward precision medicine place a premium on detailed predictions for single individuals. The shift causes tension between traditional regression methods used to infer statistically significant group differences and burgeoning predictive analysis tools suited to forecast an individual's future. Our comparison applies linear models for identifying significant contributing variables and for finding the most predictive variable sets. In systematic data simulations and common medical datasets, we explored how variables identified as significantly relevant and variables identified as predictively relevant can agree or diverge. Across analysis scenarios, even small predictive performances typically coincided with finding underlying significant statistical relationships, but not vice versa. More complete understanding of different ways to define "important" associations is a prerequisite for reproducible research and advances toward personalizing medical care. Highlights: we systematically juxtapose variable selection using significance versus prediction; successful prediction often coincided with significant p values; yet strong statistical significance did not always coincide with predictive value; understanding the inference-prediction dilemma is imperative for precision medicine.
Across research communities, the analysis goals of inference and prediction are two sides of the same coin. Many empirical studies leaning on statistical significance typically focus interpretation on the best p values obtained for one or a few variables. In contrast, many empirical studies dedicated to prediction are backed up by cross-validated model performance on fresh data points. In a future of single-patient prediction from big biomedical data, it may become central to recognize that modeling for inference and modeling for prediction are related but importantly different. The relevant subset of variables identified based on p values or based on predictive value can converge or diverge depending on the data scenario. We show that diverging conclusions can emerge even when the data are identical and when widespread linear models are used. Awareness of the relative strengths and weaknesses of both "data-analysis cultures" may become unavoidable in navigating between complementary goals in scientific inquiry.
|
27
|
Length of stay for inpatient incompetent to stand trial patients: importance of clinical and demographic variables. CNS Spectr 2020; 25:734-742. [PMID: 32286208 DOI: 10.1017/s1092852920001273] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/07/2022]
Abstract
OBJECTIVE We investigated clinical and demographic variables to better understand their relationship to hospital length of stay for patients involuntarily committed to California state psychiatric hospitals under the state's incompetent to stand trial (IST) statutes. Additionally, we determined the most important variables in the model that influenced patient length of stay. METHODS We retrospectively studied all patients admitted as IST to California state psychiatric hospitals during the period January 1, 2010 through June 30, 2018 (N = 20 041). Primary diagnosis, total number of violent acts while hospitalized, age at admission, treating hospital, level of functioning at admission, ethnicity, sex, and having had a previous state hospital admission were evaluated using a parametric survival model. RESULTS The analysis showed that the most important variables related to length of stay were (1) diagnosis, (2) number of violent acts while hospitalized, and (3) age at admission. Specifically, longer length of stay was associated with (1) having a diagnosis of schizophrenia or neurocognitive disorder, (2) one or more violent acts, and (3) older age at admission. The other variables studied were also statistically significant, but not as influential in the model. CONCLUSIONS We found significant relationships between length of stay and the variables studied, with the most important variables being (1) diagnosis, (2) number of physically violent acts, and (3) age at admission. These findings emphasize the need for treatments to target cognitive issues in the seriously mentally ill as well as treatment of violence and early identification of violence risk factors.
|
28
|
A note on the interpretation of tree-based regression models. Biom J 2020; 62:1564-1573. [PMID: 32449821 DOI: 10.1002/bimj.201900195] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2019] [Revised: 02/21/2020] [Accepted: 03/12/2020] [Indexed: 02/02/2023]
Abstract
Tree-based models are a popular tool for predicting a response given a set of explanatory variables when the regression function is characterized by a certain degree of complexity. Sometimes, they are also used to identify important variables and for variable selection. We show that if the generating model contains chains of direct and indirect effects, then the typical variable importance measures suggest selecting as important mainly the background variables, which have a strong indirect effect, disregarding the variables that directly influence the response. This is attributable mainly to the variable choice in the first steps of the algorithm selecting the splitting variable and to the greedy nature of such search. This pitfall could be relevant when using tree-based algorithms for understanding the underlying generating process, for population segmentation and for causal inference.
|
29
|
A machine learning-based approach for estimating and testing associations with multivariate outcomes. Int J Biostat 2020; 17:7-21. [PMID: 32784265 DOI: 10.1515/ijb-2019-0061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2019] [Accepted: 06/18/2020] [Indexed: 11/15/2022]
Abstract
We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each group's association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.
|
30
|
Gender-Specific Differences in Patients With Chronic Tinnitus-Baseline Characteristics and Treatment Effects. Front Neurosci 2020; 14:487. [PMID: 32523506 PMCID: PMC7261931 DOI: 10.3389/fnins.2020.00487] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2020] [Accepted: 04/20/2020] [Indexed: 11/29/2022] Open
Abstract
Whilst some studies have identified gender-specific differences, there is no consensus about gender-specific determinants for prevalence rates or concomitant symptoms of chronic tinnitus such as depression or anxiety. However, gender-associated differences in psychological response profiles and coping strategies may differentially affect tinnitus chronification and treatment success rates. Thus, understanding gender-associated differences may facilitate a more detailed identification of symptom profiles, heighten treatment response rates, and help to create access for vulnerable populations that are potentially less visible in clinical settings. Our research questions are: RQ1: how do male and female tinnitus patients differ regarding tinnitus-related distress, depression severity, and treatment response, RQ2: to what extent are answers to questionnaires administered at baseline associated with gender, and RQ3: which baseline questionnaire items are associated with tinnitus distress, depression, and treatment response, while relating to one gender only? In this work, we present a data analysis workflow to investigate gender-specific differences in N = 1,628 patients with chronic tinnitus (828 female, 800 male) who completed a 7-day multimodal treatment encompassing cognitive behavioral therapy (CBT), physiotherapy, auditory attention training, and information counseling components. For this purpose, we extracted 181 variables from 7 self-report questionnaires on socio-demographics, tinnitus-related distress, tinnitus frequency, loudness, localization, and quality as well as physical and mental health status. Our workflow comprises (i) training machine learning models, (ii) a comprehensive evaluation including hyperparameter optimization, and (iii) post-learning steps to identify predictive variables. We found that female patients reported higher levels of tinnitus-related distress, depression and response to treatment (RQ1). 
Female patients also indicated higher levels of tension and stress, and higher rates of psychological coping strategies. By contrast, male patients reported higher levels of bodily pain associated with chronic tinnitus while judging their overall health as better (RQ2). Variables measuring depression, sleep problems, tinnitus frequency, and loudness were associated with tinnitus-related distress in both genders, and indicators of mental health and subjective stress were found to be associated with depression in both genders (RQ3). Our results indicate that gender-associated differences in symptomatology and treatment response profiles point to clinical and conceptual needs for differential diagnostics, case conceptualization, and treatment pathways.
|
31
|
Repeated measures random forests (RMRF): Identifying factors associated with nocturnal hypoglycemia. Biometrics 2020; 77:343-351. [PMID: 32311079 DOI: 10.1111/biom.13284] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2018] [Revised: 04/04/2020] [Accepted: 04/07/2020] [Indexed: 12/19/2022]
Abstract
Nocturnal hypoglycemia is a common phenomenon among patients with diabetes and can lead to a broad range of adverse events and complications. Identifying factors associated with hypoglycemia can improve glucose control and patient care. We propose a repeated measures random forest (RMRF) algorithm that can handle nonlinear relationships and interactions and the correlated responses from patients evaluated over several nights. Simulation results show that our proposed algorithm captures the informative variable more often than naïvely assuming independence. RMRF also outperforms standard random forest and extremely randomized trees algorithms. We demonstrate scenarios where RMRF attains greater prediction accuracy than generalized linear models. We apply the RMRF algorithm to analyze a diabetes study with 2524 nights from 127 patients with type 1 diabetes. We find that nocturnal hypoglycemia is associated with HbA1c, bedtime blood glucose (BG), insulin on board, time system activated, exercise intensity, and daytime hypoglycemia. The RMRF can accurately classify nights at high risk of nocturnal hypoglycemia.
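The key design choice, resampling whole patients rather than individual nights, can be sketched as follows (an illustration of the idea only, not the authors' implementation):

```python
import random

def patient_bootstrap(nights, seed=3):
    """Resample patients (clusters), not individual nights, so correlated
    nights from one patient are never split between in-bag and out-of-bag
    sets when growing trees in a repeated-measures forest."""
    rng = random.Random(seed)
    by_patient = {}
    for night in nights:
        by_patient.setdefault(night["patient"], []).append(night)
    ids = sorted(by_patient)
    sample = [rng.choice(ids) for _ in ids]
    return [night for pid in sample for night in by_patient[pid]]

nights = [{"patient": p, "night": k} for p in ("a", "b", "c") for k in (1, 2)]
boot = patient_bootstrap(nights)
# Every drawn patient contributes all of their nights to the bootstrap sample.
patients = {n["patient"] for n in boot}
print(all(sum(n["patient"] == p for n in boot) % 2 == 0 for p in patients))  # True
```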
|
32
|
Composition of cometary particles collected during two periods of the Rosetta mission: multivariate evaluation of mass spectral data. JOURNAL OF CHEMOMETRICS 2020; 34:e3218. [PMID: 32355406 PMCID: PMC7187198 DOI: 10.1002/cem.3218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/08/2019] [Revised: 12/30/2019] [Accepted: 01/07/2020] [Indexed: 06/11/2023]
Abstract
The instrument COSIMA (COmetary Secondary Ion Mass Analyzer) on board the European Space Agency's Rosetta mission collected and analyzed dust particles in the neighborhood of comet 67P/Churyumov-Gerasimenko. The chemical composition of the particle surfaces was characterized by time-of-flight secondary ion mass spectrometry. A set of 2213 spectra was selected, and relative abundances for CH-containing positive ions as well as positive elemental ions define a set of multivariate data with nine variables. Evaluation by complementary chemometric techniques shows different compositions of sample groups collected during two periods of the mission. The first period was August to November 2014 (far from the Sun); the second period was January 2015 to February 2016 (nearer to the Sun). The applied data evaluation methods consider the compositional nature of the mass spectral data and comprise robust principal component analysis as well as classification with discriminant partial least squares regression, k-nearest neighbor search, and random forest decision trees. The results indicate a high importance of the relative abundances of the secondary ions C+ and Fe+ for the group separation and demonstrate an enhanced content of carbon-containing substances in samples collected in the period with smaller distances to the Sun.
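"Considering the compositional nature of the mass spectral data" refers to the fact that relative abundances sum to a constant, so multivariate methods are usually applied after a log-ratio transformation. A minimal sketch of the centered log-ratio (clr) step followed by PCA, on simulated compositions (the data are invented, not COSIMA spectra):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# invented relative abundances: 200 spectra, nine ion variables, rows sum to 1
counts = rng.dirichlet(alpha=np.ones(9), size=200)

def clr(x):
    # centered log-ratio: log of each part relative to the row's geometric
    # mean; the standard way to open up closed (constant-sum) compositions
    logx = np.log(x)
    return logx - logx.mean(axis=1, keepdims=True)

scores = PCA(n_components=2).fit_transform(clr(counts))
print(scores.shape)
```

Each clr-transformed row sums to zero, which removes the artificial negative correlations induced by the constant-sum constraint before ordination or classification.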
|
33
|
Abstract
Supervised neural networks have been applied as a machine learning technique to identify and predict emergent patterns among multiple variables. A common criticism of these methods is the inability to characterize relationships among variables from a fitted model. Although several techniques have been proposed to "illuminate the black box", they have not been made available in an open-source programming environment. This article describes the NeuralNetTools package that can be used for the interpretation of supervised neural network models created in R. Functions in the package can be used to visualize a model using a neural network interpretation diagram, evaluate variable importance by disaggregating the model weights, and perform a sensitivity analysis of the response variables to changes in the input variables. Methods are provided for objects from many of the common neural network packages in R, including caret, neuralnet, nnet, and RSNNS. The article provides a brief overview of the theoretical foundation of neural networks, a description of the package structure and functions, and an applied example to provide a context for model development with NeuralNetTools. Overall, the package provides a toolset for neural networks that complements existing quantitative techniques for data-intensive exploration.
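The "disaggregating the model weights" step that NeuralNetTools implements (in R) is commonly attributed to Garson's algorithm for single-hidden-layer networks. A small Python sketch of that algorithm, with made-up weight matrices:

```python
import numpy as np

def garson(W_ih, w_ho):
    """Garson's weight-disaggregation importance for a network with one
    hidden layer. W_ih: (n_inputs, n_hidden) input-to-hidden weights;
    w_ho: (n_hidden,) hidden-to-output weights for a single output."""
    contrib = np.abs(W_ih) * np.abs(w_ho)          # input share per hidden node
    contrib /= contrib.sum(axis=0, keepdims=True)  # normalize within each node
    imp = contrib.sum(axis=1)
    return imp / imp.sum()                         # importances sum to 1

# made-up weights: input 0 feeds the first hidden node strongly
W_ih = np.array([[2.0, 0.1], [0.5, 0.1], [0.1, 0.1]])
w_ho = np.array([1.0, 0.5])
print(garson(W_ih, w_ho).round(3))
```

Note that because each hidden node's column is normalized, the output weights cancel in this classic formulation — a known criticism that motivated the package's additional sensitivity-analysis functions.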
|
34
|
New Quantitative Structure-Activity Relationship Model for Angiotensin-Converting Enzyme Inhibitory Dipeptides Based on Integrated Descriptors. JOURNAL OF AGRICULTURAL AND FOOD CHEMISTRY 2017; 65:9774-9781. [PMID: 28984136 DOI: 10.1021/acs.jafc.7b03367] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/27/2023]
Abstract
Angiotensin-converting enzyme (ACE) inhibitory peptides derived from food proteins have been widely reported for hypertension treatment. In this paper, a benchmark data set containing 141 unique ACE inhibitory dipeptides was constructed through database mining, and a quantitative structure-activity relationship (QSAR) study was carried out to predict the half-inhibitory concentration (IC50) of ACE activity. Sixteen descriptors were tested, and the model generated by the G-scale descriptor showed the best predictive performance, with a coefficient of determination (R2) and cross-validated R2 (Q2) of 0.6692 and 0.6220, respectively. For most other descriptors, R2 ranged from 0.52 to 0.68 and Q2 ranged from 0.48 to 0.61. A combined model integrating all 16 descriptors was then constructed, and variable selection was performed to further improve predictive performance. The quality of the model using integrated descriptors (R2 0.7340 ± 0.0038, Q2 0.7151 ± 0.0019) was better than that of the G-scale alone. An in-depth analysis of variable importance showed that the properties most correlated with ACE inhibitory activity were hydrophobicity, steric, and electronic properties, and that C-terminal amino acids contributed more than N-terminal amino acids. Five novel predicted ACE-inhibitory peptides were synthesized, and their IC50 values were validated through in vitro experiments. The results indicated that the constructed model gives a reliable prediction of the ACE-inhibitory activity of peptides and may be useful in the design of novel ACE-inhibitory peptides.
|
35
|
Kernel-Based Measure of Variable Importance for Genetic Association Studies. Int J Biostat 2017. [PMID: 28628480 DOI: 10.1515/ijb-2016-0087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
The identification of genetic variants that are associated with disease risk is an important goal of genetic association studies. Standard approaches perform univariate analysis where each genetic variant, usually Single Nucleotide Polymorphisms (SNPs), is tested for association with disease status. Though many genetic variants have been identified and validated so far using this univariate approach, for most complex diseases a large part of their genetic component is still unknown, the so-called missing heritability. We propose a Kernel-based measure of variable importance (KVI) that provides the contribution of a SNP, or a group of SNPs, to the joint genetic effect of a set of genetic variants. KVI can be used for ranking genetic markers individually, sets of markers that form blocks of linkage disequilibrium, or sets of genetic variants that lie in a gene or a genetic pathway. We prove that, unlike the univariate analysis, KVI captures the relationship with other genetic variants in the analysis, even when measured at the individual level for each genetic variable separately. This is especially relevant and powerful for detecting genetic interactions. We illustrate the results with data from an Alzheimer's disease study and show through simulations that the rankings based on KVI improve those rankings based on two measures of importance provided by the Random Forest. We also prove with a simulation study that KVI is very powerful for detecting genetic interactions.
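KVI itself is defined in the paper, but its leave-one-out spirit — scoring a SNP by how much the joint kernel fit degrades without it — can be sketched with an off-the-shelf kernel model. The genotypes, kernel choice, and planted interaction below are all hypothetical, not the authors' statistic:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 300
X = rng.integers(0, 3, size=(n, 5)).astype(float)  # mock genotypes 0/1/2
# outcome driven by an interaction between SNPs 0 and 1 only
y = X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=n)

def kernel_fit(cols):
    """Cross-validated R^2 of a kernel model built on a subset of SNPs."""
    model = KernelRidge(kernel="rbf", gamma=0.1, alpha=1.0)
    return cross_val_score(model, X[:, cols], y, cv=5, scoring="r2").mean()

full = kernel_fit([0, 1, 2, 3, 4])
for j in range(5):
    # importance of SNP j: loss of the joint fit when it is left out
    reduced = kernel_fit([c for c in range(5) if c != j])
    print(j, round(full - reduced, 3))
```

Because the kernel captures the joint (nonlinear) effect, removing either interacting SNP degrades the fit, while a univariate test of each SNP alone could miss the interaction entirely.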
|
36
|
Predicting Pre-planting Risk of Stagonospora nodorum blotch in Winter Wheat Using Machine Learning Models. FRONTIERS IN PLANT SCIENCE 2016; 7:390. [PMID: 27064542 PMCID: PMC4812805 DOI: 10.3389/fpls.2016.00390] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/10/2015] [Accepted: 03/14/2016] [Indexed: 05/06/2023]
Abstract
Pre-planting factors have been associated with the late-season severity of Stagonospora nodorum blotch (SNB), caused by the fungal pathogen Parastagonospora nodorum, in winter wheat (Triticum aestivum). The relative importance of these factors in the risk of SNB has not been determined, and this knowledge could facilitate disease management decisions prior to planting of the wheat crop. In this study, we examined the performance of multiple regression (MR) and three machine learning algorithms, namely artificial neural networks, categorical and regression trees, and random forests (RF), in predicting the pre-planting risk of SNB in wheat. Pre-planting factors tested as potential predictor variables were cultivar resistance, latitude, longitude, previous crop, seeding rate, seed treatment, tillage type, and wheat residue. Disease severity assessed at the end of the growing season was used as the response variable. The models were developed using 431 disease cases (unique combinations of predictors) collected from 2012 to 2014, and these cases were randomly divided into training, validation, and test datasets. Models were evaluated based on the regression of observed against predicted severity values of SNB, sensitivity-specificity ROC analysis, and the Kappa statistic. A strong relationship was observed between late-season severity of SNB and specific pre-planting factors, with latitude, longitude, wheat residue, and cultivar resistance being the most important predictors. The MR model explained 33% of the variability in the data, while the machine learning models explained 47 to 79% of the total variability. Similarly, the MR model correctly classified 74% of the disease cases, while the machine learning models correctly classified 81 to 83% of these cases. The RF algorithm, which explained 79% of the variability within the data, was the most accurate in predicting the risk of SNB, with an accuracy rate of 93%. The RF algorithm could allow early assessment of the risk of SNB, facilitating sound disease management decisions prior to planting of wheat.
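The evaluation pipeline described above — train/test split, random forest classification, and agreement beyond chance via the Kappa statistic — looks roughly like the following sketch on simulated data (predictors and effect sizes are invented, not the study's variables):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n = 431  # matches the study's number of disease cases; data are invented
X = rng.normal(size=(n, 8))  # eight mock pre-planting predictors
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
kappa = cohen_kappa_score(y_te, rf.predict(X_te))  # agreement beyond chance
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
print(round(kappa, 2), round(auc, 2))
```

Kappa complements raw accuracy because it discounts the agreement expected by chance, which matters when disease and non-disease cases are imbalanced.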
|
37
|
The Random Forests statistical technique: An examination of its value for the study of reading. SCIENTIFIC STUDIES OF READING : THE OFFICIAL JOURNAL OF THE SOCIETY FOR THE SCIENTIFIC STUDY OF READING 2016; 20:20-33. [PMID: 26770056 PMCID: PMC4710485 DOI: 10.1080/10888438.2015.1107073] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/22/2023]
Abstract
Studies investigating individual differences in reading ability often involve data sets containing a large number of collinear predictors and a small number of observations. In this paper, we discuss the Random Forests method and demonstrate its suitability for addressing the statistical concerns raised by such data sets. The method is contrasted with other methods of estimating relative variable importance, especially Dominance Analysis and Multimodel Inference. All methods were applied to a data set comprising eye movements recorded during reading, offline comprehension scores, and multiple ability measures with high collinearity due to their shared verbal core. We demonstrate that the Random Forests method surpasses the other methods in its ability to handle model overfitting and accounts for a comparable or larger amount of variance in reading measures.
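Permutation-based Random Forest importance, the kind of measure contrasted here with Dominance Analysis and Multimodel Inference, can be demonstrated in a few lines on a simulated small-n data set where a shared "verbal core" drives the collinearity (all quantities invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
n = 120  # small-n setting typical of individual-differences studies
core = rng.normal(size=n)  # a shared "verbal core" ability
# three collinear ability measures plus one unrelated measure
X = np.column_stack([core + 0.3 * rng.normal(size=n) for _ in range(3)]
                    + [rng.normal(size=n)])
y = core + 0.5 * rng.normal(size=n)  # mock reading outcome

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
imp = permutation_importance(rf, X, y, n_repeats=30, random_state=0)
print(imp.importances_mean.round(3))
```

With collinear predictors the importance is shared across the trio, which is exactly the behavior that motivates comparing forests against dominance-style decompositions.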
|
38
|
Modeling X Chromosome Data Using Random Forests: Conquering Sex Bias. Genet Epidemiol 2015; 40:123-32. [PMID: 26639183 DOI: 10.1002/gepi.21946] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2015] [Revised: 10/29/2015] [Accepted: 10/29/2015] [Indexed: 12/12/2022]
Abstract
Machine learning methods, including Random Forests (RF), are increasingly used for genetic data analysis. However, the standard RF algorithm does not correctly model the effects of X chromosome single nucleotide polymorphisms (SNPs), leading to biased estimates of variable importance. We propose extensions of RF to correctly model X SNPs, including a stratified approach and an approach based on the process of X chromosome inactivation. We applied the new and standard RF approaches to case-control alcohol dependence data from the Study of Addiction: Genes and Environment (SAGE), and compared the performance of the alternative approaches via a simulation study. Standard RF applied to a case-control study of alcohol dependence yielded inflated variable importance estimates for X SNPs, even when sex was included as a variable, but the results of the new RF methods were consistent with univariate regression-based approaches that correctly model X chromosome data. Simulations showed that the new RF methods eliminate the bias in standard RF variable importance for X SNPs when sex is associated with the trait, and are able to detect causal autosomal and X SNPs. Even in the absence of sex effects, the new extensions perform similarly to standard RF. Thus, we provide a powerful multimarker approach for genetic analysis that accommodates X chromosome data in an unbiased way. This method is implemented in the freely available R package "snpRF" (http://www.cran.r-project.org/web/packages/snpRF/).
|
39
|
Estimate variable importance for recurrent event outcomes with an application to identify hypoglycemia risk factors. Stat Med 2015; 34:2743-54. [PMID: 25908216 DOI: 10.1002/sim.6516] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2014] [Revised: 02/22/2015] [Accepted: 04/06/2015] [Indexed: 11/10/2022]
Abstract
Recurrent event data are an important data type in medical research. In particular, many safety endpoints are recurrent outcomes, such as hypoglycemic events. In such settings, it is important to identify the factors causing these events and to rank them by importance. Traditional model selection methods are unable to provide variable importance in this context, and methods that can evaluate variable importance, such as gradient boosting and random forest algorithms, cannot be applied directly to recurrent event data. In this paper, we propose a two-step method that enables the evaluation of variable importance for recurrent event data. We evaluated the performance of the proposed method by simulation and applied it to a data set from a diabetes study.
|
40
|
Abstract
In most experimental and observational studies, participants are not followed in continuous time. Instead, data are collected about participants only at certain monitoring times. These monitoring times are random and often participant specific. As a result, outcomes are known only up to random time intervals, resulting in interval-censored data. In practice, when estimating variable importance measures on interval-censored outcomes, practitioners often ignore the presence of interval censoring, instead treating the data as continuous or right-censored and applying ad hoc approaches to mask the true interval censoring. In this article, we describe targeted minimum loss-based estimation (TMLE) methods tailored for estimation of binary variable importance measures with interval-censored outcomes. We demonstrate the performance of the interval-censored TMLE procedure through simulation studies and apply the method to analyze the effects of a variety of variables on spontaneous hepatitis C virus clearance among injection drug users, using data from the "International Collaboration of Incident HIV and HCV in Injecting Cohorts" project.
|
41
|
Abstract
In the life sciences, 'omics' data are increasingly generated by different high-throughput technologies. Often only the integration of these data allows uncovering biological insights that can be experimentally validated or mechanistically modelled; i.e., sophisticated computational approaches are required to extract the complex nonlinear trends present in omics data. Classification techniques allow training a model based on variables (e.g., SNPs in genetic association studies) to separate different classes (e.g., healthy subjects versus patients). Random Forest (RF) is a versatile classification algorithm suited to the analysis of these large data sets. In the life sciences, RF is popular because RF classification models have high prediction accuracy and provide information on the importance of variables for classification. For omics data, variables or conditional relations between variables are typically important for a subset of samples of the same class. For example, within a class of cancer patients, certain SNP combinations may be important for a subset of patients with a specific cancer subtype but not for a different subset of patients. These conditional relationships can in principle be uncovered from the data with RF, as they are implicitly taken into account by the algorithm during creation of the classification model. This review details RF properties that are, to the best of our knowledge, rarely or never used, and that allow maximizing the biological insights that can be extracted from complex omics data sets using RF.
|
42
|
A novel approach to prediction of mild obstructive sleep disordered breathing in a population-based sample: the Sleep Heart Health Study. Sleep 2011; 33:1641-8. [PMID: 21120126 DOI: 10.1093/sleep/33.12.1641] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
This manuscript considers a data-mining approach for the prediction of mild obstructive sleep disordered breathing, defined as an elevated respiratory disturbance index (RDI), in 5,530 participants in a community-based study, the Sleep Heart Health Study. The prediction algorithm was built using modern ensemble learning algorithms, specifically boosting, which allowed for assessing potential high-dimensional interactions between predictor variables or classifiers. To evaluate the performance of the algorithm, the data were split into training and validation sets for varying thresholds for predicting the probability of a high RDI (≥7 events per hour in the given results). Based on a moderate classification threshold from the boosting algorithm, the estimated post-test odds of a high RDI were 2.20 times higher than the pre-test odds given a positive test, while the corresponding post-test odds were decreased by 52% given a negative test (sensitivity and specificity of 0.66 and 0.70, respectively). In rank order, the following variables had the largest impact on prediction performance: neck circumference, body mass index, age, snoring frequency, waist circumference, and snoring loudness.
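The post-test-odds arithmetic in the abstract follows directly from sensitivity and specificity via likelihood ratios: post-test odds = pre-test odds × LR+, with LR+ = sensitivity / (1 − specificity). A sketch with a boosted classifier on simulated data (the predictors and effects are invented, not SHHS variables):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
n = 2000
X = rng.normal(size=(n, 6))  # six mock predictors
y = (0.9 * X[:, 0] + 0.6 * X[:, 1] + rng.normal(size=n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pos = gb.predict_proba(X_te)[:, 1] >= 0.5  # a moderate threshold

sens = (pos & (y_te == 1)).sum() / (y_te == 1).sum()
spec = (~pos & (y_te == 0)).sum() / (y_te == 0).sum()
lr_pos = sens / (1 - spec)  # positive test multiplies pre-test odds by LR+
lr_neg = (1 - sens) / spec  # negative test shrinks pre-test odds by LR-
print(round(sens, 2), round(spec, 2), round(lr_pos, 2), round(lr_neg, 2))
```

With the abstract's reported sensitivity 0.66 and specificity 0.70, LR+ = 0.66 / 0.30 = 2.2 and LR- = 0.34 / 0.70 ≈ 0.48, matching the stated 2.20-fold increase and 52% decrease in post-test odds.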
|
43
|
An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat 2010; 6:Article 18. [PMID: 21731530 PMCID: PMC3126668 DOI: 10.2202/1557-4679.1182] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
A concrete example of the collaborative double-robust targeted likelihood estimator (C-TMLE) introduced in a companion article in this issue is presented, and applied to the estimation of causal effects and variable importance parameters in genomic data. The focus is on non-parametric estimation in a point treatment data structure. Simulations illustrate the performance of C-TMLE relative to current competitors such as the augmented inverse probability of treatment weighted estimator that relies on an external non-collaborative estimator of the treatment mechanism, and inefficient estimation procedures including propensity score matching and standard inverse probability of treatment weighting. C-TMLE is also applied to the estimation of the covariate-adjusted marginal effect of individual HIV mutations on resistance to the anti-retroviral drug lopinavir. The influence curve of the C-TMLE is used to establish asymptotically valid statistical inference. The list of mutations found to have a statistically significant association with resistance is in excellent agreement with mutation scores provided by the Stanford HIVdb mutation scores database.
|
44
|
Abstract
Collaborative double robust targeted maximum likelihood estimators represent a fundamental further advance over standard targeted maximum likelihood estimators of a pathwise differentiable parameter of a data generating distribution in a semiparametric model, introduced in van der Laan, Rubin (2006). The targeted maximum likelihood approach involves fluctuating an initial estimate of a relevant factor (Q) of the density of the observed data, in order to make a bias/variance tradeoff targeted towards the parameter of interest. The fluctuation involves estimation of a nuisance parameter portion of the likelihood, g. TMLE has been shown to be consistent and asymptotically normally distributed (CAN) under regularity conditions, when either one of these two factors of the likelihood of the data is correctly specified, and it is semiparametric efficient if both are correctly specified. In this article we provide a template for applying collaborative targeted maximum likelihood estimation (C-TMLE) to the estimation of pathwise differentiable parameters in semi-parametric models. The procedure creates a sequence of candidate targeted maximum likelihood estimators based on an initial estimate for Q coupled with a succession of increasingly non-parametric estimates for g. In a departure from current state of the art nuisance parameter estimation, C-TMLE estimates of g are constructed based on a loss function for the targeted maximum likelihood estimator of the relevant factor Q that uses the nuisance parameter to carry out the fluctuation, instead of a loss function for the nuisance parameter itself. Likelihood-based cross-validation is used to select the best estimator among all candidate TMLE estimators of Q(0) in this sequence. A penalized-likelihood loss function for Q is suggested when the parameter of interest is borderline-identifiable. 
We present theoretical results for "collaborative double robustness," demonstrating that the collaborative targeted maximum likelihood estimator is CAN even when Q and g are both mis-specified, provided that g solves a specified score equation implied by the difference between Q and the true Q(0). This marks an improvement over the current definition of double robustness in the estimating equation literature. We also establish an asymptotic linearity theorem for the C-DR-TMLE of the target parameter, showing that the C-DR-TMLE is more adaptive to the truth and, as a consequence, can even be super efficient if the first-stage density estimator does an excellent job with respect to the target parameter. This research provides a template for targeted, efficient, and robust loss-based learning of a particular target feature of the probability distribution of the data within large (infinite-dimensional) semi-parametric models, while still providing statistical inference in terms of confidence intervals and p-values. It also breaks with a taboo (e.g., in the propensity score literature in the field of causal inference) on using the relevant part of the likelihood to fine-tune the fitting of the nuisance parameter/censoring mechanism/treatment mechanism.
|
45
|
Biomarker discovery using targeted maximum-likelihood estimation: application to the treatment of antiretroviral-resistant HIV infection. Stat Med 2009; 28:152-72. [PMID: 18825650 PMCID: PMC4107931 DOI: 10.1002/sim.3414] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Researchers in clinical science and bioinformatics frequently aim to learn which of a set of candidate biomarkers is important in determining a given outcome, and to rank the contributions of the candidates accordingly. This article introduces a new approach to research questions of this type, based on targeted maximum-likelihood estimation of variable importance measures. The methodology is illustrated using an example drawn from the treatment of HIV infection. Specifically, given a list of candidate mutations in the protease enzyme of HIV, we aim to discover mutations that reduce clinical virologic response to antiretroviral regimens containing the protease inhibitor lopinavir. In the context of this data example, the article reviews the motivation for covariate adjustment in the biomarker discovery process. A standard maximum-likelihood approach to this adjustment is compared with the targeted approach introduced here. Implementation of targeted maximum-likelihood estimation in the context of biomarker discovery is discussed, and the advantages of this approach are highlighted. Results of applying targeted maximum-likelihood estimation to identify lopinavir resistance mutations are presented and compared with results based on unadjusted mutation-outcome associations as well as results of a standard maximum-likelihood approach to adjustment. The subset of mutations identified by targeted maximum likelihood as significant contributors to lopinavir resistance is found to be in better agreement with the current understanding of HIV antiretroviral resistance than the corresponding subsets identified by the other two approaches. This finding suggests that targeted estimation of variable importance represents a promising approach to biomarker discovery.
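The motivation for covariate adjustment in biomarker ranking can be seen in a toy example: an unadjusted mutation-outcome association can be driven entirely by a correlated covariate, while an adjusted estimate is not. The sketch below uses plain G-computation as a simplified stand-in for the paper's targeted maximum-likelihood adjustment; all data and effect sizes are invented:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2000
W = rng.integers(0, 2, size=n)                    # background covariate
A = (rng.random(n) < 0.2 + 0.6 * W).astype(int)   # candidate mutation, tracks W
Y = (rng.random(n) < 0.2 + 0.3 * W).astype(int)   # outcome driven by W only

# unadjusted association: raw difference in outcome rate by mutation status
unadj = Y[A == 1].mean() - Y[A == 0].mean()

# adjusted importance via G-computation: fit an outcome model, then average
# the predicted difference with the mutation switched on versus off
model = LogisticRegression().fit(np.column_stack([A, W]), Y)
p1 = model.predict_proba(np.column_stack([np.ones(n), W]))[:, 1]
p0 = model.predict_proba(np.column_stack([np.zeros(n), W]))[:, 1]
adj = (p1 - p0).mean()
print(round(unadj, 3), round(adj, 3))
```

Here the unadjusted difference is inflated by the confounding covariate while the adjusted estimate is near zero; TMLE refines this plug-in idea with a targeting step that improves bias and yields valid inference.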
|