Reference Citation Analysis

For: Fletcher Mercaldo S, Blume JD. Missing data and prediction: the pattern submodel. Biostatistics 2020;21:236-252. [PMID: 30203058 PMCID: PMC7868046 DOI: 10.1093/biostatistics/kxy040] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 03/28/2017] [Revised: 07/02/2018] [Accepted: 07/09/2018] [Indexed: 11/19/2022] Open



Cited by Other Article(s)

D'Agostino McGowan L, Lotspeich SC, Hepler SA. The "Why" behind including "Y" in your imputation model. Stat Methods Med Res 2024:9622802241244608. [PMID: 38625810 DOI: 10.1177/09622802241244608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2024]

Cheng P, Yang P, Zhang H, Wang H. Prediction Models for Return of Spontaneous Circulation in Patients with Cardiac Arrest: A Systematic Review and Critical Appraisal. Emerg Med Int 2023;2023:6780941. [PMID: 38035124 PMCID: PMC10684323 DOI: 10.1155/2023/6780941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 04/23/2023] [Accepted: 11/04/2023] [Indexed: 12/02/2023] Open

Abstract

Objectives

Prediction models for the return of spontaneous circulation (ROSC) in patients with cardiac arrest help physicians evaluate survival probability and provide a reference for medical decision-making. Although relevant models have been developed, their methodological rigor and applicability remain unclear. Therefore, this study aims to summarize the evidence on ROSC prediction models and provide a reference for their development, validation, and application.

Methods

PubMed, Cochrane Library, Embase, Elsevier, Web of Science, SpringerLink, Ovid, CNKI, Wanfang, and SinoMed were systematically searched for studies on ROSC prediction models, from the inception of each database to August 30, 2022. Two reviewers independently screened the literature and extracted the data. PROBAST was used to evaluate the quality of the included literature.

Results

A total of 8 relevant prediction models were included; 6 models reported an AUC of 0.662-0.830 in the modeling population. The models showed good overall applicability but a high risk of bias. The main reasons were improper handling of missing values and variable screening, lack of external validation, and insufficient information on overfitting. Age, gender, etiology, initial heart rhythm, EMS arrival time/BLS intervention time, location, bystander CPR, whether the arrest was witnessed, and ACLS duration/compression duration were the most commonly included predictors. Obvious chest injury, body temperature below 33°C, and possible etiologies were predictive factors for ROSC failure in patients with TOHCA. Age, gender, initial heart rhythm, reason for the hospital visit, length of hospital stay, and the in-hospital location of the arrest were the predictors of ROSC in IHCA patients.

Conclusion

The performance of current ROSC prediction models varies greatly, and the models carry a high risk of bias, so they should be selected with caution. Future studies can further optimize and externally validate the existing models.


Godfrey CM, Shipe ME, Welty VF, Maiga AW, Aldrich MC, Montgomery C, Crockett J, Vaszar LT, Regis S, Isbell JM, Rickman OB, Pinkerman R, Lambright ES, Nesbitt JC, Maldonado F, Blume JD, Deppen SA, Grogan EL. The Thoracic Research Evaluation and Treatment 2.0 Model: A Lung Cancer Prediction Model for Indeterminate Nodules Referred for Specialist Evaluation. Chest 2023;164:1305-1314. [PMID: 37421973 PMCID: PMC10635839 DOI: 10.1016/j.chest.2023.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Revised: 05/03/2023] [Accepted: 06/01/2023] [Indexed: 07/10/2023] Open

Abstract

BACKGROUND

Appropriate risk stratification of indeterminate pulmonary nodules (IPNs) is necessary to direct diagnostic evaluation. Currently available models were developed in populations with lower cancer prevalence than that seen in thoracic surgery and pulmonology clinics and usually do not allow for missing data. We updated and expanded the Thoracic Research Evaluation and Treatment (TREAT) model into a more generalized, robust approach for lung cancer prediction in patients referred for specialty evaluation.

RESEARCH QUESTION

Can clinic-level differences in nodule evaluation be incorporated to improve lung cancer prediction accuracy in patients seeking immediate specialty evaluation compared with currently available models?

STUDY DESIGN AND METHODS

Clinical and radiographic data on patients with IPNs from six sites (N = 1,401) were collected retrospectively and divided into groups by clinical setting: pulmonary nodule clinic (n = 374; cancer prevalence, 42%), outpatient thoracic surgery clinic (n = 553; cancer prevalence, 73%), or inpatient surgical resection (n = 474; cancer prevalence, 90%). A new prediction model was developed using a missing data-driven pattern submodel approach. Discrimination and calibration were estimated with cross-validation and were compared with the original TREAT, Mayo Clinic, Herder, and Brock models. Reclassification was assessed with bias-corrected clinical net reclassification index and reclassification plots.

RESULTS

Two-thirds of patients had missing data; nodule growth and fluorodeoxyglucose-PET scan avidity were missing most frequently. The TREAT version 2.0 mean area under the receiver operating characteristic curve across missingness patterns was 0.85 compared with that of the original TREAT (0.80), Herder (0.73), Mayo Clinic (0.72), and Brock (0.68) models with improved calibration. The bias-corrected clinical net reclassification index was 0.23.

INTERPRETATION

The TREAT 2.0 model is more accurate and better calibrated for predicting lung cancer in high-risk IPNs than the Mayo, Herder, or Brock models. Nodule calculators such as TREAT 2.0 that account for varied lung cancer prevalence and that consider missing data may provide more accurate risk stratification for patients seeking evaluation at specialty nodule evaluation clinics.
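The "missing data-driven pattern submodel approach" named in the methods can be illustrated with a minimal sketch. It assumes the simplest form of the idea: group records by which predictors are missing, fit one submodel per pattern using only the predictors observed in that pattern, and route each new record to the submodel for its own pattern. This is not the TREAT 2.0 implementation. Ordinary least squares stands in for whatever model the authors fitted (a binary cancer outcome would more naturally use logistic regression), and the function names and toy data are hypothetical.

```python
from collections import defaultdict


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least squares with an intercept, via the normal equations."""
    Z = [[1.0] + row for row in X]
    p = len(Z[0])
    XtX = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    Xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve(XtX, Xty)


def pattern_of(row):
    """Missingness pattern: True where a predictor is missing (None)."""
    return tuple(v is None for v in row)


def fit_pattern_submodels(rows, y):
    """Fit one submodel per missingness pattern on that pattern's rows only."""
    groups = defaultdict(list)
    for row, yi in zip(rows, y):
        groups[pattern_of(row)].append((row, yi))
    models = {}
    for pattern, members in groups.items():
        X = [[v for v in row if v is not None] for row, _ in members]
        models[pattern] = fit_ols(X, [yi for _, yi in members])
    return models


def predict(models, row):
    """Predict with the submodel that matches the row's own pattern."""
    beta = models[pattern_of(row)]
    observed = [v for v in row if v is not None]
    return beta[0] + sum(b * v for b, v in zip(beta[1:], observed))


# Hypothetical toy data: two predictors, the second sometimes missing.
rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
        [0.0, None], [1.0, None], [2.0, None]]
y = [1.0, 3.0, 4.0, 6.0, 5.0, 6.0, 7.0]
models = fit_pattern_submodels(rows, y)
```

No imputation is needed at prediction time, because a record with a missing predictor is scored by a submodel that never used that predictor. In this sketch, a record whose pattern was absent from the development data raises a `KeyError`; a practical tool would need a fallback for that case.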


Affiliation(s)

Caroline M Godfrey, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Maren E Shipe, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Valerie F Welty, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Amelia W Maiga, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
Melinda C Aldrich, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
Chandler Montgomery, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
Jerod Crockett, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Laszlo T Vaszar, Department of Pulmonary Medicine, Mayo Clinic, Phoenix, AZ
Shawn Regis, Department of Radiation Oncology, Lahey Hospital and Medical Center, Burlington, MA
James M Isbell, Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY
Otis B Rickman, Division of Pulmonary Medicine, Vanderbilt University Medical Center, Nashville, TN
Rhonda Pinkerman, Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
Eric S Lambright, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Jonathan C Nesbitt, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
Fabien Maldonado, Division of Pulmonary Medicine, Vanderbilt University Medical Center, Nashville, TN
Jeffrey D Blume, School of Data Science, University of Virginia, Charlottesville, VA
Stephen A Deppen, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
Eric L Grogan, Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN


Ren B, Lipsitz SR, Weiss RD, Fitzmaurice GM. Multiple imputation for non-monotone missing not at random data using the no self-censoring model. Stat Methods Med Res 2023;32:1973-1993. [PMID: 37647237 DOI: 10.1177/09622802231188520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]

Rodriguez PJ, Heagerty PJ, Clark S, Khor S, Chen Y, Haupt E, Hahn EE, Shankaran V, Bansal A. Using Machine Learning to Leverage Biomarker Change and Predict Colorectal Cancer Recurrence. JCO Clin Cancer Inform 2023;7:e2300066. [PMID: 37963310 PMCID: PMC10681492 DOI: 10.1200/cci.23.00066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 06/12/2023] [Accepted: 07/12/2023] [Indexed: 11/16/2023] Open

Abstract

PURPOSE

The risk of colorectal cancer (CRC) recurrence after primary treatment varies across individuals and over time. Using patients' most up-to-date information, including carcinoembryonic antigen (CEA) biomarker profiles, to predict risk could improve personalized decision making.

METHODS

We used electronic health record data from an integrated health system on a cohort of patients diagnosed with American Joint Committee on Cancer stage I-III CRC between 2008 and 2013 (N = 3,970) and monitored until recurrence or end of follow-up. We addressed missingness in recurrence outcomes and longitudinal CEA measures, and engineered CEA features using current and past biomarker values for inclusion in a risk prediction model. We used a discrete time Superlearner model to evaluate various algorithms for predicting recurrence. We evaluated the time-varying discrimination and calibration of the algorithms and assessed the role of individual predictors.

RESULTS

Recurrence was documented in 448 (11.3%) patients. XGBoost with depth = 1 (XGB-D1) predicted recurrence substantially better than all other algorithms at all time points, with AUC ranging from 0.87 (95% CI, 0.86 to 0.88) at 6 months to 0.94 (95% CI, 0.92 to 0.96) at 54 months. The only variable used by XGB-D1 was 6-month change in log CEA. Predicted 1-year risk of recurrence was nearly zero for patients whose log CEA did not increase in the last 6 months, between 12.2% and 34.1% for patients whose log CEA increased between 0.10 and 0.40, and 43.6% for those with a log CEA increase >0.40. Compared with XGB, penalized regression approaches (lasso, ridge, and elastic net) performed poorly, with AUCs ranging from 0.58 to 0.69.
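The single predictor described above, the 6-month change in log CEA, and the risk bands quoted for it can be written out as a small lookup. This is only a restatement of the reported figures under stated assumptions, not the fitted XGB-D1 model: the natural logarithm is assumed because the abstract does not give the base, "did not increase" is read as a change of zero or less, and the function names are hypothetical.

```python
import math


def log_cea_change(cea_now, cea_six_months_ago):
    """Six-month change in log CEA (natural log assumed, base not stated)."""
    return math.log(cea_now) - math.log(cea_six_months_ago)


def reported_one_year_risk(delta):
    """Return the 1-year recurrence risk band quoted in the abstract."""
    if delta <= 0:
        return "nearly zero"
    if 0.10 <= delta <= 0.40:
        return "12.2% to 34.1%"
    if delta > 0.40:
        return "43.6%"
    # Increases between 0 and 0.10 are not covered by the abstract.
    return "not reported in the abstract"
```

A boosted model with depth 1 splits on one variable at a time, which fits the abstract's description of a simple, threshold-like relationship between this one feature and risk.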

CONCLUSION

A flexible, machine learning approach that incorporated longitudinal CEA information yielded a simple and high-performing model for predicting recurrence on the basis of 6-month change in log CEA.


Sisk R, Sperrin M, Peek N, van Smeden M, Martin GP. Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study. Stat Methods Med Res 2023;32:1461-1477. [PMID: 37105540 PMCID: PMC10515473 DOI: 10.1177/09622802231165001] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 04/29/2023]

Abstract

Background

In clinical prediction modelling, missing data can occur at any stage of the model pipeline: development, validation or deployment. Multiple imputation is often recommended yet challenging to apply at deployment; for example, the outcome cannot be in the imputation model, as recommended under multiple imputation. Regression imputation uses a fitted model to impute the predicted value of missing predictors from observed data, and could offer a pragmatic alternative at deployment. Moreover, the use of missing indicators has been proposed to handle informative missingness, but it is currently unknown how well this method performs in the context of clinical prediction models.

Methods

We simulated data under various missing data mechanisms to compare the predictive performance of clinical prediction models developed using both imputation methods. We consider deployment scenarios where missing data is permitted or prohibited, imputation models that use or omit the outcome, and clinical prediction models that include or omit missing indicators. We assume that the missingness mechanism remains constant across the model pipeline. We also apply the proposed strategies to critical care data.

Results

With complete data available at deployment, our findings were in line with existing recommendations: the outcome should be used to impute development data when using multiple imputation and omitted under regression imputation. When missingness is allowed at deployment, omitting the outcome from the imputation model at development was preferred. Missing indicators improved model performance in many cases but can be harmful under outcome-dependent missingness.

Conclusion

We provide evidence that commonly taught principles of handling missing data via multiple imputation may not apply to clinical prediction models, particularly when data can be missing at deployment. We observed comparable predictive performance under multiple imputation and regression imputation. The performance of the missing data handling method must be evaluated on a study-by-study basis, and the most appropriate strategy for handling missing data at development should consider whether missing data are allowed at deployment. Some guidance is provided.
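The two techniques this abstract compares, regression imputation and missing indicators, can be combined in a short sketch. It assumes a single predictor `x2` that is sometimes missing and is imputed from a fully observed `x1` by simple linear regression fitted on the complete development rows. The outcome is deliberately left out of the imputation model so that the same fit can be reused at deployment. All names and numbers are hypothetical, and this is not the simulation code from the study.

```python
def fit_simple_regression(xs, ys):
    """Least-squares intercept and slope for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def impute_with_indicator(rows, fit=None):
    """Impute missing x2 from x1 and append a 0/1 missing indicator.

    rows: list of (x1, x2) pairs, where x2 may be None.
    fit:  (intercept, slope) from development data; estimated here if None.
    """
    if fit is None:
        complete = [(a, b) for a, b in rows if b is not None]
        fit = fit_simple_regression([a for a, _ in complete],
                                    [b for _, b in complete])
    intercept, slope = fit
    out = []
    for x1, x2 in rows:
        missing = x2 is None
        value = intercept + slope * x1 if missing else x2
        out.append((x1, value, int(missing)))
    return out, fit


# Development data: estimate the imputation model, then impute.
dev_rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, None)]
dev_out, fit = impute_with_indicator(dev_rows)

# Deployment: reuse the stored fit, since no outcome is available here.
new_out, _ = impute_with_indicator([(5.0, None)], fit=fit)
```

The indicator column lets a downstream prediction model learn a separate adjustment for records where `x2` was imputed. The abstract reports that this often helps but can be harmful under outcome-dependent missingness.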


Mao X, Wang Z, Yang S. Matrix completion under complex survey sampling. Ann Inst Stat Math 2023;75:463-492. [PMID: 37645434 PMCID: PMC10465119 DOI: 10.1007/s10463-022-00851-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Revised: 08/12/2022] [Accepted: 08/17/2022] [Indexed: 01/10/2023]

Yu J, Liu X, Zhu Z, Yang Z, He J, Zhang L, Lu H. Prediction models for cardiovascular disease risk among people living with HIV: A systematic review and meta-analysis. Front Cardiovasc Med 2023;10:1138234. [PMID: 37034346 PMCID: PMC10077152 DOI: 10.3389/fcvm.2023.1138234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 03/08/2023] [Indexed: 04/11/2023] Open

Abstract

Background

HIV continues to be a major global health issue. The relative risk of cardiovascular disease (CVD) among people living with HIV (PLWH) was 2.16 compared with people without HIV. The prediction of CVD is becoming an important issue in current HIV management. However, there is no consensus on the optimal CVD risk models for PLWH. Therefore, we aimed to systematically summarize and compare prediction models for CVD risk among PLWH.

Methods

Longitudinal studies that developed or validated prediction models for CVD risk among PLWH were systematically searched. Five databases were searched up to January 2022. The quality of the included articles was evaluated by using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). We applied meta-analysis to pool the logit-transformed C-statistics for discrimination performance.
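Pooling logit-transformed C-statistics can be sketched as follows. The abstract does not state which estimator was used, so this example assumes a DerSimonian-Laird random-effects model, recovers each study's standard error on the logit scale from its reported 95% CI, and back-transforms the pooled estimate. The function name and the input values are hypothetical and are not taken from the review.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(x):
    return 1 / (1 + math.exp(-x))


def pool_c_statistics(studies, z=1.96):
    """Pool C-statistics on the logit scale (DerSimonian-Laird assumed).

    studies: list of (c, ci_low, ci_high) tuples.
    Returns (pooled_c, ci_low, ci_high, i2_percent).
    """
    y = [logit(c) for c, _, _ in studies]
    # Standard error from the CI width on the logit scale, then variance.
    v = [((logit(hi) - logit(lo)) / (2 * z)) ** 2 for _, lo, hi in studies]
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = len(studies) - 1
    c_term = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c_term) if c_term > 0 else 0.0
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return expit(mu), expit(mu - z * se), expit(mu + z * se), i2


# Hypothetical inputs for illustration only.
example = [(0.72, 0.66, 0.78), (0.80, 0.76, 0.84), (0.75, 0.70, 0.80)]
pooled, low, high, i2 = pool_c_statistics(example)
```

Working on the logit scale keeps the pooled estimate and its confidence limits inside the interval from 0 to 1, and I² expresses the share of total variability attributed to between-study heterogeneity.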

Results

Thirteen articles describing 17 models were included. All the included studies had a high risk of bias. In the meta-analysis, the pooled estimated C-statistic was 0.76 (95% CI: 0.72-0.81, I² = 84.8%) for the Data collection on Adverse Effects of Anti-HIV Drugs Study risk equation (D:A:D) (2010), 0.75 (95% CI: 0.70-0.79, I² = 82.4%) for the D:A:D (2010) 10-year risk version, 0.77 (95% CI: 0.74-0.80, I² = 82.2%) for the full D:A:D (2016) model, 0.74 (95% CI: 0.68-0.79, I² = 86.2%) for the reduced D:A:D (2016) model, 0.71 (95% CI: 0.61-0.79, I² = 87.9%) for the Framingham Risk Score (FRS) for coronary heart disease (CHD) (1998), 0.74 (95% CI: 0.70-0.78, I² = 87.8%) for the FRS CVD model (2008), 0.72 (95% CI: 0.67-0.76, I² = 75.0%) for the pooled cohort equations (PCE) of the American College of Cardiology/American Heart Association, and 0.67 (95% CI: 0.56-0.77, I² = 51.3%) for the Systematic COronary Risk Evaluation (SCORE). In the subgroup analysis, the discrimination of PCE was significantly better in the group aged ≤40 years than in the group aged 40-45 years (P = 0.024) and the group aged ≥45 years (P = 0.010). No models were developed or validated in Sub-Saharan Africa and the Asia region.

Conclusions

The full D:A:D (2016) model performed the best in terms of discrimination, followed by the D:A:D (2010) and PCE. However, there were no significant differences between any of the model pairings. Specific CVD risk models for older PLWH and for PLWH in Sub-Saharan Africa and the Asia region should be established.

Systematic Review Registration: PROSPERO CRD42022322024.


Du M, Haag DG, Lynch JW, Mittinty MN. Application of multilevel models for predicting pain following root canal treatment. Community Dent Oral Epidemiol 2022;51:418-427. [PMID: 36510289 DOI: 10.1111/cdoe.12807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2021] [Revised: 09/17/2022] [Accepted: 10/17/2022] [Indexed: 12/14/2022]

Abstract

OBJECTIVES

This study developed predictive models for one-week acute and six-month persistent pain following root canal treatment (RCT). An additional aim was to study the gain in predictive efficacy of models containing clinical factors only, over models containing sociodemographic characteristics.

METHODS

A secondary data analysis of 708 patients who received RCTs was conducted. Three sets of predictors were used: (1) combined set, containing all predictors in the data set; (2) clinical set and (3) sociodemographic set. Missing data were handled by multiple imputation using the missing indicator method. Multilevel least absolute shrinkage and selection operator (LASSO) regression was used to select predictors for the final multilevel logistic models. Three measures were used to assess the predictive performance of the models: the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), and calibration curves.

RESULTS

The factors selected by LASSO regression for the final models relate to pre- and intra-treatment clinical symptoms and pain experience. Predictive performance of the models remained the same with the inclusion (exclusion) of the sociodemographic factors. For predicting the one-week outcome, the model built with the combined set of predictors yielded the highest AUROC and AUPRC, 0.85 and 0.72, followed by the model built with clinical factors (AUROC = 0.82, AUPRC = 0.66). The lowest predictive ability was found in models with only sociodemographic characteristics (AUROC = 0.68, AUPRC = 0.40). Similar patterns were observed in predicting the six-month outcome, where the AUROCs for models with combined, clinical and sociodemographic sets of predictors were 0.85, 0.89 and 0.66, respectively, and the AUPRCs were 0.48, 0.53 and 0.22, respectively.

CONCLUSIONS

Clinical factors such as the severity and experience of pre-operative and intra-operative pain were found to be important to the subsequent development of pain following RCTs. Adding sociodemographic characteristics to the models with clinical factors did not change the models' predictive performance or the proportion of explained variance.


Shen WC, Chiang HY, Chen PS, Lin YT, Kuo CC, Wu PY. Risk of All-Cause Mortality, Cardiovascular Disease Mortality, and Cancer Mortality in Patients With Bullous Pemphigoid. JAMA Dermatol 2022;158:167-175. [PMID: 34964804 PMCID: PMC8717210 DOI: 10.1001/jamadermatol.2021.5125] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/04/2021] [Accepted: 10/18/2021] [Indexed: 12/31/2022]

Abstract

IMPORTANCE

The role of bullous pemphigoid (BP) in cardiovascular disease (CVD) mortality remains controversial, and analyses of causes of death among patients with BP based on individual data remain lacking.

OBJECTIVE

To evaluate the risk of all-cause mortality, CVD mortality, and cancer mortality in patients with BP.

DESIGN, SETTING, AND PARTICIPANTS

This cohort study identified patients who received a diagnosis of and treatment for BP during their dermatology clinic visits at a tertiary medical center in central Taiwan between January 1, 2007, and December 31, 2017. Controls were patients without BP and were individually matched to cases (4:1) according to age, sex, and date of the dermatology clinic visit. Data were analyzed from March 6, 2019, to April 2, 2021.

EXPOSURES

Bullous pemphigoid was confirmed pathologically with typical direct immunofluorescence findings or clinically with typical clinical presentation, positive findings of an anti-basement membrane zone antibody test, and corticosteroid use for at least 28 cumulative days.

MAIN OUTCOMES AND MEASURES

Mortality outcomes confirmed by the National Death Registry.

RESULTS

Of 252 patients with BP and 1008 matched control patients (N = 1260), 685 (54.4%) were men and the median age was 78.0 (IQR, 70.3-84.8) years. Patients with BP had higher CVD mortality at 1 year (20 [7.9%] vs 13 [1.3%]), 3 years (28 [11.1%] vs 24 [2.4%]), and 5 years (31 [12.3%] vs 39 [3.9%]) compared with matched control patients. After adjusting for potential confounding variables, patients with BP had a 5-fold higher risk of CVD mortality at 1 year (hazard ratio [HR], 5.29 [95% CI, 2.40-11.68]), 3 years (HR, 5.79 [95% CI, 3.11-10.78]), and 5 years (HR, 4.95 [95% CI, 2.88-8.51]). Subgroup analyses revealed that the CVD mortality risk associated with BP was higher in patients without a history of hypertension (HR, 7.28 [95% CI, 3.87-13.69]) or CVD (HR, 6.59 [95% CI, 3.40-12.79]) and in patients without prior diuretic use (HR, 5.75 [95% CI, 3.15-10.50]) compared with matched control patients. In addition, all-cause mortality associated with BP was higher in patients without prior corticosteroid use than in control patients (HR 5.65 [95% CI, 4.19-7.61]).

CONCLUSIONS AND RELEVANCE

The findings of this cohort study suggest that BP was associated with a 5-fold higher risk of CVD mortality, particularly in patients without underlying hypertension or CVD or those without prior corticosteroid or diuretic use. Future studies should investigate the benefits of routine monitoring and timely management of CVD symptoms and signs in patients with BP.


Berkelmans G, Read S, Gudbjörnsdottir S, Wild S, Franzen S, van der Graaf Y, Eliasson B, Visseren F, Paynter N, Dorresteijn J. Population median imputation was noninferior to complex approaches for imputing missing values in cardiovascular prediction models in clinical practice. J Clin Epidemiol 2022;145:70-80. [DOI: 10.1016/j.jclinepi.2022.01.011] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/10/2021] [Revised: 12/05/2021] [Accepted: 01/17/2022] [Indexed: 02/06/2023]

Page GL, Quintana FA, Müller P. Clustering and Prediction With Variable Dimension Covariates. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1999824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]

Nijman S, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs J, Bots ML, Asselbergs FW, Moons K, Debray T. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol 2021;142:218-229. [PMID: 34798287 DOI: 10.1016/j.jclinepi.2021.11.023] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/01/2021] [Revised: 11/01/2021] [Accepted: 11/10/2021] [Indexed: 12/23/2022]

Abstract

OBJECTIVES

Missing data is a common problem during the development, evaluation, and implementation of prediction models. Although machine learning (ML) methods are often said to be capable of circumventing missing data, it is unclear how these methods are used in medical research. We aim to find out if and how well prediction model studies using machine learning report on their handling of missing data.

STUDY DESIGN AND SETTING

We systematically searched the literature for papers published between 2018 and 2019 that reported primary studies developing and/or validating clinical prediction models using any supervised ML methodology across medical fields. From the retrieved studies, we extracted information about the amount and nature (e.g. missing completely at random, potential reasons for missingness) of missing data and the way they were handled.

RESULTS

We identified 152 machine learning-based clinical prediction model studies. A substantial number of these 152 papers did not report anything on missing data (n = 56/152). A majority (n = 96/152) reported details on the handling of missing data (e.g., methods used), though many of these (n = 46/96) did not report the amount of missingness in the data. In these 96 papers the authors only sometimes reported possible reasons for missingness (n = 7/96) and information about missing data mechanisms (n = 8/96). The most common approach for handling missing data was deletion (n = 65/96), mostly via complete-case analysis (CCA) (n = 43/96). Very few studies used multiple imputation (n = 8/96) or built-in mechanisms such as surrogate splits (n = 7/96) that directly address missing data during the development, validation, or implementation of the prediction model.

CONCLUSION

Though missing values are highly common in any type of medical research, and certainly in research based on routine healthcare data, a majority of the prediction model studies using machine learning do not report sufficient information on the presence and handling of missing data. Strategies in which patient data are simply omitted are unfortunately the most often used, even though this practice is generally advised against and is well known to be likely to cause bias and loss of analytical power in prediction model development and in the predictive accuracy estimates. Prediction model researchers should be much more aware of alternative methodologies to address missing data.


van der Plas-Krijgsman WG, Giardiello D, Putter H, Steyerberg EW, Bastiaannet E, Stiggelbout AM, Mooijaart SP, Kroep JR, Portielje JEA, Liefers GJ, de Glas NA. Development and validation of the PORTRET tool to predict recurrence, overall survival, and other-cause mortality in older patients with breast cancer in the Netherlands: a population-based study. The Lancet Healthy Longevity 2021;2:e704-e711. [PMID: 36098027 DOI: 10.1016/s2666-7568(21)00229-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/21/2021] [Revised: 09/03/2021] [Accepted: 09/06/2021] [Indexed: 12/24/2022] Open

Abstract

BACKGROUND

Current prediction tools for breast cancer outcomes are not tailored to the older patient, in whom competing risk strongly influences treatment effects. We aimed to develop and validate a prediction tool for 5-year recurrence, overall mortality, and other-cause mortality for older patients (aged ≥65 years) with early invasive breast cancer and to estimate individualised expected benefits of adjuvant systemic treatment.

METHODS

We selected surgically treated patients with early invasive breast cancer (stage I-III) aged 65 years or older from the population-based FOCUS cohort in the Netherlands. We developed prediction models for 5-year recurrence, overall mortality, and other-cause mortality using cause-specific Cox proportional hazard models. External validation was performed in a Dutch Cancer registry cohort. Performance was evaluated with discrimination accuracy and calibration plots.

FINDINGS

We included 2744 female patients in the development cohort and 13631 female patients in the validation cohort. Median age was 74·8 years (range 65-98) in the development cohort and 76·0 years (70-101) in the validation cohort. 5-year follow-up was complete for more than 99% of all patients. We observed 343 and 1462 recurrences, and 831 and 3594 deaths, of which 586 and 2565 were without recurrence, in the development and validation cohort, respectively. The area under the receiver-operating-characteristic curve at 5 years in the external dataset was 0·76 (95% CI 0·75-0·76) for overall mortality, 0·76 (0·76-0·77) for recurrence, and 0·75 (0·74-0·75) for other-cause mortality.

INTERPRETATION

The PORTRET tool can accurately predict 5-year recurrence, overall mortality, and other-cause mortality in older patients with breast cancer. The tool can support shared decision making, especially since it provides individualised estimated benefits of adjuvant treatment.

FUNDING

Dutch Cancer Foundation and ZonMw.


Beesley LJ, Bondarenko I, Elliot MR, Kurian AW, Katz SJ, Taylor JM. Multiple imputation with missing data indicators. Stat Methods Med Res 2021;30:2685-2700. [PMID: 34643465 DOI: 10.1177/09622802211047346] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 11/16/2022]

Tsvetanova A, Sperrin M, Peek N, Buchan I, Hyland S, Martin GP. Missing data was handled inconsistently in UK prediction models: a review of method used. J Clin Epidemiol 2021;140:149-158. [PMID: 34520847 DOI: 10.1016/j.jclinepi.2021.09.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 03/15/2021] [Revised: 08/17/2021] [Accepted: 09/07/2021] [Indexed: 10/20/2022]

A simple and efficient incremental missing data imputation method for evolving neo-fuzzy network. Evolving Systems 2021. [DOI: 10.1007/s12530-021-09376-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]

Gravesteijn BY, Sewalt CA, Venema E, Nieboer D, Steyerberg EW. Missing Data in Prediction Research: A Five-Step Approach for Multiple Imputation, Illustrated in the CENTER-TBI Study. J Neurotrauma 2021;38:1842-1857. [PMID: 33470157 DOI: 10.1089/neu.2020.7218] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/21/2022] Open

Sisk R, Lin L, Sperrin M, Barrett JK, Tom B, Diaz-Ordaz K, Peek N, Martin GP. Informative presence and observation in routine health data: A review of methodology for clinical risk prediction. J Am Med Inform Assoc 2021;28:155-166. [PMID: 33164082 PMCID: PMC7810439 DOI: 10.1093/jamia/ocaa242] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/26/2020] [Accepted: 09/17/2020] [Indexed: 12/20/2022] Open

Abstract

Objective

Informative presence (IP) is the phenomenon whereby the presence or absence of patient data is potentially informative with respect to their health condition, with informative observation (IO) being the longitudinal equivalent. These phenomena predominantly exist within routinely collected healthcare data, in which data collection is driven by the clinical requirements of patients and clinicians. The extent to which IP and IO are considered when using such data to develop clinical prediction models (CPMs) is unknown, as is the existing methodology aiming at handling these issues. This review aims to synthesize such existing methodology, thereby helping identify an agenda for future methodological work.

Materials and Methods

A systematic literature search was conducted by 2 independent reviewers using prespecified keywords.

Results

Thirty-six articles were included. We categorized the methods presented within as derived predictors (including some representation of the measurement process as a predictor in the model), modeling under IP, and latent structures. Including missing indicators or summary measures as predictors is the most commonly presented approach amongst the included studies (24 of 36 articles).

Discussion

This is the first review to collate the literature in this area under a prediction framework. A considerable body of relevant literature exists, and we present ways in which the described methods could be developed further. Guidance is required for specifying the conditions under which each method should be used to enable applied prediction modelers to use these methods.

Conclusions

A growing recognition of IP and IO exists within the literature, and methodology is increasingly becoming available to leverage these phenomena for prediction purposes. IP and IO should be approached differently in a prediction context than when the primary goal is explanation. The work included in this review has demonstrated theoretical and empirical benefits of incorporating IP and IO, and therefore we recommend that applied health researchers consider incorporating these methods in their work.

Hoogland J, van Barreveld M, Debray TPA, Reitsma JB, Verstraelen TE, Dijkgraaf MGW, Zwinderman AH. Handling missing predictor values when validating and applying a prediction model to new patients. Stat Med 2020;39:3591-3607. [PMID: 32687233 PMCID: PMC7586995 DOI: 10.1002/sim.8682] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 02/01/2019] [Revised: 05/10/2020] [Accepted: 06/10/2020] [Indexed: 12/23/2022]

Sperrin M, Martin GP. Multiple imputation with missing indicators as proxies for unmeasured variables: simulation study. BMC Med Res Methodol 2020;20:185. [PMID: 32640992 PMCID: PMC7346454 DOI: 10.1186/s12874-020-01068-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/18/2019] [Accepted: 06/28/2020] [Indexed: 01/09/2023] Open

Sperrin M, Martin GP, Sisk R, Peek N. Missing data should be handled differently for prediction than for description or causal explanation. J Clin Epidemiol 2020;125:183-187. [PMID: 32540389 DOI: 10.1016/j.jclinepi.2020.03.028] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 08/19/2019] [Revised: 03/10/2020] [Accepted: 03/18/2020] [Indexed: 12/26/2022]

Kroncke BM, Smith DK, Zuo Y, Glazer AM, Roden DM, Blume JD. A Bayesian method to estimate variant-induced disease penetrance. PLoS Genet 2020;16:e1008862. [PMID: 32569262 PMCID: PMC7347235 DOI: 10.1371/journal.pgen.1008862] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/14/2019] [Revised: 07/09/2020] [Accepted: 05/14/2020] [Indexed: 01/09/2023] Open

Su Y, Chen Y, Tian Z, Lu C, Chen L, Ma X. lncRNAs classifier to accurately predict the recurrence of thymic epithelial tumors. Thorac Cancer 2020;11:1773-1783. [PMID: 32374079 PMCID: PMC7327696 DOI: 10.1111/1759-7714.13439] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/30/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 12/12/2022] Open

Abstract

Background

Long non‐coding RNAs (lncRNAs), which have little or no ability to encode proteins, have attracted special attention due to their potential role in cancer disease. In this study we aimed to establish a lncRNAs classifier to improve the accuracy of recurrence prediction for thymic epithelial tumors (TETs).

Methods

TETs RNA sequencing (RNA‐seq) data set and the matched clinicopathologic information were downloaded from the Cancer Genome Atlas. Using univariate Cox regression and least absolute shrinkage and selection operator (LASSO) analysis, we developed a lncRNAs classifier related to recurrence. Functional analysis was conducted to investigate the potential biological processes of the lncRNAs target genes. The independent prognostic factors were identified by Cox regression model. Additionally, predictive ability and clinical application of the lncRNAs classifier were assessed, and compared with the Masaoka staging by receiver operating characteristic (ROC) analysis and decision curve analysis (DCA).

Results

Four recurrence‐free survival (RFS)‐related lncRNAs were identified, and the classifier consisting of the identified four lncRNAs was able to effectively divide the patients into high and low risk subgroups, with an area under curve (AUC) of 0.796 (three‐year RFS) and 0.788 (five‐year RFS), respectively. Multivariate analysis indicated that the lncRNAs classifier was an independent recurrence risk factor. The AUC of the lncRNAs classifier in predicting RFS was significantly higher than the Masaoka staging system. Decision curve analysis further demonstrated that the lncRNAs classifier had a larger net benefit than the Masaoka staging system.

Conclusions

A lncRNAs classifier for patients with TETs was an independent risk factor for RFS despite other clinicopathologic variables. It generated more accurate estimations of the recurrence probability when compared to the Masaoka staging system, but additional data is required before it can be used in clinical practice.

Mertens BJA, Banzato E, de Wreede LC. Construction and assessment of prediction rules for binary outcome in the presence of missing predictor data using multiple imputation and cross-validation: Methodological approach and data-based evaluation. Biom J 2020;62:724-741. [PMID: 32052492 PMCID: PMC7217034 DOI: 10.1002/bimj.201800289] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/12/2018] [Revised: 10/18/2019] [Accepted: 11/04/2019] [Indexed: 12/24/2022]

Groenwold RHH. Informative missingness in electronic health record systems: the curse of knowing. Diagn Progn Res 2020;4:8. [PMID: 32699824 PMCID: PMC7371469 DOI: 10.1186/s41512-020-00077-0] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 01/07/2020] [Accepted: 04/22/2020] [Indexed: 12/17/2022] Open

Mijderwijk HJ, Steyerberg EW, Steiger HJ, Fischer I, Kamp MA. Fundamentals of Clinical Prediction Modeling for the Neurosurgeon. Neurosurgery 2019;85:302-311. [DOI: 10.1093/neuros/nyz282] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/22/2018] [Accepted: 05/26/2019] [Indexed: 01/18/2023] Open