1
D'Agostino McGowan L, Lotspeich SC, Hepler SA. The "Why" behind including "Y" in your imputation model. Stat Methods Med Res 2024:9622802241244608. [PMID: 38625810 DOI: 10.1177/09622802241244608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2024]
Abstract
Missing data is a common challenge when analyzing epidemiological data, and imputation is often used to address this issue. Here, we investigate the scenario where a covariate used in an analysis has missingness and will be imputed. There are recommendations to include the outcome from the analysis model in the imputation model for missing covariates, but it is not necessarily clear if this recommendation always holds and why this is sometimes true. We examine deterministic imputation (i.e. single imputation with fixed values) and stochastic imputation (i.e. single or multiple imputation with random values) methods and their implications for estimating the relationship between the imputed covariate and the outcome. We mathematically demonstrate that including the outcome variable in imputation models is not just a recommendation but a requirement to achieve unbiased results when using stochastic imputation methods. Moreover, we dispel common misconceptions about deterministic imputation models and demonstrate why the outcome should not be included in these models. This article aims to bridge the gap between imputation in theory and in practice, providing mathematical derivations to explain common statistical recommendations. We offer a better understanding of the considerations involved in imputing missing covariates and emphasize when it is necessary to include the outcome variable in the imputation model.
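The contrast this abstract draws between deterministic and stochastic imputation can be illustrated with a small simulation. The following is a minimal sketch, not the authors' derivation: it assumes a single standard-normal covariate, a linear outcome, and values missing completely at random, and every name in it (`simulate_slopes`, `beta`, `p_missing`) is hypothetical.

```python
import numpy as np


def simulate_slopes(n=200_000, beta=2.0, p_missing=0.5, seed=0):
    """Estimate the slope of y on x after four single-imputation strategies.

    Assumed setup (illustrative only): x ~ N(0, 1), y = beta * x + noise,
    and x is missing completely at random with probability p_missing.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    missing = rng.random(n) < p_missing
    obs = ~missing
    n_missing = int(missing.sum())

    def slope(x_filled):
        # np.polyfit returns [slope, intercept] for degree 1
        return float(np.polyfit(x_filled, y, 1)[0])

    # Deterministic, outcome ignored: fill with the observed mean of x
    x_det = x.copy()
    x_det[missing] = x[obs].mean()

    # Stochastic, outcome ignored: draw from the observed marginal of x
    x_sto = x.copy()
    x_sto[missing] = rng.normal(x[obs].mean(), x[obs].std(), n_missing)

    # Imputation model x ~ y fitted on the rows where x is observed
    b1, b0 = np.polyfit(y[obs], x[obs], 1)
    resid_sd = np.std(x[obs] - (b0 + b1 * y[obs]))
    predicted = b0 + b1 * y[missing]

    # Deterministic, outcome included: fill with the fitted conditional mean
    x_det_y = x.copy()
    x_det_y[missing] = predicted

    # Stochastic, outcome included: conditional mean plus residual noise
    x_sto_y = x.copy()
    x_sto_y[missing] = predicted + rng.normal(0.0, resid_sd, n_missing)

    return {
        "deterministic_without_y": slope(x_det),
        "stochastic_without_y": slope(x_sto),
        "deterministic_with_y": slope(x_det_y),
        "stochastic_with_y": slope(x_sto_y),
    }


if __name__ == "__main__":
    for name, value in simulate_slopes().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the estimated slope stays close to `beta` for mean imputation without the outcome and for stochastic imputation with the outcome; it is attenuated when random draws ignore the outcome and inflated when the conditional mean given the outcome is imputed deterministically.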
Affiliation(s)
- Sarah C Lotspeich
  - Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
- Staci A Hepler
  - Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
2
Cheng P, Yang P, Zhang H, Wang H. Prediction Models for Return of Spontaneous Circulation in Patients with Cardiac Arrest: A Systematic Review and Critical Appraisal. Emerg Med Int 2023; 2023:6780941. [PMID: 38035124 PMCID: PMC10684323 DOI: 10.1155/2023/6780941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2023] [Revised: 04/23/2023] [Accepted: 11/04/2023] [Indexed: 12/02/2023] Open
Abstract
Objectives: Prediction models for the return of spontaneous circulation (ROSC) in patients with cardiac arrest play an important role in helping physicians evaluate survival probability and in providing a reference for medical decision-making. Although relevant models have been developed, their methodological rigor and applicability remain unclear. Therefore, this study aims to summarize the evidence for ROSC prediction models and provide a reference for their development, validation, and application. Methods: PubMed, Cochrane Library, Embase, Elsevier, Web of Science, SpringerLink, Ovid, CNKI, Wanfang, and SinoMed were systematically searched for studies on ROSC prediction models. The search covered the period from the establishment of each database to August 30, 2022. Two reviewers independently screened the literature and extracted the data. PROBAST was used to evaluate the quality of the included literature. Results: A total of 8 relevant prediction models were included, and 6 models reported an AUC of 0.662-0.830 in the modeling population; the models showed good overall applicability but a high risk of bias. The main reasons were improper handling of missing values and variable screening, lack of external validation of the model, and insufficient information on overfitting. Age, gender, etiology, initial heart rhythm, EMS arrival time/BLS intervention time, location, bystander CPR, witnessed during sudden arrest, and ACLS duration/compression duration were the most commonly included predictors. Obvious chest injury, body temperature below 33°C, and possible etiologies were predictive factors for ROSC failure in patients with TOHCA. Age, gender, initial heart rhythm, reason for the hospital visit, length of hospital stay, and the location of occurrence in hospital were the predictors of ROSC in IHCA patients. Conclusion: The performance of current ROSC prediction models varies greatly, and the models have a high risk of bias, so they should be selected with caution. Future studies can further optimize and externally validate the existing models.
Affiliation(s)
- Pengfei Cheng
  - Department of Nursing, Second Affiliated Hospital of Zhejiang University, Hangzhou 310009, China
- Pengyu Yang
  - School of International Nursing, Hainan Medical University, Haikou 571199, China
- Hua Zhang
  - School of International Nursing, Hainan Medical University, Haikou 571199, China
  - Key Laboratory of Emergency and Trauma Ministry of Education, Hainan Medical University, Haikou 571199, China
- Haizhen Wang
  - Department of Nursing, Second Affiliated Hospital of Zhejiang University, Hangzhou 310009, China
3
Godfrey CM, Shipe ME, Welty VF, Maiga AW, Aldrich MC, Montgomery C, Crockett J, Vaszar LT, Regis S, Isbell JM, Rickman OB, Pinkerman R, Lambright ES, Nesbitt JC, Maldonado F, Blume JD, Deppen SA, Grogan EL. The Thoracic Research Evaluation and Treatment 2.0 Model: A Lung Cancer Prediction Model for Indeterminate Nodules Referred for Specialist Evaluation. Chest 2023; 164:1305-1314. [PMID: 37421973 PMCID: PMC10635839 DOI: 10.1016/j.chest.2023.06.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2023] [Revised: 05/03/2023] [Accepted: 06/01/2023] [Indexed: 07/10/2023] Open
Abstract
BACKGROUND: Appropriate risk stratification of indeterminate pulmonary nodules (IPNs) is necessary to direct diagnostic evaluation. Currently available models were developed in populations with lower cancer prevalence than that seen in thoracic surgery and pulmonology clinics and usually do not allow for missing data. We updated and expanded the Thoracic Research Evaluation and Treatment (TREAT) model into a more generalized, robust approach for lung cancer prediction in patients referred for specialty evaluation. RESEARCH QUESTION: Can clinic-level differences in nodule evaluation be incorporated to improve lung cancer prediction accuracy in patients seeking immediate specialty evaluation compared with currently available models? STUDY DESIGN AND METHODS: Clinical and radiographic data on patients with IPNs from six sites (N = 1,401) were collected retrospectively and divided into groups by clinical setting: pulmonary nodule clinic (n = 374; cancer prevalence, 42%), outpatient thoracic surgery clinic (n = 553; cancer prevalence, 73%), or inpatient surgical resection (n = 474; cancer prevalence, 90%). A new prediction model was developed using a missing data-driven pattern submodel approach. Discrimination and calibration were estimated with cross-validation and were compared with the original TREAT, Mayo Clinic, Herder, and Brock models. Reclassification was assessed with bias-corrected clinical net reclassification index and reclassification plots. RESULTS: Two-thirds of patients had missing data; nodule growth and fluorodeoxyglucose-PET scan avidity were missing most frequently. The TREAT version 2.0 mean area under the receiver operating characteristic curve across missingness patterns was 0.85 compared with that of the original TREAT (0.80), Herder (0.73), Mayo Clinic (0.72), and Brock (0.68) models with improved calibration. The bias-corrected clinical net reclassification index was 0.23. INTERPRETATION: The TREAT 2.0 model is more accurate and better calibrated for predicting lung cancer in high-risk IPNs than the Mayo, Herder, or Brock models. Nodule calculators such as TREAT 2.0 that account for varied lung cancer prevalence and that consider missing data may provide more accurate risk stratification for patients seeking evaluation at specialty nodule evaluation clinics.
Affiliation(s)
- Caroline M Godfrey
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Maren E Shipe
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Valerie F Welty
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Amelia W Maiga
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
- Melinda C Aldrich
  - Department of Medicine, Vanderbilt University Medical Center, Nashville, TN
- Jerod Crockett
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Shawn Regis
  - Department of Radiation Oncology, Lahey Hospital and Medical Center, Burlington, MA
- James M Isbell
  - Department of Surgery, Memorial Sloan Kettering Cancer Center, New York, NY
- Otis B Rickman
  - Division of Pulmonary Medicine, Vanderbilt University Medical Center, Nashville, TN
- Rhonda Pinkerman
  - Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
- Eric S Lambright
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Jonathan C Nesbitt
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
- Fabien Maldonado
  - Division of Pulmonary Medicine, Vanderbilt University Medical Center, Nashville, TN
- Jeffrey D Blume
  - School of Data Science, University of Virginia, Charlottesville, VA
- Stephen A Deppen
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN
- Eric L Grogan
  - Department of Thoracic Surgery, Vanderbilt University Medical Center, Nashville, TN; Division of Thoracic Surgery, Veterans Hospital, Tennessee Valley Healthcare System, Nashville, TN
4
Ren B, Lipsitz SR, Weiss RD, Fitzmaurice GM. Multiple imputation for non-monotone missing not at random data using the no self-censoring model. Stat Methods Med Res 2023; 32:1973-1993. [PMID: 37647237 DOI: 10.1177/09622802231188520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/01/2023]
Abstract
Although approaches for handling missing data from longitudinal studies are well-developed when the patterns of missingness are monotone, fewer methods are available for non-monotone missingness. Moreover, the conventional missing at random assumption (a natural benchmark for monotone missingness) does not model realistic beliefs about the non-monotone missingness processes (Robins and Gill, 1997). This has provided the impetus for alternative non-monotone missing not at random mechanisms. The "no self-censoring" model is such a mechanism and assumes that the probability an outcome variable is missing is independent of its value when conditioning on all other possibly missing outcome variables and their missingness indicators. As an alternative to "weighting" methods that become computationally demanding with an increasing number of outcome variables, we propose a multiple imputation approach under no self-censoring. We focus on the case of binary outcomes and present results of simulation and asymptotic studies to investigate the performance of the proposed imputation approach. We describe a related approach to sensitivity analysis for departures from no self-censoring. We discuss the relationship between missing at random and no self-censoring and prove that one is not a special case of the other. Finally, we discuss extensions to non-binary data settings. The proposed methods are illustrated with application to a substance use disorder clinical trial.
Affiliation(s)
- Boyu Ren
  - Laboratory for Psychiatric Biostatistics, McLean Hospital, Belmont, MA, USA
  - Department of Psychiatry, Harvard Medical School, Boston, MA, USA
- Stuart R Lipsitz
  - Division of General Medicine, Brigham and Women's Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Roger D Weiss
  - Department of Psychiatry, Harvard Medical School, Boston, MA, USA
  - Division of Alcohol and Drug Abuse, McLean Hospital, Belmont, MA, USA
- Garrett M Fitzmaurice
  - Laboratory for Psychiatric Biostatistics, McLean Hospital, Belmont, MA, USA
  - Department of Psychiatry, Harvard Medical School, Boston, MA, USA
5
Rodriguez PJ, Heagerty PJ, Clark S, Khor S, Chen Y, Haupt E, Hahn EE, Shankaran V, Bansal A. Using Machine Learning to Leverage Biomarker Change and Predict Colorectal Cancer Recurrence. JCO Clin Cancer Inform 2023; 7:e2300066. [PMID: 37963310 PMCID: PMC10681492 DOI: 10.1200/cci.23.00066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 06/12/2023] [Accepted: 07/12/2023] [Indexed: 11/16/2023] Open
Abstract
PURPOSE: The risk of colorectal cancer (CRC) recurrence after primary treatment varies across individuals and over time. Using patients' most up-to-date information, including carcinoembryonic antigen (CEA) biomarker profiles, to predict risk could improve personalized decision making. METHODS: We used electronic health record data from an integrated health system on a cohort of patients diagnosed with American Joint Committee on Cancer stage I-III CRC between 2008 and 2013 (N = 3,970) and monitored until recurrence or end of follow-up. We addressed missingness in recurrence outcomes and longitudinal CEA measures, and engineered CEA features using current and past biomarker values for inclusion in a risk prediction model. We used a discrete time Superlearner model to evaluate various algorithms for predicting recurrence. We evaluated the time-varying discrimination and calibration of the algorithms and assessed the role of individual predictors. RESULTS: Recurrence was documented in 448 (11.3%) patients. XGBoost with depth = 1 (XGB-D1) predicted recurrence substantially better than all other algorithms at all time points, with AUC ranging from 0.87 (95% CI, 0.86 to 0.88) at 6 months to 0.94 (95% CI, 0.92 to 0.96) at 54 months. The only variable used by XGB-D1 was 6-month change in log CEA. Predicted 1-year risk of recurrence was nearly zero for patients whose log CEA did not increase in the last 6 months, between 12.2% and 34.1% for patients whose log CEA increased between 0.10 and 0.40, and 43.6% for those with a log CEA increase >0.40. Compared with XGB, penalized regression approaches (lasso, ridge, and elastic net) performed poorly, with AUCs ranging from 0.58 to 0.69. CONCLUSION: A flexible, machine learning approach that incorporated longitudinal CEA information yielded a simple and high-performing model for predicting recurrence on the basis of 6-month change in log CEA.
Affiliation(s)
- Patricia J. Rodriguez
  - The Comparative Health Outcomes, Policy & Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Samantha Clark
  - The Comparative Health Outcomes, Policy & Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Sara Khor
  - The Comparative Health Outcomes, Policy & Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Yilin Chen
  - The Comparative Health Outcomes, Policy & Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Eric Haupt
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA
- Erin E. Hahn
  - Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA
  - Department of Health Systems Science, Kaiser Permanente Bernard J. Tyson School of Medicine, Pasadena, CA
- Aasthaa Bansal
  - The Comparative Health Outcomes, Policy & Economics (CHOICE) Institute, University of Washington, Seattle, WA
  - Fred Hutchinson Cancer Center, Seattle, WA
6
Sisk R, Sperrin M, Peek N, van Smeden M, Martin GP. Imputation and missing indicators for handling missing data in the development and deployment of clinical prediction models: A simulation study. Stat Methods Med Res 2023; 32:1461-1477. [PMID: 37105540 PMCID: PMC10515473 DOI: 10.1177/09622802231165001] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 04/29/2023]
Abstract
Background: In clinical prediction modelling, missing data can occur at any stage of the model pipeline: development, validation or deployment. Multiple imputation is often recommended yet challenging to apply at deployment; for example, the outcome cannot be in the imputation model, as recommended under multiple imputation. Regression imputation uses a fitted model to impute the predicted value of missing predictors from observed data, and could offer a pragmatic alternative at deployment. Moreover, the use of missing indicators has been proposed to handle informative missingness, but it is currently unknown how well this method performs in the context of clinical prediction models. Methods: We simulated data under various missing data mechanisms to compare the predictive performance of clinical prediction models developed using both imputation methods. We consider deployment scenarios where missing data is permitted or prohibited, imputation models that use or omit the outcome, and clinical prediction models that include or omit missing indicators. We assume that the missingness mechanism remains constant across the model pipeline. We also apply the proposed strategies to critical care data. Results: With complete data available at deployment, our findings were in line with existing recommendations: the outcome should be used to impute development data when using multiple imputation and omitted under regression imputation. When missingness is allowed at deployment, omitting the outcome from the imputation model at development was preferred. Missing indicators improved model performance in many cases but can be harmful under outcome-dependent missingness. Conclusion: We provide evidence that commonly taught principles of handling missing data via multiple imputation may not apply to clinical prediction models, particularly when data can be missing at deployment. We observed comparable predictive performance under multiple imputation and regression imputation. The performance of the missing data handling method must be evaluated on a study-by-study basis, and the most appropriate strategy for handling missing data at development should consider whether missing data are allowed at deployment. Some guidance is provided.
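Regression imputation with a missing indicator, as this abstract describes for deployment, can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the study's simulation code: it assumes one fully observed predictor `x1`, one partially missing predictor `x2` coded as NaN, and a linear imputation model that omits the outcome. The function names are hypothetical.

```python
import numpy as np


def fit_regression_imputer(x1, x2):
    """Fit x2 ~ x1 by least squares on rows where x2 is observed.

    The outcome is deliberately not used, so the same imputer can be
    applied at deployment, when the outcome is not yet known.
    """
    seen = ~np.isnan(x2)
    slope, intercept = np.polyfit(x1[seen], x2[seen], 1)
    return float(intercept), float(slope)


def impute_with_indicator(x1, x2, imputer):
    """Replace missing x2 by its predicted value and return a 0/1 flag."""
    intercept, slope = imputer
    missing = np.isnan(x2)
    filled = np.where(missing, intercept + slope * x1, x2)
    return filled, missing.astype(int)


if __name__ == "__main__":
    # Toy development data (made up for illustration): x2 = 3 + 2 * x1
    x1_dev = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    x2_dev = np.array([3.0, 5.0, np.nan, 9.0, 11.0])
    imputer = fit_regression_imputer(x1_dev, x2_dev)

    # A new patient at deployment with x2 unobserved
    filled, flag = impute_with_indicator(
        np.array([2.5]), np.array([np.nan]), imputer
    )
    print(filled, flag)
```

The imputed predictor and the indicator column would then both be passed to the prediction model; whether the indicator helps depends on the missingness mechanism, as the abstract notes.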
Affiliation(s)
- Rose Sisk
  - Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
  - Gendius Ltd, Macclesfield, UK
- Matthew Sperrin
  - Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
  - Alan Turing Institute, London, UK
- Niels Peek
  - Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
  - Alan Turing Institute, London, UK
  - NIHR Manchester Biomedical Research Centre, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
- Maarten van Smeden
  - Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, the Netherlands
- Glen Philip Martin
  - Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK
7
Mao X, Wang Z, Yang S. Matrix completion under complex survey sampling. Ann Inst Stat Math 2023; 75:463-492. [PMID: 37645434 PMCID: PMC10465119 DOI: 10.1007/s10463-022-00851-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Revised: 08/12/2022] [Accepted: 08/17/2022] [Indexed: 01/10/2023]
Abstract
Multivariate nonresponse is often encountered in complex survey sampling, and simply ignoring it leads to erroneous inference. In this paper, we propose a new matrix completion method for complex survey sampling. Unlike existing works, which conduct either row-wise or column-wise imputation, the proposed method treats the data matrix as a whole, which allows both row and column patterns to be exploited simultaneously. A column-space-decomposition model is adopted, incorporating a low-rank structured matrix for the finite population with easy-to-obtain demographic information as covariates. In addition, we propose a computationally efficient projection strategy to identify the model parameters under complex survey sampling. Then, an augmented inverse probability weighting estimator is used to estimate the parameter of interest, and the corresponding asymptotic upper bound of the estimation error is derived. Simulation studies show that the proposed estimator has a smaller mean squared error than other competitors, and the corresponding variance estimator performs well. The proposed method is applied to assess the health status of the U.S. population.
Affiliation(s)
- Xiaojun Mao
  - School of Mathematical Sciences, Ministry of Education Key Laboratory of Scientific and Engineering Computing, Shanghai Jiao Tong University, Shanghai 200240, People’s Republic of China
- Zhonglei Wang
  - Wang Yanan Institute for Studies in Economics and School of Economics, Xiamen University, Xiamen 361005, Fujian, People’s Republic of China
- Shu Yang
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
8
Yu J, Liu X, Zhu Z, Yang Z, He J, Zhang L, Lu H. Prediction models for cardiovascular disease risk among people living with HIV: A systematic review and meta-analysis. Front Cardiovasc Med 2023; 10:1138234. [PMID: 37034346 PMCID: PMC10077152 DOI: 10.3389/fcvm.2023.1138234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 03/08/2023] [Indexed: 04/11/2023] Open
Abstract
Background: HIV continues to be a major global health issue. The relative risk of cardiovascular disease (CVD) among people living with HIV (PLWH) was 2.16 compared with people without HIV infection. The prediction of CVD is becoming an important issue in current HIV management. However, there is no consensus on optional CVD risk models for PLWH. Therefore, we aimed to systematically summarize and compare prediction models for CVD risk among PLWH. Methods: Longitudinal studies that developed or validated prediction models for CVD risk among PLWH were systematically searched. Five databases were searched up to January 2022. The quality of the included articles was evaluated by using the Prediction model Risk Of Bias ASsessment Tool (PROBAST). We applied meta-analysis to pool the logit-transformed C-statistics for discrimination performance. Results: Thirteen articles describing 17 models were included. All the included studies had a high risk of bias. In the meta-analysis, the pooled estimated C-statistic was 0.76 (95% CI: 0.72-0.81, I² = 84.8%) for the Data collection on Adverse Effects of Anti-HIV Drugs Study risk equation (D:A:D) (2010), 0.75 (95% CI: 0.70-0.79, I² = 82.4%) for the D:A:D (2010) 10-year risk version, 0.77 (95% CI: 0.74-0.80, I² = 82.2%) for the full D:A:D (2016) model, 0.74 (95% CI: 0.68-0.79, I² = 86.2%) for the reduced D:A:D (2016) model, 0.71 (95% CI: 0.61-0.79, I² = 87.9%) for the Framingham Risk Score (FRS) for coronary heart disease (CHD) (1998), 0.74 (95% CI: 0.70-0.78, I² = 87.8%) for the FRS CVD model (2008), 0.72 (95% CI: 0.67-0.76, I² = 75.0%) for the pooled cohort equations of the American Heart Society/ American score (PCE), and 0.67 (95% CI: 0.56-0.77, I² = 51.3%) for the Systematic COronary Risk Evaluation (SCORE). In the subgroup analysis, the discrimination of PCE was significantly better in the group aged ≤40 years than in the group aged 40-45 years (P = 0.024) and the group aged ≥45 years (P = 0.010). No models were developed or validated in Sub-Saharan Africa and the Asia region. Conclusions: The full D:A:D (2016) model performed the best in terms of discrimination, followed by the D:A:D (2010) and PCE. However, there were no significant differences between any of the model pairings. Specific CVD risk models for older PLWH and for PLWH in Sub-Saharan Africa and the Asia region should be established. Systematic Review Registration: PROSPERO CRD42022322024.
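Pooling C-statistics on the logit scale, as this abstract describes, can be sketched as follows. The abstract does not say which estimator the authors used; this sketch assumes a DerSimonian-Laird random-effects model and recovers each standard error from the width of the reported 95% CI on the logit scale. The function name and the input values in the usage example are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def pool_c_statistics(studies):
    """Pool (c, ci_lower, ci_upper) tuples on the logit scale.

    Assumptions (not taken from the source): DerSimonian-Laird
    random effects, and SE = CI width on the logit scale / (2 * 1.96).
    """
    y = [logit(c) for c, _, _ in studies]
    se = [(logit(hi) - logit(lo)) / (2 * 1.96) for _, lo, hi in studies]
    w = [1.0 / s**2 for s in se]

    # Fixed-effect estimate and Cochran's Q
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(studies) - 1

    # Between-study variance (method of moments), truncated at zero
    scale = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / scale) if df > 0 and scale > 0 else 0.0

    # Random-effects estimate with a 95% CI, back-transformed
    w_re = [1.0 / (s**2 + tau2) for s in se]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se_pooled = math.sqrt(1.0 / sum(w_re))

    # I-squared: share of total variability attributed to heterogeneity
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0

    return {
        "c": inv_logit(pooled),
        "lower": inv_logit(pooled - 1.96 * se_pooled),
        "upper": inv_logit(pooled + 1.96 * se_pooled),
        "i2": i2,
    }


if __name__ == "__main__":
    # Made-up validation results, for illustration only
    example = [(0.70, 0.65, 0.75), (0.80, 0.76, 0.84), (0.76, 0.70, 0.81)]
    print(pool_c_statistics(example))
```

Working on the logit scale keeps the back-transformed interval inside (0, 1), which is the usual motivation for transforming C-statistics before pooling.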
Affiliation(s)
- Junwen Yu
  - School of Nursing, Fudan University, Shanghai, China
- Xiaoning Liu
  - Department of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Guangdong, China
  - National Heart & Lung Institute, Faculty of Medicine, Imperial College London, London, United Kingdom
- Zheng Zhu
  - School of Nursing, Fudan University, Shanghai, China
  - Fudan University Centre for Evidence-Based Nursing: A Joanna Briggs Institute Centre of Excellence, Shanghai, China
  - NYU Rory Meyers College of Nursing, New York University, New York City, NY, United States
  - Correspondence: Zheng Zhu, Hongzhou Lu
- Zhongfang Yang
  - School of Nursing, Fudan University, Shanghai, China
  - Fudan University Centre for Evidence-Based Nursing: A Joanna Briggs Institute Centre of Excellence, Shanghai, China
  - Shanghai Institute of Infectious Disease and Biosecurity, Fudan University, Shanghai, China
- Jiamin He
  - School of Nursing, Fudan University, Shanghai, China
- Lin Zhang
  - Shanghai Public Health Clinical Center, Fudan University, Shanghai, China
- Hongzhou Lu
  - Department of Infectious Diseases, National Clinical Research Center for Infectious Diseases, Shenzhen Third People's Hospital, Guangdong, China
  - Correspondence: Zheng Zhu, Hongzhou Lu
9
Du M, Haag DG, Lynch JW, Mittinty MN. Application of multilevel models for predicting pain following root canal treatment. Community Dent Oral Epidemiol 2022; 51:418-427. [PMID: 36510289 DOI: 10.1111/cdoe.12807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2021] [Revised: 09/17/2022] [Accepted: 10/17/2022] [Indexed: 12/14/2022]
Abstract
OBJECTIVES: This study developed predictive models for one-week acute and six-month persistent pain following root canal treatment (RCT). An additional aim was to study the gain in predictive efficacy of models containing clinical factors only, over models containing sociodemographic characteristics. METHODS: A secondary data analysis of 708 patients who received RCTs was conducted. Three sets of predictors were used: (1) combined set, containing all predictors in the data set; (2) clinical set and (3) sociodemographic set. Missing data were handled by multiple imputation using the missing indicator method. The multilevel least absolute shrinkage and selection operator (LASSO) regression was used to select predictors into the final multilevel logistic models. Three measures, the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC) and calibration curves, were used to assess the predictive performance of the models. RESULTS: The factors selected into the final models by LASSO regression are related to pre- and intra-treatment clinical symptoms and pain experience. Predictive performance of the models remained the same whether the sociodemographic factors were included or excluded. For predicting one-week outcome, the model built with the combined set of predictors yielded the highest AUROC and AUPRC of 0.85 and 0.72, followed by the models built with clinical factors (AUROC = 0.82, AUPRC = 0.66). The lowest predictive ability was found in models with only sociodemographic characteristics (AUROC = 0.68, AUPRC = 0.40). Similar patterns were observed in predicting six-month outcome, where the AUROC for models with combined, clinical and sociodemographic sets of predictors were 0.85, 0.89 and 0.66, respectively, and the AUPRC were 0.48, 0.53 and 0.22, respectively. CONCLUSIONS: Clinical factors such as the severity and experience of pre-operative and intra-operative pain were found to be important to the subsequent development of pain following RCTs. Adding sociodemographic characteristics to the models with clinical factors did not change the models' predictive performance or the proportion of explained variance.
Affiliation(s)
- Mi Du
  - Department of Implantology, School and Hospital of Stomatology, Cheeloo College of Medicine, Shandong Key Laboratory of Oral Tissue Regeneration & Shandong Engineering Laboratory for Dental Materials and Oral Tissue Regeneration, Shandong Provincial Clinical Research Center for Oral Diseases, Shandong University, Jinan, China
  - School of Public Health, The University of Adelaide, Adelaide, South Australia, Australia
  - Robinson Research Institute, The University of Adelaide, Adelaide, South Australia, Australia
- Dandara Gabriela Haag
  - School of Public Health, The University of Adelaide, Adelaide, South Australia, Australia
  - Robinson Research Institute, The University of Adelaide, Adelaide, South Australia, Australia
  - Australian Research Centre for Population Oral Health, Adelaide Dental School, The University of Adelaide, Adelaide, South Australia, Australia
- John W Lynch
  - School of Public Health, The University of Adelaide, Adelaide, South Australia, Australia
  - Robinson Research Institute, The University of Adelaide, Adelaide, South Australia, Australia
  - Population Health Sciences, University of Bristol, Bristol, UK
- Murthy N Mittinty
  - School of Public Health, The University of Adelaide, Adelaide, South Australia, Australia
  - Robinson Research Institute, The University of Adelaide, Adelaide, South Australia, Australia
10
Shen WC, Chiang HY, Chen PS, Lin YT, Kuo CC, Wu PY. Risk of All-Cause Mortality, Cardiovascular Disease Mortality, and Cancer Mortality in Patients With Bullous Pemphigoid. JAMA Dermatol 2022; 158:167-175. [PMID: 34964804 PMCID: PMC8717210 DOI: 10.1001/jamadermatol.2021.5125] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 04/04/2021] [Accepted: 10/18/2021] [Indexed: 12/31/2022]
Abstract
IMPORTANCE The role of bullous pemphigoid (BP) in cardiovascular disease (CVD) mortality remains controversial, and analyses of causes of death among patients with BP based on individual data remain lacking. OBJECTIVE To evaluate the risk of all-cause mortality, CVD mortality, and cancer mortality in patients with BP. DESIGN, SETTING, AND PARTICIPANTS This cohort study identified patients who received a diagnosis of and treatment for BP during their dermatology clinic visits at a tertiary medical center in central Taiwan between January 1, 2007, and December 31, 2017. Controls were patients without BP and were individually matched to cases (4:1) according to age, sex, and date of the dermatology clinic visit. Data were analyzed from March 6, 2019, to April 2, 2021. EXPOSURES Bullous pemphigoid was confirmed pathologically with typical direct immunofluorescence findings or clinically with typical clinical presentation, positive findings of an anti-basement membrane zone antibody test, and corticosteroid use for at least 28 cumulative days. MAIN OUTCOMES AND MEASURES Mortality outcomes confirmed by the National Death Registry. RESULTS Of 252 patients with BP and 1008 matched control patients (N = 1260), 685 (54.4%) were men and the median age was 78.0 (IQR, 70.3-84.8) years. Patients with BP had higher CVD mortality at 1 year (20 [7.9%] vs 13 [1.3%]), 3 years (28 [11.1%] vs 24 [2.4%]), and 5 years (31 [12.3%] vs 39 [3.9%]) compared with matched control patients. After adjusting for potential confounding variables, patients with BP had a 5-fold higher risk of CVD mortality at 1 year (hazard ratio [HR], 5.29 [95% CI, 2.40-11.68]), 3 years (HR, 5.79 [95% CI, 3.11-10.78]), and 5 years (HR, 4.95 [95% CI, 2.88-8.51]). 
Subgroup analyses revealed that the CVD mortality risk associated with BP was higher in patients without a history of hypertension (HR, 7.28 [95% CI, 3.87-13.69]) or CVD (HR, 6.59 [95% CI, 3.40-12.79]) and in patients without prior diuretic use (HR, 5.75 [95% CI, 3.15-10.50]) compared with matched control patients. In addition, all-cause mortality associated with BP was higher in patients without prior corticosteroid use than in control patients (HR, 5.65 [95% CI, 4.19-7.61]). CONCLUSIONS AND RELEVANCE The findings of this cohort study suggest that BP was associated with a 5-fold higher risk of CVD mortality, particularly in patients without underlying hypertension or CVD or those without prior corticosteroid or diuretic use. Future studies should investigate the benefits of routine monitoring and timely management of CVD symptoms and signs in patients with BP.
Affiliation(s)
- Wan-Chieh Shen
- Department of Dermatology, China Medical University Hospital, Taichung City, Taiwan
- Hsiu-Yin Chiang
- Big Data Center, China Medical University Hospital, Taichung City, Taiwan
- Pei-Shan Chen
- Big Data Center, China Medical University Hospital, Taichung City, Taiwan
- Yu-Ting Lin
- Big Data Center, China Medical University Hospital, Taichung City, Taiwan
- Chin-Chi Kuo
- Big Data Center, China Medical University Hospital, Taichung City, Taiwan
- Department of Medical Research, China Medical University Hospital, Taichung City, Taiwan
- Department of Nephrology, Department of Internal Medicine, China Medical University Hospital, Taichung City, Taiwan
- School of Medicine, China Medical University, Taichung City, Taiwan
- Po-Yuan Wu
- Department of Dermatology, China Medical University Hospital, Taichung City, Taiwan
- School of Medicine, China Medical University, Taichung City, Taiwan
|
11
|
Berkelmans G, Read S, Gudbjörnsdottir S, Wild S, Franzen S, van der Graaf Y, Eliasson B, Visseren F, Paynter N, Dorresteijn J. Population median imputation was noninferior to complex approaches for imputing missing values in cardiovascular prediction models in clinical practice. J Clin Epidemiol 2022; 145:70-80. [DOI: 10.1016/j.jclinepi.2022.01.011] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/10/2021] [Revised: 12/05/2021] [Accepted: 01/17/2022] [Indexed: 02/06/2023]
|
12
|
Page GL, Quintana FA, Müller P. Clustering and Prediction With Variable Dimension Covariates. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1999824] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Garritt L. Page
- Department of Statistics, Brigham Young University, Provo, UT
- BCAM—Basque Center for Applied Mathematics, Bilbao, Spain
- Fernando A. Quintana
- Departamento de Estadística, Pontificia Universidad Católica de Chile, Santiago, Chile
- Millennium Nucleus Center for the Discovery of Structures in Complex Data, Santiago, Chile
- Peter Müller
- Department of Mathematics, The University of Texas at Austin, TX
|
13
|
Nijman S, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs J, Bots ML, Asselbergs FW, Moons K, Debray T. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol 2021; 142:218-229. [PMID: 34798287 DOI: 10.1016/j.jclinepi.2021.11.023] [Citation(s) in RCA: 29] [Impact Index Per Article: 9.7] [Received: 10/01/2021] [Revised: 11/01/2021] [Accepted: 11/10/2021] [Indexed: 12/23/2022]
Abstract
OBJECTIVES Missing data is a common problem during the development, evaluation, and implementation of prediction models. Although machine learning (ML) methods are often said to be capable of circumventing missing data, it is unclear how these methods are used in medical research. We aim to find out if and how well prediction model studies using machine learning report on their handling of missing data. STUDY DESIGN AND SETTING We systematically searched the literature for papers published between 2018 and 2019 reporting primary studies that developed and/or validated clinical prediction models using any supervised ML methodology across medical fields. From the retrieved studies, information about the amount and nature (e.g. missing completely at random, potential reasons for missingness) of missing data and the way they were handled was extracted. RESULTS We identified 152 machine learning-based clinical prediction model studies. A substantial number of these 152 papers did not report anything on missing data (n = 56/152). A majority (n = 96/152) reported details on the handling of missing data (e.g., methods used), though many of these (n = 46/96) did not report the amount of the missingness in the data. In these 96 papers the authors only sometimes reported possible reasons for missingness (n = 7/96) and information about missing data mechanisms (n = 8/96). The most common approach for handling missing data was deletion (n = 65/96), mostly via complete-case analysis (CCA) (n = 43/96). Very few studies used multiple imputation (n = 8/96) or built-in mechanisms such as surrogate splits (n = 7/96) that directly address missing data during the development, validation, or implementation of the prediction model.
CONCLUSION Though missing values are highly common in any type of medical research, and certainly in research based on routine healthcare data, a majority of the prediction model studies using machine learning do not report sufficient information on the presence and handling of missing data. Strategies in which patient data are simply omitted are unfortunately the most often used methods, even though this is generally advised against and is well known to likely cause bias and loss of analytical power in prediction model development and in the predictive accuracy estimates. Prediction model researchers should be much more aware of alternative methodologies to address missing data.
Affiliation(s)
- S W J Nijman
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Heidelberglaan 100, Utrecht, 3584 CX, The Netherlands.
- A M Leeuwenberg
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Heidelberglaan 100, Utrecht, 3584 CX, The Netherlands
- I Beekers
- Department of Health, Ortec B.V. Zoetermeer, The Netherlands
- I Verkouter
- Department of Health, Ortec B.V. Zoetermeer, The Netherlands
- J J L Jacobs
- Department of Health, Ortec B.V. Zoetermeer, The Netherlands
- M L Bots
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Heidelberglaan 100, Utrecht, 3584 CX, The Netherlands
- F W Asselbergs
- Department of Cardiology, University Medical Center Utrecht, Utrecht University, The Netherlands; Institute of Cardiovascular Science, Population Health Sciences, University College London, London, UK; Health Data Research UK, Institute of Health Informatics, University College London, London, UK
- K G M Moons
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Heidelberglaan 100, Utrecht, 3584 CX, The Netherlands
- T P A Debray
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Heidelberglaan 100, Utrecht, 3584 CX, The Netherlands; Health Data Research UK, Institute of Health Informatics, University College London, London, UK
|
14
|
van der Plas-Krijgsman WG, Giardiello D, Putter H, Steyerberg EW, Bastiaannet E, Stiggelbout AM, Mooijaart SP, Kroep JR, Portielje JEA, Liefers GJ, de Glas NA. Development and validation of the PORTRET tool to predict recurrence, overall survival, and other-cause mortality in older patients with breast cancer in the Netherlands: a population-based study. The Lancet Healthy Longevity 2021; 2:e704-e711. [PMID: 36098027 DOI: 10.1016/s2666-7568(21)00229-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 06/21/2021] [Revised: 09/03/2021] [Accepted: 09/06/2021] [Indexed: 12/24/2022]
Abstract
BACKGROUND Current prediction tools for breast cancer outcomes are not tailored to the older patient, in whom competing risk strongly influences treatment effects. We aimed to develop and validate a prediction tool for 5-year recurrence, overall mortality, and other-cause mortality for older patients (aged ≥65 years) with early invasive breast cancer and to estimate individualised expected benefits of adjuvant systemic treatment. METHODS We selected surgically treated patients with early invasive breast cancer (stage I-III) aged 65 years or older from the population-based FOCUS cohort in the Netherlands. We developed prediction models for 5-year recurrence, overall mortality, and other-cause mortality using cause-specific Cox proportional hazard models. External validation was performed in a Dutch Cancer registry cohort. Performance was evaluated with discrimination accuracy and calibration plots. FINDINGS We included 2744 female patients in the development cohort and 13631 female patients in the validation cohort. Median age was 74·8 years (range 65-98) in the development cohort and 76·0 years (70-101) in the validation cohort. 5-year follow-up was complete for more than 99% of all patients. We observed 343 and 1462 recurrences, and 831 and 3594 deaths, of which 586 and 2565 were without recurrence, in the development and validation cohort, respectively. The area under the receiver-operating-characteristic curve at 5 years in the external dataset was 0·76 (95% CI 0·75-0·76) for overall mortality, 0·76 (0·76-0·77) for recurrence, and 0·75 (0·74-0·75) for other-cause mortality. INTERPRETATION The PORTRET tool can accurately predict 5-year recurrence, overall mortality, and other-cause mortality in older patients with breast cancer. The tool can support shared decision making, especially since it provides individualised estimated benefits of adjuvant treatment. FUNDING Dutch Cancer Foundation and ZonMw.
Affiliation(s)
- Daniele Giardiello
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands; Division of Molecular Pathology, The Netherlands Cancer Institute-Antoni van Leeuwenhoek Hospital, Amsterdam, Netherlands; Eurac Research, Institute for Biomedicine, Bolzano, Italy
- Hein Putter
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
- Ewout W Steyerberg
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands; Department of Public Health, Erasmus MC, Rotterdam, Netherlands
- Esther Bastiaannet
- Department of Medical Oncology, Leiden University Medical Center, Leiden, Netherlands; Department of Surgery, Leiden University Medical Center, Leiden, Netherlands
- Anne M Stiggelbout
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands
- Simon P Mooijaart
- Department of Gerontology and Geriatrics, Leiden University Medical Center, Leiden, Netherlands
- Judith R Kroep
- Department of Medical Oncology, Leiden University Medical Center, Leiden, Netherlands
- Gerrit-Jan Liefers
- Department of Surgery, Leiden University Medical Center, Leiden, Netherlands.
- Nienke A de Glas
- Department of Medical Oncology, Leiden University Medical Center, Leiden, Netherlands
|
15
|
Beesley LJ, Bondarenko I, Elliot MR, Kurian AW, Katz SJ, Taylor JM. Multiple imputation with missing data indicators. Stat Methods Med Res 2021; 30:2685-2700. [PMID: 34643465 DOI: 10.1177/09622802211047346] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Indexed: 11/16/2022]
Abstract
Multiple imputation is a well-established general technique for analyzing data with missing values. A convenient way to implement multiple imputation is sequential regression multiple imputation, also called chained equations multiple imputation. In this approach, we impute missing values using regression models for each variable, conditional on the other variables in the data. This approach, however, assumes that the missingness mechanism is missing at random, and it is not well-justified under not-at-random missingness without additional modification. In this paper, we describe how we can generalize the sequential regression multiple imputation procedure to handle missingness not at random in the setting where missingness may depend on other variables that are also missing but not on the missing variable itself, conditioning on fully observed variables. We provide algebraic justification for several generalizations of standard sequential regression multiple imputation using Taylor series and other approximations of the target imputation distribution under missingness not at random. Resulting regression model approximations include indicators for missingness, interactions, or other functions of the not-at-random missingness model and observed data. In a simulation study, we demonstrate that the proposed sequential regression multiple imputation modifications result in reduced bias in the final analysis compared to standard sequential regression multiple imputation, with an approximation strategy involving inclusion of an offset in the imputation model performing the best overall. The method is illustrated in a breast cancer study, where the goal is to estimate the prevalence of a specific genetic pathogenic variant.
Affiliation(s)
- Michael R Elliot
- Department of Biostatistics, University of Michigan, USA; Survey Methodology Program, Institute for Social Research, USA
- Allison W Kurian
- Departments of Medicine and Epidemiology and Population Health, Stanford University, USA
- Steven J Katz
- Department of Internal Medicine, University of Michigan, USA
|
16
|
Tsvetanova A, Sperrin M, Peek N, Buchan I, Hyland S, Martin GP. Missing data was handled inconsistently in UK prediction models: a review of method used. J Clin Epidemiol 2021; 140:149-158. [PMID: 34520847 DOI: 10.1016/j.jclinepi.2021.09.008] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 03/15/2021] [Revised: 08/17/2021] [Accepted: 09/07/2021] [Indexed: 10/20/2022]
Abstract
OBJECTIVES No clear guidance exists on handling missing data at each stage of developing, validating and implementing a clinical prediction model (CPM). We aimed to review the approaches to handling missing data that underlie the CPMs currently recommended for use in UK healthcare. STUDY DESIGN AND SETTING A descriptive cross-sectional meta-epidemiological study aiming to identify CPMs recommended by the National Institute for Health and Care Excellence (NICE), which summarized how missing data is handled across their pipelines. RESULTS A total of 23 CPMs were included through "sampling strategy." Six missing data strategies were identified: complete case analysis (CCA), multiple imputation, imputation of mean values, k-nearest neighbours imputation, using an additional category for missingness, and considering missing values as risk-factor-absent. 52% of the development articles and 48% of the validation articles did not report how missing data were handled. CCA was the most common approach used for development (40%) and validation (44%). At implementation, 57% of the CPMs required complete data entry, whilst 43% allowed missing values. Three CPMs had consistent paths in their pipelines. CONCLUSION A broad variety of methods for handling missing data underlie the CPMs currently recommended for use in UK healthcare. Missing data handling strategies were generally inconsistent. Better quality assurance of CPMs needs greater clarity and consistency in handling of missing data.
Affiliation(s)
- Antonia Tsvetanova
- Centre for Health Informatics, Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK.
- Matthew Sperrin
- Centre for Health Informatics, Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
- Niels Peek
- Centre for Health Informatics, Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK; NIHR Manchester Biomedical Research Centre, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
- Iain Buchan
- Centre for Health Informatics, Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK; Institute of Population Health, The University of Liverpool, Liverpool, UK
- Glen P Martin
- Centre for Health Informatics, Division of Informatics, Imaging and Data Science, Faculty of Biology, Medicine and Health, University of Manchester, Manchester Academic Health Science Centre, Manchester, UK
|
17
|
A simple and efficient incremental missing data imputation method for evolving neo-fuzzy network. Evolving Systems 2021. [DOI: 10.1007/s12530-021-09376-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
18
|
Gravesteijn BY, Sewalt CA, Venema E, Nieboer D, Steyerberg EW. Missing Data in Prediction Research: A Five-Step Approach for Multiple Imputation, Illustrated in the CENTER-TBI Study. J Neurotrauma 2021; 38:1842-1857. [PMID: 33470157 DOI: 10.1089/neu.2020.7218] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/21/2022]
Abstract
In medical research, missing data is common. In acute diseases, such as traumatic brain injury (TBI), even well-conducted prospective studies may suffer from missing data in baseline characteristics and outcomes. Statistical models may simply drop patients with any missing values, potentially leaving a selected subset of the original cohort. Imputation is widely accepted by methodologists as an appropriate way to deal with missing data. We aim to provide practical guidance on handling missing data for prediction modeling. To this end, we propose a five-step approach, centered around single and multiple imputation: 1) explore the missing data patterns; 2) choose a method of imputation; 3) perform imputation; 4) assess diagnostics of the imputation; and 5) analyze the imputed data sets. We illustrate these five steps with the estimation and validation of the IMPACT (International Mission on Prognosis and Analysis of Clinical Trials in Traumatic Brain Injury) prognostic model in 1375 patients from the CENTER-TBI database, included in 53 centers across 17 countries, with moderate or severe TBI in the prospective European CENTER-TBI study. Future prediction modeling studies in acute diseases may benefit from following the suggested five steps for optimal statistical analysis and interpretation, after maximal effort has been made to minimize missing data.
Affiliation(s)
- Esmee Venema
- Department of Public Health, Erasmus Medical Center, Rotterdam, The Netherlands
- Daan Nieboer
- Department of Public Health, Erasmus Medical Center, Rotterdam, The Netherlands
- Ewout W Steyerberg
- Department of Public Health, Erasmus Medical Center, Rotterdam, The Netherlands; Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
|
19
|
Sisk R, Lin L, Sperrin M, Barrett JK, Tom B, Diaz-Ordaz K, Peek N, Martin GP. Informative presence and observation in routine health data: A review of methodology for clinical risk prediction. J Am Med Inform Assoc 2021; 28:155-166. [PMID: 33164082 PMCID: PMC7810439 DOI: 10.1093/jamia/ocaa242] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/26/2020] [Accepted: 09/17/2020] [Indexed: 12/20/2022]
Abstract
Objective Informative presence (IP) is the phenomenon whereby the presence or absence of patient data is potentially informative with respect to their health condition, with informative observation (IO) being the longitudinal equivalent. These phenomena predominantly exist within routinely collected healthcare data, in which data collection is driven by the clinical requirements of patients and clinicians. The extent to which IP and IO are considered when using such data to develop clinical prediction models (CPMs) is unknown, as is the existing methodology aiming at handling these issues. This review aims to synthesize such existing methodology, thereby helping identify an agenda for future methodological work. Materials and Methods A systematic literature search was conducted by 2 independent reviewers using prespecified keywords. Results Thirty-six articles were included. We categorized the methods presented within as derived predictors (including some representation of the measurement process as a predictor in the model), modeling under IP, and latent structures. Including missing indicators or summary measures as predictors is the most commonly presented approach amongst the included studies (24 of 36 articles). Discussion This is the first review to collate the literature in this area under a prediction framework. A considerable body of relevant literature exists, and we present ways in which the described methods could be developed further. Guidance is required for specifying the conditions under which each method should be used to enable applied prediction modelers to use these methods. Conclusions A growing recognition of IP and IO exists within the literature, and methodology is increasingly becoming available to leverage these phenomena for prediction purposes. IP and IO should be approached differently in a prediction context than when the primary goal is explanation.
The work included in this review has demonstrated theoretical and empirical benefits of incorporating IP and IO, and therefore we recommend that applied health researchers consider incorporating these methods in their work.
Affiliation(s)
- Rose Sisk
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Lijing Lin
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Matthew Sperrin
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
- Jessica K Barrett
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom; Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, United Kingdom
- Brian Tom
- MRC Biostatistics Unit, University of Cambridge, Cambridge, United Kingdom
- Karla Diaz-Ordaz
- Department of Medical Statistics, London School of Hygiene and Tropical Medicine, London, United Kingdom
- Niels Peek
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom; NIHR Biomedical Research Centre, Manchester Academic Health Science Centre, University of Manchester, Manchester, United Kingdom; Alan Turing Institute, University of Manchester, London, United Kingdom
- Glen P Martin
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
|
20
|
Hoogland J, van Barreveld M, Debray TPA, Reitsma JB, Verstraelen TE, Dijkgraaf MGW, Zwinderman AH. Handling missing predictor values when validating and applying a prediction model to new patients. Stat Med 2020; 39:3591-3607. [PMID: 32687233 PMCID: PMC7586995 DOI: 10.1002/sim.8682] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Received: 02/01/2019] [Revised: 05/10/2020] [Accepted: 06/10/2020] [Indexed: 12/23/2022]
Abstract
Missing data present challenges for development and real‐world application of clinical prediction models. While these challenges have received considerable attention in the development setting, there is only sparse research on the handling of missing data in applied settings. The main unique feature of handling missing data in these settings is that missing data methods have to be performed for a single new individual, precluding direct application of mainstay methods used during model development. Correspondingly, we propose that it is desirable to perform model validation using missing data methods that transfer to practice in single new patients. This article compares existing and new methods to account for missing data for a new individual in the context of prediction. These methods are based on (i) submodels based on observed data only, (ii) marginalization over the missing variables, or (iii) imputation based on fully conditional specification (also known as chained equations). They were compared in an internal validation setting to highlight the use of missing data methods that transfer to practice while validating a model. As a reference, they were compared to the use of multiple imputation by chained equations in a set of test patients, because this has been used in validation studies in the past. The methods were evaluated in a simulation study where performance was measured by means of optimism corrected C‐statistic and mean squared prediction error. Furthermore, they were applied in data from a large Dutch cohort of prophylactic implantable cardioverter defibrillator patients.
Affiliation(s)
- Jeroen Hoogland
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Marit van Barreveld
- Department of Clinical Epidemiology, Biostatistics, & Bioinformatics, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, The Netherlands; Heart Center, Department of Cardiology, Amsterdam University Medical Centers, University of Amsterdam, Amsterdam, The Netherlands
- Thomas P A Debray
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands; Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Johannes B Reitsma
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands; Cochrane Netherlands, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Tom E Verstraelen
- Heart Center, Department of Cardiology, Amsterdam University Medical Centers, University of Amsterdam, Amsterdam, The Netherlands
- Marcel G W Dijkgraaf
- Department of Clinical Epidemiology, Biostatistics, & Bioinformatics, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, The Netherlands
- Aeilko H Zwinderman
- Department of Clinical Epidemiology, Biostatistics, & Bioinformatics, Academic Medical Center, Amsterdam University Medical Centers, Amsterdam, The Netherlands
|
21
|
Sperrin M, Martin GP. Multiple imputation with missing indicators as proxies for unmeasured variables: simulation study. BMC Med Res Methodol 2020; 20:185. [PMID: 32640992 PMCID: PMC7346454 DOI: 10.1186/s12874-020-01068-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/18/2019] [Accepted: 06/28/2020] [Indexed: 01/09/2023]
Abstract
Background Within routinely collected health data, missing data for an individual might provide useful information in itself. This occurs, for example, in the case of electronic health records, where the presence or absence of data is informative. While the naive use of missing indicators to try to exploit such information can introduce bias, its use in conjunction with multiple imputation may unlock the potential value of missingness to reduce bias in causal effect estimation, particularly in missing not at random scenarios and where missingness might be associated with unmeasured confounders. Methods We conducted a simulation study to determine when the use of a missing indicator, combined with multiple imputation, would reduce bias for causal effect estimation, under a range of scenarios including unmeasured variables, missing not at random, and missing at random mechanisms. We use directed acyclic graphs and structural models to elucidate a variety of causal structures of interest. We handled missing data using complete case analysis, and multiple imputation with and without missing indicator terms. Results We find that multiple imputation combined with a missing indicator gives minimal bias for causal effect estimation in most scenarios. In particular the approach: 1) does not introduce bias in missing (completely) at random scenarios; 2) reduces bias in missing not at random scenarios where the missing mechanism depends on the missing variable itself; and 3) may reduce or increase bias when unmeasured confounding is present. Conclusion In the presence of missing data, careful use of missing indicators, combined with multiple imputation, can improve causal effect estimation when missingness is informative, and is not detrimental when missingness is at random.
Affiliation(s)
- Matthew Sperrin
- Faculty of Biology, Medicine and Health, Vaughan House, University of Manchester, Manchester, M13 9PL, UK.
- Glen P Martin
- Faculty of Biology, Medicine and Health, Vaughan House, University of Manchester, Manchester, M13 9PL, UK
|
22
|
Sperrin M, Martin GP, Sisk R, Peek N. Missing data should be handled differently for prediction than for description or causal explanation. J Clin Epidemiol 2020; 125:183-187. [PMID: 32540389 DOI: 10.1016/j.jclinepi.2020.03.028] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 08/19/2019] [Revised: 03/10/2020] [Accepted: 03/18/2020] [Indexed: 12/26/2022]
Abstract
Missing data are much studied in epidemiology and statistics. Theoretical development and application of methods for handling missing data have mostly been conducted in the context of prospective research data and with a goal of description or causal explanation. However, it is now common to build predictive models using routinely collected data, where missing patterns may convey important information, and one might take a pragmatic approach to optimizing prediction. Therefore, different methods to handle missing data may be preferred. Furthermore, an underappreciated issue in prediction modeling is that the missing data method used in model development may not match the method used when a model is deployed. This may lead to overoptimistic assessments of model performance. For prediction, particularly with routinely collected data, methods for handling missing data that incorporate information within the missingness pattern should be explored and further developed. Where missing data methods differ between model development and model deployment, the implications of this must be explicitly evaluated. The trade-off between building a prediction model that is causally principled, and building a prediction model that maximizes the use of all available information, should be carefully considered and will depend on the intended use of the model.
Affiliation(s)
- Matthew Sperrin
- Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK.
| | - Glen P Martin
- Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
| | - Rose Sisk
- Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
| | - Niels Peek
- Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK
| |
|
23
|
Kroncke BM, Smith DK, Zuo Y, Glazer AM, Roden DM, Blume JD. A Bayesian method to estimate variant-induced disease penetrance. PLoS Genet 2020; 16:e1008862. [PMID: 32569262 PMCID: PMC7347235 DOI: 10.1371/journal.pgen.1008862] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/14/2019] [Revised: 07/09/2020] [Accepted: 05/14/2020] [Indexed: 01/09/2023] Open
Abstract
A major challenge emerging in genomic medicine is how to assess best disease risk from rare or novel variants found in disease-related genes. The expanding volume of data generated by very large phenotyping efforts coupled to DNA sequence data presents an opportunity to reinterpret genetic liability of disease risk. Here we propose a framework to estimate the probability of disease given the presence of a genetic variant conditioned on features of that variant. We refer to this as the penetrance, the fraction of all variant heterozygotes that will present with disease. We demonstrate this methodology using a well-established disease-gene pair, the cardiac sodium channel gene SCN5A and the heart arrhythmia Brugada syndrome. From a review of 756 publications, we developed a pattern mixture algorithm, based on a Bayesian Beta-Binomial model, to generate SCN5A penetrance probabilities for the Brugada syndrome conditioned on variant-specific attributes. These probabilities are determined from variant-specific features (e.g. function, structural context, and sequence conservation) and from observations of affected and unaffected heterozygotes. Variant functional perturbation and structural context prove most predictive of Brugada syndrome penetrance.
Affiliation(s)
- Brett M. Kroncke
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Vanderbilt Center for Arrhythmia Research and Therapeutics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Department of Pharmacology, Vanderbilt University, Nashville, Tennessee, United States of America
| | - Derek K. Smith
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
| | - Yi Zuo
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
| | - Andrew M. Glazer
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Vanderbilt Center for Arrhythmia Research and Therapeutics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
| | - Dan M. Roden
- Department of Medicine, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Vanderbilt Center for Arrhythmia Research and Therapeutics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Department of Pharmacology, Vanderbilt University, Nashville, Tennessee, United States of America
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
| | - Jeffrey D. Blume
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
| |
|
24
|
Su Y, Chen Y, Tian Z, Lu C, Chen L, Ma X. lncRNAs classifier to accurately predict the recurrence of thymic epithelial tumors. Thorac Cancer 2020; 11:1773-1783. [PMID: 32374079 PMCID: PMC7327696 DOI: 10.1111/1759-7714.13439] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/30/2020] [Revised: 03/28/2020] [Accepted: 03/30/2020] [Indexed: 12/12/2022] Open
Abstract
Background: Long non‐coding RNAs (lncRNAs), which have little or no ability to encode proteins, have attracted special attention due to their potential role in cancer disease. In this study we aimed to establish a lncRNAs classifier to improve the accuracy of recurrence prediction for thymic epithelial tumors (TETs). Methods: TETs RNA sequencing (RNA‐seq) data set and the matched clinicopathologic information were downloaded from the Cancer Genome Atlas. Using univariate Cox regression and least absolute shrinkage and selection operator (LASSO) analysis, we developed a lncRNAs classifier related to recurrence. Functional analysis was conducted to investigate the potential biological processes of the lncRNAs target genes. The independent prognostic factors were identified by Cox regression model. Additionally, predictive ability and clinical application of the lncRNAs classifier were assessed, and compared with the Masaoka staging by receiver operating characteristic (ROC) analysis and decision curve analysis (DCA). Results: Four recurrence‐free survival (RFS)‐related lncRNAs were identified, and the classifier consisting of the identified four lncRNAs was able to effectively divide the patients into high and low risk subgroups, with an area under curve (AUC) of 0.796 (three‐year RFS) and 0.788 (five‐year RFS), respectively. Multivariate analysis indicated that the lncRNAs classifier was an independent recurrence risk factor. The AUC of the lncRNAs classifier in predicting RFS was significantly higher than the Masaoka staging system. Decision curve analysis further demonstrated that the lncRNAs classifier had a larger net benefit than the Masaoka staging system. Conclusions: A lncRNAs classifier for patients with TETs was an independent risk factor for RFS despite other clinicopathologic variables. It generated more accurate estimations of the recurrence probability when compared to the Masaoka staging system, but additional data is required before it can be used in clinical practice.
Affiliation(s)
- Yongchao Su
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
| | - Yongbing Chen
- Department of Thoracic Surgery, The Second Affiliated Hospital of Soochow University, Suzhou, China
| | - Zuochun Tian
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
| | - Chuangang Lu
- Department of Thoracic Surgery, Sanya Central Hospital, Sanya, China
| | - Liang Chen
- Department of Respiratory Medicine, Sanya Central Hospital, Sanya, China
| | - Ximiao Ma
- Department of Thoracic Surgery, Haikou People's Hospital, Haikou, China
| |
|
25
|
Mertens BJA, Banzato E, de Wreede LC. Construction and assessment of prediction rules for binary outcome in the presence of missing predictor data using multiple imputation and cross-validation: Methodological approach and data-based evaluation. Biom J 2020; 62:724-741. [PMID: 32052492 PMCID: PMC7217034 DOI: 10.1002/bimj.201800289] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 09/12/2018] [Revised: 10/18/2019] [Accepted: 11/04/2019] [Indexed: 12/24/2022]
Abstract
We investigate calibration and assessment of predictive rules when missing values are present in the predictors. Our paper has two key objectives. The first is to investigate how the calibration of the prediction rule can be combined with use of multiple imputation to account for missing predictor observations. The second objective is to propose such methods that can be implemented with current multiple imputation software, while allowing for unbiased predictive assessment through validation on new observations for which outcome is not yet available. We commence with a review of the methodological foundations of multiple imputation as a model estimation approach as opposed to a purely algorithmic description. We specifically contrast application of multiple imputation for parameter (effect) estimation with predictive calibration. Based on this review, two approaches are formulated, of which the second utilizes application of the classical Rubin's rules for parameter estimation, while the first approach averages probabilities from models fitted on single imputations to directly approximate the predictive density for future observations. We present implementations using current software that allow for validation and estimation of performance measures by cross-validation, as well as imputation of missing data in predictors on the future data where outcome is missing by definition. To simplify, we restrict discussion to binary outcome and logistic regression throughout. Method performance is verified through application on two real data sets. Accuracy (Brier score) and variance of predicted probabilities are investigated. Results show substantial reductions in variation of calibrated probabilities when using the first approach.
Affiliation(s)
- Bart J A Mertens
- Medical Statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
| | - Erika Banzato
- Department of Statistical Sciences, University of Padova, Padova, Italy
| | - Liesbeth C de Wreede
- Medical Statistics, Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
| |
|
26
|
Groenwold RHH. Informative missingness in electronic health record systems: the curse of knowing. Diagn Progn Res 2020; 4:8. [PMID: 32699824 PMCID: PMC7371469 DOI: 10.1186/s41512-020-00077-0] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 01/07/2020] [Accepted: 04/22/2020] [Indexed: 12/17/2022] Open
Abstract
Electronic health records provide a potentially valuable data source of information for developing clinical prediction models. However, missing data are common in routinely collected health data and often missingness is informative. Informative missingness can be incorporated in a clinical prediction model, for example by including a separate category of a predictor variable that has missing values. The predictive performance of such a model depends on the transportability of the missing data mechanism, which may be compromised once the model is deployed in practice and the predictive value of certain variables becomes known. Using synthetic data, this phenomenon is explained and illustrated.
Affiliation(s)
- Rolf H. H. Groenwold
- Department of Clinical Epidemiology, Leiden University Medical Centre, Leiden, the Netherlands
- Department of Biomedical Data Sciences, Leiden University Medical Centre, Leiden, the Netherlands
| |
|
27
|
Mijderwijk HJ, Steyerberg EW, Steiger HJ, Fischer I, Kamp MA. Fundamentals of Clinical Prediction Modeling for the Neurosurgeon. Neurosurgery 2019; 85:302-311. [DOI: 10.1093/neuros/nyz282] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 11/22/2018] [Accepted: 05/26/2019] [Indexed: 01/18/2023] Open
Abstract
Clinical prediction models in neurosurgery are increasingly reported. These models aim to provide an evidence-based approach to the estimation of the probability of a neurosurgical outcome by combining 2 or more prognostic variables. Model development and model reporting are often suboptimal. A basic understanding of the methodology of clinical prediction modeling is needed when interpreting these models. We address basic statistical background, 7 modeling steps, and requirements of these models such that they may fulfill their potential for major impact for our daily clinical practice and for future scientific work.
Affiliation(s)
- Hendrik-Jan Mijderwijk
- Department of Neurosurgery, Heinrich-Heine University Medical Center, Düsseldorf, Germany
| | - Ewout W Steyerberg
- Department of Public Health, Erasmus MC, Rotterdam, The Netherlands
- Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands
| | - Hans-Jakob Steiger
- Department of Neurosurgery, Heinrich-Heine University Medical Center, Düsseldorf, Germany
| | - Igor Fischer
- Division of Informatics and Data Science, Department of Neurosurgery, Heinrich-Heine University, Düsseldorf, Germany
| | - Marcel A Kamp
- Department of Neurosurgery, Heinrich-Heine University Medical Center, Düsseldorf, Germany
| |
|