1
Abedi V, Misra D, Chaudhary D, Avula V, Schirmer CM, Li J, Zand R. Machine Learning-Based Prediction of Stroke in Emergency Departments. Ther Adv Neurol Disord 2024; 17:17562864241239108. [PMID: 38572394; PMCID: PMC10989051; DOI: 10.1177/17562864241239108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2023] [Accepted: 02/07/2024] [Indexed: 04/05/2024] Open
Abstract
Background: Stroke misdiagnosis, which is associated with poor outcomes, is estimated to occur in 9% of all stroke patients. Objectives: We hypothesized that machine learning (ML) could assist in the diagnosis of ischemic stroke in emergency departments (EDs). Design: The study was conducted and reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidelines. We performed model development and prospective temporal validation using data from pre- and post-COVID periods; we also performed a case study on a small cohort of previously misdiagnosed stroke patients. Methods: We used structured and unstructured electronic health records (EHRs) of 56,452 patient encounters from 13 hospitals in Pennsylvania, from September 2003 to January 2021. ML pipelines, including natural language processing, were created using pre-event clinical data and provider notes from the EDs. Results: Using pre-event information, our model's area under the receiver operating characteristic curve (AUROC) ranged from 0.88 to 0.92, with accuracy in a similar range (0.87-0.90). Using provider notes, we identified five models that reached a balanced performance in terms of AUROC, sensitivity, and specificity. Model AUROC ranged from 0.93 to 0.99. Model sensitivity and specificity reached 0.90 and 0.99, respectively. Four of the top five performing models were based on the post-COVID provider notes; however, no performance difference was observed between models tested on pre- and post-COVID data. Conclusion: This study leveraged pre-event and at-encounter level EHR data for stroke prediction. The results indicate that available clinical information can be used for building EHR-based stroke prediction models and ED stroke alert systems.
Affiliation(s)
- Vida Abedi
  - Department of Public Health Sciences, College of Medicine, The Pennsylvania State University, Hershey, PA, USA
  - Department of Molecular and Functional Genomics, Geisinger Health System, Danville, PA, USA
- Debdipto Misra
  - Division of Informatics, Geisinger Health System, Danville, PA, USA
- Durgesh Chaudhary
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA, USA
  - Department of Neurology, College of Medicine, The Pennsylvania State University, Hershey, PA, USA
- Venkatesh Avula
  - Department of Molecular and Functional Genomics, Geisinger Health System, Danville, PA, USA
- Clemens M. Schirmer
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA, USA
- Jiang Li
  - Department of Molecular and Functional Genomics, Geisinger Health System, Danville, PA, USA
- Ramin Zand
  - Department of Neurology, Pennsylvania State University, 30 Hope Drive, PO Box 859, Hershey, PA 17033-0859, USA
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA, USA
2
Abedi V, Lambert C, Chaudhary D, Rieder E, Avula V, Hwang W, Li J, Zand R. Defining the Age of Young Ischemic Stroke Using Data-Driven Approaches. J Clin Med 2023; 12:jcm12072600. [PMID: 37048683; PMCID: PMC10095415; DOI: 10.3390/jcm12072600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 03/15/2023] [Accepted: 03/21/2023] [Indexed: 04/03/2023] Open
Abstract
Introduction: The cut-point for defining the age of young ischemic stroke (IS) is clinically and epidemiologically important, yet it is arbitrary and differs across studies. In this study, we leveraged electronic health records (EHRs) and data science techniques to estimate an optimal cut-point for defining the age of young IS. Methods: Patient-level EHRs were extracted from 13 hospitals in Pennsylvania and used in two parallel approaches. The first approach used ICD-9/10 codes from IS patients to group comorbidities and computed similarity scores between every patient pair. We determined the optimal age of young IS by analyzing the trend of patient similarity with respect to clinical profile for different ages of index IS. The second approach used the IS cohort and a control cohort (without IS) to build three sets of machine-learning models (generalized linear model [GLM], random forest [RF], and XGBoost [XGB]) to classify patients for seventeen age groups. After extracting feature importance from the models, we determined the optimal age of young IS by analyzing the pattern of comorbidity with respect to the age of index IS. Both approaches were completed separately for male and female patients. Results: The stroke cohort contained 7555 ISs, and the control cohort included 31,067 patients. In the first approach, the optimal age of young stroke was 53.7 and 51.0 years in female and male patients, respectively. In the second approach, we created 102 models, based on three algorithms, 17 age brackets, and two sexes. The optimal age was 53 (GLM), 52 (RF), and 54 (XGB) for female patients, and 52 (GLM and RF) and 53 (XGB) for male patients. Different age and sex groups exhibited different comorbidity patterns. Discussion: Using a data-driven approach, we determined the age of young stroke to be 54 years for women and 52 years for men in our mainly rural population in central Pennsylvania. Future validation studies should include more diverse populations.
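The count of 102 models in the second approach follows from crossing the three design factors named in the abstract. A minimal sketch of that enumeration is given below; the bracket labels are hypothetical placeholders, because the abstract reports only the number of age brackets (17), not their bounds.

```python
from itertools import product

# Factors named in the abstract.
ALGORITHMS = ["GLM", "RF", "XGB"]
SEXES = ["female", "male"]

# Hypothetical labels: the abstract gives the number of brackets, not their bounds.
AGE_BRACKETS = [f"bracket_{i:02d}" for i in range(1, 18)]

# One model per (algorithm, age bracket, sex) combination.
model_grid = list(product(ALGORITHMS, AGE_BRACKETS, SEXES))

print(len(model_grid))  # 102
```

Each tuple in `model_grid` identifies one classifier that would be trained on the matching sex-specific cohort, which is how the per-algorithm, per-sex optimal ages in the results can be read.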
Affiliation(s)
- Vida Abedi
  - Department of Molecular and Functional Genomics, Weis Center for Research, Geisinger Health System, Danville, PA 17822, USA
  - Department of Public Health Sciences, College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
- Clare Lambert
  - Department of Neurology, Yale New Haven Hospital, New Haven, CT 06510, USA
- Durgesh Chaudhary
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA 17822, USA
  - Department of Neurology, College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
- Emily Rieder
  - Geisinger Commonwealth School of Medicine, Scranton, PA 18509, USA
- Venkatesh Avula
  - Department of Molecular and Functional Genomics, Weis Center for Research, Geisinger Health System, Danville, PA 17822, USA
- Wenke Hwang
  - Department of Public Health Sciences, College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
- Jiang Li
  - Department of Molecular and Functional Genomics, Weis Center for Research, Geisinger Health System, Danville, PA 17822, USA
- Ramin Zand
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA 17822, USA
  - Department of Neurology, College of Medicine, The Pennsylvania State University, Hershey, PA 17033, USA
  - Correspondence: Tel.: +1-(717)-531-1804; Fax: +1-(717)-531-0384
3
Abedi V, Razavi SM, Khan A, Avula V, Tompe A, Poursoroush A, Vafaei Sadr A, Li J, Zand R. Artificial Intelligence: A Shifting Paradigm in Cardio-Cerebrovascular Medicine. J Clin Med 2021; 10:5710. [PMID: 34884412; DOI: 10.3390/jcm10235710] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/15/2021] [Accepted: 12/02/2021] [Indexed: 12/21/2022] Open
Abstract
The future of healthcare is an organic blend of technology, innovation, and human connection. As artificial intelligence (AI) gradually becomes a go-to technology in healthcare to improve efficiency and outcomes, we must understand our limitations. We should realize that our goal is not only to provide faster and more efficient care, but also to deliver an integrated solution that ensures care is fair and not biased against any subpopulation. In this context, the field of cardio-cerebrovascular diseases, which encompasses a wide range of conditions from heart failure to stroke, has made some advances in providing assistive tools to care providers. This article aimed to provide an overall thematic review of recent developments, focusing on various AI applications in cardio-cerebrovascular diseases, to identify gaps and potential areas of improvement. If well designed, technological engines have the potential to improve healthcare access and equitability while reducing overall costs, diagnostic errors, and disparity in a system that affects patients and providers and strives for efficiency.
4
Li J, Yan XS, Chaudhary D, Avula V, Mudiganti S, Husby H, Shahjouei S, Afshar A, Stewart WF, Yeasin M, Zand R, Abedi V. Imputation of missing values for electronic health record laboratory data. NPJ Digit Med 2021; 4:147. [PMID: 34635760; DOI: 10.1038/s41746-021-00518-0] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 06/01/2021] [Accepted: 09/13/2021] [Indexed: 11/08/2022] Open
Abstract
Laboratory data from electronic health records (EHR) are often used in prediction models, where the estimation bias and loss of model performance caused by missingness can be mitigated using imputation methods. We demonstrate the utility of imputation in two real-world EHR-derived cohorts, one of ischemic stroke from Geisinger and one of heart failure from Sutter Health, to: (1) characterize the patterns of missingness in laboratory variables; (2) simulate two missing mechanisms, arbitrary and monotone; (3) compare cross-sectional and multi-level multivariate missing imputation algorithms applied to laboratory data; (4) assess whether incorporation of latent information, derived from comorbidity data, can improve the performance of the algorithms. The latter was based on a case study of hemoglobin A1c under a univariate missing imputation framework. Overall, the pattern of missingness in EHR laboratory variables was not at random and was highly associated with patients' comorbidity data, and the multi-level imputation algorithm showed smaller imputation error than the cross-sectional method.
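The two simulated mechanisms in point (2) differ in the shape of the missing-data mask rather than in the values themselves. The sketch below illustrates that difference only; it is not the procedure used in the paper. The function names, the `rate` parameter, and the column ordering are all assumptions made for illustration.

```python
import random

def mask_arbitrary(n_rows, n_cols, rate, rng):
    """True marks a missing cell; every cell is dropped independently."""
    return [[rng.random() < rate for _ in range(n_cols)] for _ in range(n_rows)]

def mask_monotone(n_rows, n_cols, rate, rng):
    """Once a column is missing in a row, every later column is missing too.

    Here `rate` is the share of rows that drop out at all (an assumption),
    not the share of missing cells.
    """
    mask = []
    for _ in range(n_rows):
        # Rows that drop out get a random dropout column; others stay complete.
        start = rng.randrange(1, n_cols) if rng.random() < rate else n_cols
        mask.append([j >= start for j in range(n_cols)])
    return mask

def is_monotone(mask):
    """A row is monotone when all observed cells precede all missing cells."""
    return all(row == sorted(row) for row in mask)

rng = random.Random(0)
arbitrary = mask_arbitrary(200, 5, 0.3, rng)
monotone = mask_monotone(200, 5, 0.3, rng)
print(is_monotone(monotone))  # True
```

Applying either mask to a complete laboratory table and then imputing the hidden cells allows the imputation error to be measured against known values, which is the usual reason for simulating missingness.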
5
Abedi V, Avula V, Razavi SM, Bavishi S, Chaudhary D, Shahjouei S, Wang M, Griessenauer CJ, Li J, Zand R. Predicting short and long-term mortality after acute ischemic stroke using EHR. J Neurol Sci 2021; 427:117560. [PMID: 34218182; DOI: 10.1016/j.jns.2021.117560] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/22/2021] [Revised: 06/21/2021] [Accepted: 06/25/2021] [Indexed: 12/14/2022]
Abstract
OBJECTIVE: Despite improvements in treatment, stroke remains a leading cause of mortality and long-term disability. In this study, we leveraged administrative data to build predictive models of short- and long-term post-stroke all-cause mortality. METHODS: The study was conducted and reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guideline. We used patient-level data from electronic health records, three algorithms, and six prediction windows to develop models for post-stroke mortality. RESULTS: We included 7144 patients, of whom 5347 had survived two years after their ischemic stroke. Mortality ranged from 8% (605/7144) within the 1-month window to 25% (1797/7144) within the 2-year window. The three most common comorbidities were hypertension, dyslipidemia, and diabetes. The best area under the ROC curve (AUROC), 0.82, was reached with the Random Forest model for the 1-month prediction window. The negative predictive value (NPV) was highest for the shorter prediction windows (0.91 for the 1-month window), and the best positive predictive value (PPV), 0.92, was reached for the 6-month prediction window. Age, hemoglobin levels, and body mass index were the top associated factors. Laboratory variables had higher importance than past medical history and comorbidities. Hypercoagulable state, smoking, and end-stage renal disease were more strongly associated with long-term mortality. CONCLUSION: All the selected algorithms could be trained to predict short- and long-term mortality after stroke. The factors associated with mortality differed depending on the prediction window. Our classifier highlighted the importance of controlling risk factors, as indicated by laboratory measures.
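The mortality proportions in the results can be checked directly from the reported counts, and PPV and NPV follow from a confusion matrix. The sketch below does both; the cohort counts come from the abstract, while the confusion-matrix counts are hypothetical and chosen only to show the formulas.

```python
def predictive_values(tp, fp, tn, fn):
    """Return (PPV, NPV) from confusion-matrix counts."""
    ppv = tp / (tp + fp)  # share of predicted deaths that were deaths
    npv = tn / (tn + fn)  # share of predicted survivors that survived
    return ppv, npv

# Counts reported in the abstract.
total = 7144
deaths_1m = 605
deaths_2y = 1797

print(round(100 * deaths_1m / total, 1))  # 8.5
print(round(100 * deaths_2y / total, 1))  # 25.2
print(total - deaths_2y)                  # 5347

# Hypothetical confusion-matrix counts, not taken from the paper.
ppv, npv = predictive_values(tp=80, fp=20, tn=850, fn=50)
print(ppv)  # 0.8
```

Because both values depend on outcome prevalence, which rises from 8% to 25% across the prediction windows, some variation in PPV and NPV between windows is expected even for a classifier of fixed sensitivity and specificity.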
6
Darabi N, Hosseinichimeh N, Noto A, Zand R, Abedi V. Machine Learning-Enabled 30-Day Readmission Model for Stroke Patients. Front Neurol 2021; 12:638267. [PMID: 33868147; PMCID: PMC8044392; DOI: 10.3389/fneur.2021.638267] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 12/05/2020] [Accepted: 03/08/2021] [Indexed: 11/18/2022] Open
Abstract
Background and Purpose: Hospital readmissions impose a substantial burden on the healthcare system. Reducing readmissions after stroke could lead to improved quality of care, especially since stroke is associated with a high rate of readmission. The goal of this study is to enhance our understanding of the predictors of 30-day readmission after ischemic stroke and develop models to identify high-risk individuals for targeted interventions. Methods: We used patient-level data from electronic health records (EHR), five machine learning algorithms (random forest, gradient boosting machine, extreme gradient boosting [XGBoost], support vector machine, and logistic regression [LR]), a data-driven feature selection strategy, and adaptive sampling to develop 15 models of 30-day readmission after ischemic stroke. We further identified important clinical variables. Results: We included 3,184 patients with ischemic stroke (mean age: 71 ± 13.90 years, men: 51.06%). Among the 61 clinical variables included in the model, a National Institutes of Health Stroke Scale score above 24, insertion of an indwelling urinary catheter, hypercoagulable state, and percutaneous gastrostomy had the highest importance scores. The model's AUC (area under the curve) for predicting 30-day readmission was 0.74 (95% CI: 0.64-0.78), with a PPV of 0.43, when the XGBoost algorithm was used with ROSE-sampling. The balance between specificity and sensitivity improved through the sampling strategy. The best sensitivity was achieved with LR when optimized with feature selection and ROSE-sampling (AUC: 0.64, sensitivity: 0.53, specificity: 0.69). Conclusions: Machine learning-based models can be designed to predict 30-day readmission after stroke using structured data from EHR. Among the algorithms analyzed, XGBoost with ROSE-sampling had the best performance in terms of AUC, while LR with ROSE-sampling and feature selection had the best sensitivity. Clinical variables highly associated with 30-day readmission could be targeted for personalized interventions. Depending on healthcare systems' resources and criteria, models with optimized performance metrics can be implemented to improve outcomes.
Affiliation(s)
- Negar Darabi
  - Department of Industrial and Systems Engineering, Virginia Tech, Falls Church, VA, United States
- Niyousha Hosseinichimeh
  - Department of Industrial and Systems Engineering, Virginia Tech, Falls Church, VA, United States
- Anthony Noto
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA, United States
- Ramin Zand
  - Geisinger Neuroscience Institute, Geisinger Health System, Danville, PA, United States
- Vida Abedi
  - Department of Molecular and Functional Genomics, Geisinger Health System, Danville, PA, United States
  - Biocomplexity Institute, Virginia Tech, Blacksburg, VA, United States
7
Abedi V, Avula V, Chaudhary D, Shahjouei S, Khan A, Griessenauer CJ, Li J, Zand R. Prediction of Long-Term Stroke Recurrence Using Machine Learning Models. J Clin Med 2021; 10:1286. [PMID: 33804724; DOI: 10.3390/jcm10061286] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 01/30/2021] [Revised: 03/15/2021] [Accepted: 03/16/2021] [Indexed: 01/01/2023] Open
Abstract
Background: The long-term risk of recurrent ischemic stroke, estimated to be between 17% and 30%, cannot be reliably assessed at an individual level. Our goal was to study whether machine-learning models can be trained to predict stroke recurrence, to identify key clinical variables, and to assess whether performance metrics can be optimized. Methods: We used patient-level data from electronic health records, six interpretable algorithms (Logistic Regression, Extreme Gradient Boosting, Gradient Boosting Machine, Random Forest, Support Vector Machine, Decision Tree), four feature selection strategies, five prediction windows, and two sampling strategies to develop 288 models for up to 5-year stroke recurrence prediction. We further identified important clinical features and different optimization strategies. Results: We included 2091 ischemic stroke patients. Model area under the receiver operating characteristic curve (AUROC) was stable for prediction windows of 1, 2, 3, 4, and 5 years, with the highest score for the 1-year (0.79) and the lowest score for the 5-year prediction window (0.69). A total of 21 (7%) models reached an AUROC above 0.73, while 110 (38%) models reached an AUROC greater than 0.7. Among the 53 features analyzed, age, body mass index, and laboratory-based features (such as high-density lipoprotein, hemoglobin A1c, and creatinine) had the highest overall importance scores. The balance between specificity and sensitivity improved through sampling strategies. Conclusion: All six selected algorithms could be trained to predict long-term stroke recurrence, and laboratory-based variables were highly associated with stroke recurrence. The latter could be targeted for personalized interventions. Model performance metrics could be optimized, and models can be implemented in the same healthcare system as intelligent decision support for targeted intervention.
8
Misra D, Avula V, Wolk DM, Farag HA, Li J, Mehta YB, Sandhu R, Karunakaran B, Kethireddy S, Zand R, Abedi V. Early Detection of Septic Shock Onset Using Interpretable Machine Learners. J Clin Med 2021; 10:301. [PMID: 33467539; DOI: 10.3390/jcm10020301] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 11/12/2020] [Revised: 12/31/2020] [Accepted: 01/12/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: Developing a decision support system based on advances in machine learning is one area for strategic innovation in healthcare. Predicting a patient's progression to septic shock is an active field of translational research. The goal of this study was to develop a working model of a clinical decision support system for predicting septic shock in an acute care setting for up to 6 h from the time of admission in an integrated healthcare setting. METHOD: Clinical data from the electronic health record (EHR), at encounter level, were used to build a predictive model for progression from sepsis to septic shock up to 6 h from the time of admission; that is, T = 1, 3, and 6 h from admission. Eight different machine learning algorithms (Random Forest, XGBoost, C5.0, Decision Trees, Boosted Logistic Regression, Support Vector Machine, Logistic Regression, Regularized Logistic, and Bayes Generalized Linear Model) were used for model development. Two adaptive sampling strategies were used to address the class imbalance. Data from two sources (clinical and billing codes) were used to define the cases (septic shock) using the Centers for Medicare & Medicaid Services (CMS) sepsis criteria. The model assessment was performed using the area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity. Model predictions for each feature window (1, 3, and 6 h from admission) were consolidated. RESULTS: Retrospective data from April 2005 to September 2018 were extracted from the EHR, insurance claims, billing, and laboratory systems to create a dataset for septic shock detection. The clinical criteria and billing information were used to label patients into two classes (septic shock patients and sepsis patients) at three different time points from admission, creating two different case-control cohorts. Data from 45,425 unique in-patient visits were used to build 96 prediction models comparing the clinical-based definition with billing-based information as the gold standard. Of the 24 consolidated models (based on eight machine learning algorithms and three feature windows), four models reached an AUROC greater than 0.9. Overall, all the consolidated models reached an AUROC of at least 0.8820. Based on the AUROC of 0.9483, the best model was based on Random Forest, with a sensitivity of 83.9% and specificity of 88.1%. The sepsis detection window at 6 h outperformed the 1-h and 3-h windows. The sepsis definition based on clinical variables had improved performance when compared to the sepsis definition based on only billing information. CONCLUSION: This study corroborated that machine learning models can be developed to predict septic shock using clinical and administrative data. However, models that used clinical information to define septic shock outperformed models developed based on only administrative data. Intelligent decision support tools can be developed and integrated into the EHR to improve clinical outcomes and facilitate the optimization of resources in real time.
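The three assessment metrics named in the abstract (AUROC, sensitivity, and specificity) can be computed from labels and predicted scores without any library. The sketch below uses the rank-based definition of AUROC, the probability that a randomly chosen case scores higher than a randomly chosen control. The labels, scores, and the 0.5 threshold are hypothetical illustration values, not data from the study.

```python
def auroc(labels, scores):
    """AUROC as the share of (case, control) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def sensitivity_specificity(labels, scores, threshold):
    """Classify as positive when score >= threshold, then return (sensitivity, specificity)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels (1 = septic shock, 0 = sepsis only) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.5, 0.2, 0.1, 0.7]

print(auroc(labels, scores))                         # 0.9375
print(sensitivity_specificity(labels, scores, 0.5))  # (0.75, 0.75)
```

AUROC is independent of the threshold, whereas sensitivity and specificity trade off against each other as the threshold moves, which is why the abstract reports all three.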