1. Duckworth C, Burns D, Fernandez CL, Wright M, Leyland R, Stammers M, George M, Boniface M. Predicting onward care needs at admission to reduce discharge delay using explainable machine learning. Sci Rep 2025; 15:16033. PMID: 40341633; PMCID: PMC12062306; DOI: 10.1038/s41598-025-00825-6.
Abstract
Early identification of patients who require onward referral to social care can prevent delays to discharge from hospital. We introduce an explainable machine learning (ML) model to identify potential social care needs at the first point of admission. The model was trained on routinely collected data on patient admissions, hospital spells, and discharges at a large tertiary hospital in the UK between 2017 and 2023. Its performance (one-vs-rest AUROC = 0.915 [95% CI 0.907-0.924]) is comparable to clinicians' predictions of discharge care needs, despite the model working with only a subset of the information available to the clinician. We find that ML and clinicians each perform better at identifying different types of care needs, highlighting the added value of a potential system supporting decision making. We also demonstrate that ML can provide automated initial discharge-need assessments, with accompanying reasoning, when initial clinical assessment is delayed. Finally, we show that combining clinician and machine predictions in a hybrid model provides even more accurate early predictions of onward social care requirements (OVR AUROC = 0.936 [95% CI 0.928-0.943]), demonstrating the potential for human-in-the-loop decision-support systems in clinical practice.
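As a concrete illustration of the metrics above, the sketch below computes a one-vs-rest AUROC for a multi-class care-needs classifier with scikit-learn and fuses clinician and model outputs by simple probability averaging. The class labels, probabilities, and the averaging rule are illustrative assumptions, not the authors' data or fusion method.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels: 0 = no onward care need, 1 = home care, 2 = residential care.
y_true = np.array([0, 2, 1, 0, 1, 2, 0, 1])

# Per-class probability estimates from an ML model (rows sum to 1).
p_ml = np.array([
    [0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2], [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2], [0.2, 0.2, 0.6], [0.8, 0.1, 0.1], [0.3, 0.4, 0.3],
])

# Clinician assessments encoded as one-hot "probabilities" (illustrative only).
p_clin = np.eye(3)[np.array([0, 2, 1, 0, 2, 2, 0, 1])]

# One-vs-rest AUROC for the ML model alone.
auc_ml = roc_auc_score(y_true, p_ml, multi_class="ovr")

# A simple hybrid: average the two probability estimates (an assumed fusion rule).
p_hybrid = 0.5 * p_ml + 0.5 * p_clin
auc_hybrid = roc_auc_score(y_true, p_hybrid, multi_class="ovr")
print(f"ML OVR AUROC: {auc_ml:.3f}, hybrid OVR AUROC: {auc_hybrid:.3f}")
```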
Affiliation(s)
- Chris Duckworth
- IT Innovation Centre, Digital Health and Biomedical Engineering, University of Southampton, Southampton, UK
- Dan Burns
- IT Innovation Centre, Digital Health and Biomedical Engineering, University of Southampton, Southampton, UK
- Mark Wright
- University Hospital Southampton Foundation Trust, Southampton, UK
- Rachael Leyland
- University Hospital Southampton Foundation Trust, Southampton, UK
- Matthew Stammers
- Southampton Emerging Therapies and Technologies Centre, University Hospital Southampton Foundation Trust, Southampton, UK
- Michael George
- Southampton Emerging Therapies and Technologies Centre, University Hospital Southampton Foundation Trust, Southampton, UK
- Michael Boniface
- IT Innovation Centre, Digital Health and Biomedical Engineering, University of Southampton, Southampton, UK
2. Brant A, Singh P, Yin X, Yang L, Nayar J, Jeji D, Matias Y, Corrado GS, Webster DR, Virmani S, Meenu A, Kannan NB, Krause J, Thng F, Peng L, Liu Y, Widner K, Ramasamy K. Performance of a Deep Learning Diabetic Retinopathy Algorithm in India. JAMA Netw Open 2025; 8:e250984. PMID: 40105843; PMCID: PMC11923701; DOI: 10.1001/jamanetworkopen.2025.0984.
Abstract
Importance While prospective studies have investigated the accuracy of artificial intelligence (AI) for detection of diabetic retinopathy (DR) and diabetic macular edema (DME), to date, little published data exist on the clinical performance of these algorithms. Objective To evaluate the clinical performance of an automated retinal disease assessment (ARDA) algorithm in the postdeployment setting at Aravind Eye Hospital in India. Design, Setting, and Participants This cross-sectional analysis involved an approximately 1% sample of fundus photographs from patients screened using ARDA. Images were graded via adjudication by US ophthalmologists for DR and DME, and ARDA's output was compared against the adjudicated grades at 45 sites in Southern India. Patients were randomly selected between January 1, 2019, and July 31, 2023. Main Outcomes and Measures Primary analyses were the sensitivity and specificity of ARDA for severe nonproliferative DR (NPDR) or proliferative DR (PDR). Secondary analyses focused on sensitivity and specificity for sight-threatening DR (STDR) (DME or severe NPDR or PDR). Results Among the 4537 patients with 4537 images with adjudicated grades, mean (SD) age was 55.2 (11.9) years and 2272 (50.1%) were male. Among the 3941 patients with gradable photographs, 683 (17.3%) had any DR, 146 (3.7%) had severe NPDR or PDR, 109 (2.8%) had PDR, and 398 (10.1%) had STDR. ARDA's sensitivity and specificity for severe NPDR or PDR were 97.0% (95% CI, 92.6%-99.2%) and 96.4% (95% CI, 95.7%-97.0%), respectively. Positive predictive value (PPV) was 50.7% and negative predictive value (NPV) was 99.9%. The clinically important miss rate for severe NPDR or PDR was 0% (ie, the patients with severe NPDR or PDR graded lower were interpreted as having moderate DR and were still referred to clinic). ARDA's sensitivity for STDR was 95.9% (95% CI, 93.0%-97.4%) and specificity was 94.9% (95% CI, 94.1%-95.7%); PPV and NPV were 67.9% and 99.5%, respectively. Conclusions and Relevance In this cross-sectional study of ARDA's clinical performance, sensitivity and specificity for severe NPDR and PDR exceeded 96%, and the algorithm caught 100% of patients with severe NPDR and PDR for ophthalmology referral. This preliminary large-scale postmarketing report of ARDA's performance after screening 600 000 patients in India underscores the importance of monitoring and publishing an algorithm's clinical performance, consistent with recommendations by regulatory bodies.
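The reported sensitivity, specificity, PPV, and NPV all derive from a single 2x2 confusion matrix; a minimal sketch follows. The counts are illustrative values chosen to approximate the reported rates, not the study data; they show how low disease prevalence yields a very high NPV alongside a modest PPV.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard screening metrics from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true cases the screen catches
        "specificity": tn / (tn + fp),  # share of non-cases correctly cleared
        "ppv": tp / (tp + fp),          # probability a positive screen is a true case
        "npv": tn / (tn + fn),          # probability a negative screen is truly negative
    }

# Illustrative counts only (not the study data): ~146 severe cases among
# ~3941 gradable images, mirroring the prevalence reported above.
print(screening_metrics(tp=140, fp=136, fn=6, tn=3659))
```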
Affiliation(s)
- Xiang Yin
- Google LLC, Mountain View, California
- Lu Yang
- Google LLC, Mountain View, California
- Jay Nayar
- Google LLC, Mountain View, California
- Lily Peng
- Verily Life Sciences LLC, South San Francisco, California
- Yun Liu
- Google LLC, Mountain View, California
3. Tariq A, Kaur G, Su L, Gichoya J, Patel B, Banerjee I. Adaptable graph neural networks design to support generalizability for clinical event prediction. J Biomed Inform 2025; 163:104794. PMID: 39956347; PMCID: PMC11917466; DOI: 10.1016/j.jbi.2025.104794.
Abstract
OBJECTIVE While many machine learning and deep learning-based models for clinical event prediction leverage various data elements from electronic health records, such as patient demographics and billing codes, such models face severe challenges when tested outside their institution of training. These challenges are rooted not only in differences in patient population characteristics but also in the medical practice patterns of different institutions. METHOD We propose a solution to this problem through a systematically adaptable design of graph-based convolutional neural networks (GCNN) for clinical event prediction. Our solution relies on a unique property of GCNNs: data encoded as graph edges is only implicitly used during the prediction process and can be adapted after model training without requiring re-training. RESULTS Our adaptable GCNN-based prediction models outperformed all comparative models during external validation for two different clinical problems, while supporting multimodal data integration. For prediction of hospital discharge and mortality, the comparative fusion baseline model achieved 0.58 [0.52-0.59] and 0.81 [0.80-0.82] AUROC on the external dataset, while the GCNN achieved 0.70 [0.68-0.70] and 0.91 [0.90-0.92], respectively. For prediction of future unplanned transfusion, we observed an even larger performance gap due to missing/incomplete data in the external dataset: late fusion achieved 0.44 [0.31-0.56], while the GCNN model achieved 0.70 [0.62-0.84]. CONCLUSION These results support our hypothesis that carefully designed GCNN-based models can overcome the generalization challenges faced by prediction models.
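The adaptability property described in the METHOD section can be sketched in a few lines: in a graph convolution, the adjacency matrix enters only at prediction time, so trained weights can be reused with a different, site-specific graph. Below is a minimal plain-PyTorch sketch of a simplified GCN propagation rule, not the authors' architecture.

```python
import torch

class GCNLayer(torch.nn.Module):
    """Simplified graph convolution: H' = ReLU(D^-1 (A + I) H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_hat = adj + torch.eye(adj.shape[0])   # add self-loops
        deg = a_hat.sum(dim=1, keepdim=True)    # node degrees for row-normalization
        return torch.relu((a_hat / deg) @ self.linear(x))

layer = GCNLayer(in_dim=16, out_dim=8)
x = torch.randn(5, 16)        # 5 patients, 16 features each

# The same trained weights can be applied with different, site-specific edges;
# only the adjacency changes, with no retraining of `layer`.
adj_internal = torch.ones(5, 5)   # dense graph at the training institution
adj_external = torch.eye(5)       # sparser graph at an external institution
out_internal = layer(x, adj_internal)
out_external = layer(x, adj_external)
```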
Affiliation(s)
- Amara Tariq
- Arizona Advanced AI (A3I) Hub, Mayo Clinic Arizona, United States
- Gurkiran Kaur
- Department of Radiology, Mayo Clinic, AZ, United States
- Leon Su
- Department of Laboratory Medicine and Pathology, Mayo Clinic, AZ, United States
- Judy Gichoya
- Department of Radiology, Emory University, GA, United States
- Bhavik Patel
- Department of Radiology, Mayo Clinic, AZ, United States; School of Computing and Augmented Intelligence, Arizona State University, AZ, United States; Arizona Advanced AI (A3I) Hub, Mayo Clinic Arizona, United States
- Imon Banerjee
- Department of Radiology, Mayo Clinic, AZ, United States; School of Computing and Augmented Intelligence, Arizona State University, AZ, United States; Arizona Advanced AI (A3I) Hub, Mayo Clinic Arizona, United States
4. Meijerink LM, Dunias ZS, Leeuwenberg AM, de Hond AAH, Jenkins DA, Martin GP, Sperrin M, Peek N, Spijker R, Hooft L, Moons KGM, van Smeden M, Schuit E. Updating methods for artificial intelligence-based clinical prediction models: a scoping review. J Clin Epidemiol 2025; 178:111636. PMID: 39662644; DOI: 10.1016/j.jclinepi.2024.111636.
Abstract
OBJECTIVES To give an overview of methods for updating artificial intelligence (AI)-based clinical prediction models based on new data. STUDY DESIGN AND SETTING We comprehensively searched Scopus and Embase up to August 2022 for articles that addressed developments, descriptions, or evaluations of prediction model updating methods. We specifically focused on articles in the medical domain involving AI-based prediction models that were updated based on new data, excluding regression-based updating methods as these have been extensively discussed elsewhere. We categorized and described the identified methods used to update the AI-based prediction models, as well as the use cases in which they were used. RESULTS We included 78 articles. The majority of the included articles discussed updating for neural network methods (93.6%), with medical images as input data (65.4%). In many articles (51.3%), existing pretrained models for broad tasks were updated to perform specialized clinical tasks. Other common reasons for model updating were to address changes in the data over time and cross-center differences; however, more unique use cases were also identified, such as updating a model from a broad population to a specific individual. We categorized the identified model updating methods into four categories: neural network-specific methods (described in 92.3% of the articles), ensemble-specific methods (2.5%), model-agnostic methods (9.0%), and other (1.3%). Variations of neural network-specific methods are further categorized based on (1) the part of the original neural network that is kept, (2) whether and how the original neural network is extended with new parameters, and (3) to what extent the original neural network parameters are adjusted to the new data. The most frequently occurring method (n = 30) involved selecting the first layer(s) of an existing neural network, appending new, randomly initialized layers, and then optimizing the entire neural network. CONCLUSION We identified many ways to adjust or update AI-based prediction models based on new data, within a large variety of use cases. Updating methods for AI-based prediction models other than neural networks (eg, random forest) appear to be underexplored in clinical prediction research. PLAIN LANGUAGE SUMMARY AI-based prediction models are increasingly used in health care, helping clinicians with diagnosing diseases, guiding treatment decisions, and informing patients. However, these prediction models do not always work well when applied to hospitals, patient populations, or times different from those used to develop the models. Developing new models for every situation is neither practical nor desirable, as it wastes resources, time, and existing knowledge. A more efficient approach is to adjust existing models to new contexts ('updating'), but there is limited guidance on how to do this for AI-based clinical prediction models. To address this, we reviewed 78 studies in detail to understand how researchers are currently updating AI-based clinical prediction models and the types of situations in which these updating methods are used. Our findings provide a comprehensive overview of the available methods for updating existing models, intended to serve as guidance and inspiration for researchers. Ultimately, this can lead to better reuse of existing models and improve the quality and efficiency of AI-based prediction models in health care.
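The most frequent updating method identified (keep the first layers, append new randomly initialized layers, then optimize the entire network) can be sketched as follows. The architecture, dimensions, and training loop are illustrative PyTorch assumptions, not any specific reviewed model.

```python
import torch
import torch.nn as nn

# A pretrained network (stand-in for any existing clinical prediction model).
pretrained = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

# Keep the first layer(s) of the original network, dropping the old head...
backbone = nn.Sequential(*list(pretrained.children())[:4])

# ...append new, randomly initialized layers for the new task or site...
updated = nn.Sequential(backbone, nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

# ...and optimize the ENTIRE network on the new data (no layers frozen).
optimizer = torch.optim.Adam(updated.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()
x_new = torch.randn(128, 32)                       # new-site features (synthetic)
y_new = torch.randint(0, 2, (128, 1)).float()      # new-site labels (synthetic)
for _ in range(5):                                 # a few illustrative epochs
    optimizer.zero_grad()
    loss = loss_fn(updated(x_new), y_new)
    loss.backward()
    optimizer.step()
```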
Affiliation(s)
- Lotta M Meijerink
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Zoë S Dunias
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Artuur M Leeuwenberg
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Anne A H de Hond
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- David A Jenkins
- Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
- Glen P Martin
- Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
- Matthew Sperrin
- Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, United Kingdom
- Niels Peek
- Department of Public Health and Primary Care, The Healthcare Improvement Studies Institute, University of Cambridge, Cambridge, United Kingdom
- René Spijker
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Lotty Hooft
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Karel G M Moons
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Maarten van Smeden
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
- Ewoud Schuit
- Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands
5. Ogwel B, Mzazi VH, Awuor AO, Okonji C, Anyango RO, Oreso C, Ochieng JB, Munga S, Nasrin D, Tickell KD, Pavlinac PB, Kotloff KL, Omore R. Derivation and validation of a clinical predictive model for longer duration diarrhea among pediatric patients in Kenya using machine learning algorithms. BMC Med Inform Decis Mak 2025; 25:28. PMID: 39815316; PMCID: PMC11737202; DOI: 10.1186/s12911-025-02855-6.
Abstract
BACKGROUND Despite the adverse health outcomes associated with longer duration diarrhea (LDD), there are currently no clinical decision tools for timely identification and better management of children at increased risk. This study uses machine learning (ML) to derive and validate a predictive model for LDD among children presenting with diarrhea to health facilities. METHODS LDD was defined as a diarrhea episode lasting ≥ 7 days. We used 7 ML algorithms to build prognostic models for the prediction of LDD among children < 5 years, using de-identified data from the Vaccine Impact on Diarrhea in Africa (VIDA) study (N = 1,482) for model development and data from the Enterics for Global Health (EFGH) Shigella study (N = 682) for temporal validation of the champion model. Features included demographic, medical history, and clinical examination data collected at enrolment in both studies. We conducted split-sampling and employed K-fold cross-validation with an over-sampling technique in model development. Moreover, critical predictors of LDD and their impact on prediction were obtained using an explainable, model-agnostic approach. The champion model was determined based on the area under the curve (AUC) metric. Model calibration was assessed using the Brier score and Spiegelhalter's z-test with its accompanying p-value. RESULTS There was a significant difference in the prevalence of LDD between the development and temporal validation cohorts (478 [32.3%] vs 69 [10.1%]; p < 0.001). The following variables were associated with LDD, in decreasing order of importance: pre-enrolment diarrhea days (55.1%), modified Vesikari score (18.2%), age group (10.7%), vomit days (8.8%), respiratory rate (6.5%), vomiting (6.4%), vomit frequency (6.2%), rotavirus vaccination (6.1%), skin pinch (2.4%), and stool frequency (2.4%). While all models showed good predictive capability, the random forest model achieved the best performance (AUC [95% Confidence Interval]: 83.0 [78.6-87.5] and 71.0 [62.5-79.4] on the development and temporal validation datasets, respectively). While the random forest model showed slight deviations from perfect calibration, these deviations were not statistically significant (Brier score = 0.17, Spiegelhalter p-value = 0.219). CONCLUSIONS Our study suggests ML-derived algorithms could be used to rapidly identify children at increased risk of LDD. Integrating ML-derived models into clinical decision-making may allow clinicians to target these children with closer observation and enhanced management.
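The calibration checks mentioned (Brier score and Spiegelhalter's z-test) are straightforward to compute; below is a sketch using the standard formulation of Spiegelhalter's statistic, with simulated predictions and outcomes standing in for the study data.

```python
import numpy as np
from scipy.stats import norm

def brier_score(y: np.ndarray, p: np.ndarray) -> float:
    """Mean squared difference between outcomes and predicted risks."""
    return float(np.mean((y - p) ** 2))

def spiegelhalter_z(y: np.ndarray, p: np.ndarray) -> tuple[float, float]:
    """Spiegelhalter's z-test; H0: the predicted risks are well calibrated."""
    num = np.sum((y - p) * (1 - 2 * p))
    var = np.sum(((1 - 2 * p) ** 2) * p * (1 - p))
    z = num / np.sqrt(var)
    return float(z), float(2 * (1 - norm.cdf(abs(z))))  # two-sided p-value

# Simulated, well-calibrated predictions: outcomes drawn at the predicted risk.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.6, size=500)   # predicted risks of LDD (illustrative)
y = rng.binomial(1, p)                 # observed outcomes
print("Brier:", brier_score(y, p), "Spiegelhalter (z, p):", spiegelhalter_z(y, p))
```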
Affiliation(s)
- Billy Ogwel
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Department of Information Systems, University of South Africa, Pretoria, South Africa
- Vincent H Mzazi
- Department of Information Systems, University of South Africa, Pretoria, South Africa
- Alex O Awuor
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Caleb Okonji
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Raphael O Anyango
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Caren Oreso
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- John B Ochieng
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Stephen Munga
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Dilruba Nasrin
- Department of Medicine, Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Kirkby D Tickell
- Department of Global Health, University of Washington, Seattle, USA
- Karen L Kotloff
- Department of Medicine, Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Richard Omore
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
6. Dorfner FJ, Patel JB, Kalpathy-Cramer J, Gerstner ER, Bridge CP. A review of deep learning for brain tumor analysis in MRI. NPJ Precis Oncol 2025; 9:2. PMID: 39753730; PMCID: PMC11698745; DOI: 10.1038/s41698-024-00789-2.
Abstract
Recent progress in deep learning (DL) is producing a new generation of tools across numerous clinical applications. Within the analysis of brain tumors in magnetic resonance imaging, DL finds applications in tumor segmentation, quantification, and classification. It facilitates objective and reproducible measurements crucial for diagnosis, treatment planning, and disease monitoring. Furthermore, it holds the potential to pave the way for personalized medicine through the prediction of tumor type, grade, genetic mutations, and patient survival outcomes. In this review, we explore the transformative potential of DL for brain tumor care and discuss existing applications, limitations, and future directions and opportunities.
Affiliation(s)
- Felix J Dorfner
- Athinoula A. Martinos Center for Biomedical Imaging, 149 13th St, Charlestown, MA, 02129, USA
- Jay B Patel
- Athinoula A. Martinos Center for Biomedical Imaging, 149 13th St, Charlestown, MA, 02129, USA
- Elizabeth R Gerstner
- Athinoula A. Martinos Center for Biomedical Imaging, 149 13th St, Charlestown, MA, 02129, USA
- Massachusetts General Hospital Cancer Center, Boston, MA, 02114, USA
- Christopher P Bridge
- Athinoula A. Martinos Center for Biomedical Imaging, 149 13th St, Charlestown, MA, 02129, USA
7. Ogwel B, Mzazi VH, Awuor AO, Okonji C, Anyango RO, Oreso C, Ochieng JB, Munga S, Nasrin D, Tickell KD, Pavlinac PB, Kotloff KL, Omore R. Predictive modelling of linear growth faltering among pediatric patients with diarrhea in rural western Kenya: an explainable machine learning approach. BMC Med Inform Decis Mak 2024; 24:368. PMID: 39623435; PMCID: PMC11613762; DOI: 10.1186/s12911-024-02779-7.
Abstract
INTRODUCTION Stunting affects one-fifth of children globally, with diarrhea accounting for an estimated 13.5% of stunting. Identifying risk factors for its precursor, linear growth faltering (LGF), is critical to designing interventions. Moreover, developing new predictive models for LGF using more recent data offers an opportunity to enhance model accuracy and interpretability and to capture new insights. We employed machine learning (ML) to derive and validate a predictive model for LGF among children enrolled with diarrhea in the Vaccine Impact on Diarrhea in Africa (VIDA) study and the Enterics for Global Health (EFGH) Shigella study in rural western Kenya. METHODS We used 7 diverse ML algorithms to retrospectively build prognostic models for the prediction of LGF (≥ 0.5 decrease in height/length-for-age z-score [HAZ]) among children aged 6-35 months. We used de-identified data from the VIDA study (n = 1,106) combined with synthetic data (n = 8,894) in model development, which entailed split-sampling and K-fold cross-validation with an over-sampling technique, and data from the EFGH-Shigella study (n = 655) for temporal validation. Potential predictors (n = 65), comprising demographic and household-level characteristics, illness history, and anthropometric and clinical data, were identified using Boruta feature selection, with an explainable model analysis used to enhance interpretability. RESULTS The prevalence of LGF in the development and temporal validation cohorts was 187 (16.9%) and 147 (22.4%), respectively. Feature selection identified the following 6 variables used in model development, ranked by importance: age (16.6%), temperature (6.0%), respiratory rate (4.1%), severe acute malnutrition (3.4%), rotavirus vaccination (3.3%), and skin turgor (2.1%). While all models showed good predictive capability, the gradient boosting model achieved the best performance (area under the curve % [95% Confidence Interval]: 83.5 [81.6-85.4] and 65.6 [60.8-70.4] on the development and temporal validation datasets, respectively). CONCLUSION Our findings accentuate the enduring relevance of established predictors of LGF whilst demonstrating the practical utility of ML algorithms for rapid identification of at-risk children.
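Boruta feature selection, as used here, compares each candidate predictor against randomly shuffled "shadow" copies of itself and keeps only features that consistently beat the best shadow. The sketch below is a simplified Boruta-style loop (the full algorithm adds a binomial test and iterative removal), using synthetic data in place of the enrolment dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the enrolment data: 65 candidate predictors.
X, y = make_classification(n_samples=1000, n_features=65, n_informative=6,
                           random_state=42)

rng = np.random.default_rng(42)
n_rounds = 20
hits = np.zeros(X.shape[1], dtype=int)
for _ in range(n_rounds):
    shadows = rng.permuted(X, axis=0)  # shuffle each column independently
    forest = RandomForestClassifier(n_estimators=200, max_depth=5, n_jobs=-1)
    forest.fit(np.hstack([X, shadows]), y)
    imp = forest.feature_importances_
    real, shadow = imp[:X.shape[1]], imp[X.shape[1]:]
    hits += real > shadow.max()        # did the real feature beat every shadow?

confirmed = np.where(hits >= 15)[0]    # kept in >= 75% of rounds (assumed cutoff)
print("confirmed predictors:", confirmed)
```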
Affiliation(s)
- Billy Ogwel
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Department of Information Systems, University of South Africa, Pretoria, South Africa
- Vincent H Mzazi
- Department of Information Systems, University of South Africa, Pretoria, South Africa
- Alex O Awuor
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Caleb Okonji
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Raphael O Anyango
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Caren Oreso
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- John B Ochieng
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Stephen Munga
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
- Dilruba Nasrin
- Department of Medicine, Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Kirkby D Tickell
- Department of Global Health, University of Washington, Seattle, USA
- Karen L Kotloff
- Department of Medicine, Center for Vaccine Development and Global Health, University of Maryland School of Medicine, Baltimore, MD, USA
- Richard Omore
- Kenya Medical Research Institute - Center for Global Health Research (KEMRI-CGHR), P.O Box 1578-40100, Kisumu, Kenya
8. Andersen ES, Birk-Korch JB, Hansen RS, Fly LH, Röttger R, Arcani DMC, Brasen CL, Brandslund I, Madsen JS. Monitoring performance of clinical artificial intelligence in health care: a scoping review. JBI Evid Synth 2024; 22:2423-2446. PMID: 39658865; PMCID: PMC11630661; DOI: 10.11124/jbies-24-00042.
Abstract
OBJECTIVE The objective of this review was to provide an overview of the diverse methods described, tested, or implemented for monitoring the performance of clinical artificial intelligence (AI) systems, while also summarizing the arguments given for or against these methods. INTRODUCTION The integration of AI in clinical decision-making is steadily growing. The performance of AI systems evolves over time, necessitating ongoing performance monitoring. However, the evidence on specific monitoring methods is sparse and heterogeneous. Thus, an overview of the evidence on this topic is warranted to guide further research on clinical AI monitoring. INCLUSION CRITERIA We included publications detailing metrics or statistical processes employed in systematic, continuous, or repeated initiatives aimed at evaluating or predicting the clinical performance of AI models with direct implications for patient management in health care. No limitations on language or publication date were enforced. METHODS We performed systematic database searches in MEDLINE (Ovid), Embase (Ovid), Scopus, and ProQuest Dissertations and Theses Global, supplemented by backward and forward citation searches and gray literature searches. Two or more independent reviewers conducted title and abstract screening, full-text evaluation, and data extraction using a tool developed by the authors. During extraction, the methods identified were divided into subcategories. The results are presented narratively and summarized in tables and graphs. RESULTS Thirty-nine sources of evidence were included in the review, the most abundant source types being opinion papers/narrative reviews (33%) and simulation studies (33%). One guideline on the topic was identified, offering limited guidance on specific metrics and statistical methods. The number of sources included increased year by year, with almost 4 times as many sources included in 2023 as in 2019. The most commonly reported performance metrics were traditional metrics from the medical literature, including the area under the receiver operating characteristic curve (AUROC), sensitivity, specificity, and predictive values, although few arguments were given supporting these choices. Some studies reported on metrics and statistical processing specifically designed to monitor clinical AI. CONCLUSION This review provides a summary of the methods described for monitoring AI in health care. It reveals a relative scarcity of evidence and guidance on the practical implementation of performance monitoring for clinical AI. This underscores the imperative for further research, discussion, and guidance regarding the specifics of implementing monitoring for clinical AI. The steady increase in the number of relevant sources published per year suggests that this area of research is gaining focus, and the amount of evidence and guidance available will likely increase significantly over the coming years. REVIEW REGISTRATION Open Science Framework https://osf.io/afkrn.
Affiliation(s)
- Eline Sandvig Andersen
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
- Johan Baden Birk-Korch
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
- Line Haugaard Fly
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
- Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
- Diana Maria Cespedes Arcani
- Department of Thoracic Surgery, the First Affiliated Hospital, School of Medicine, Zhejiang University, Hangzhou, China
- Claus Lohman Brasen
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
- Ivan Brandslund
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
- Jonna Skov Madsen
- Department of Biochemistry and Immunology, Lillebaelt Hospital – University Hospital of Southern Denmark, Vejle, Denmark
- Department of Regional Health Research, University of Southern Denmark, Lillebælt Hospital (Kolding and Vejle), Denmark
9. Mazurenko O, Hirsh AT, Harle CA, Shen J, McNamee C, Vest JR. Comparing the performance of screening surveys versus predictive models in identifying patients in need of health-related social need services in the emergency department. PLoS One 2024; 19:e0312193. PMID: 39565746; PMCID: PMC11578524; DOI: 10.1371/journal.pone.0312193.
Abstract
BACKGROUND Health-related social needs (HRSNs), such as housing instability, food insecurity, and financial strain, are increasingly prevalent among patients. Healthcare organizations must first correctly identify patients with HRSNs to refer them to appropriate services or offer resources to address those needs. Yet current identification methods are suboptimal, inconsistently applied, and cost prohibitive. Machine learning (ML) predictive modeling applied to existing data sources may be a solution for systematically and effectively identifying patients with HRSNs. The performance of ML predictive models using data from electronic health records (EHRs) and other sources has not been compared to other methods of identifying patients needing HRSN services. METHODS A screening questionnaire covering housing instability, food insecurity, transportation barriers, legal issues, and financial strain was administered to adult ED patients at a large safety-net hospital in the Midwestern United States (n = 1,101). We identified patients likely to need HRSN-related services within the next 30 days using positive indications from referrals, encounters, scheduling data, orders, or clinical notes. We built an XGBoost classification algorithm using responses from the screening questionnaire to predict HRSN needs (screening questionnaire model). Additionally, we extracted features from the past 12 months of existing EHR, administrative, and health information exchange data for the survey respondents and built ML predictive models with these EHR data using XGBoost (ML EHR model). Owing to concerns about potential bias, we built both the screening questionnaire model and the ML EHR model with and without demographic features. Models were assessed on the validation set using sensitivity, specificity, and Area Under the Curve (AUC) values, and compared using the DeLong test. RESULTS Almost half (41%) of the patients had a positive indicator for a likely HRSN service need within the next 30 days, as identified through referrals, encounters, scheduling data, orders, or clinical notes. The screening questionnaire model had suboptimal performance, with an AUC = 0.580 (95% CI = 0.546, 0.611). Including gender and age resulted in higher performance (AUC = 0.640; 95% CI = 0.609, 0.672). The ML EHR models performed better. Without age and gender, the ML EHR model had an AUC = 0.765 (95% CI = 0.737, 0.792); adding age and gender did not improve the model (AUC = 0.772; 95% CI = 0.744, 0.800). The screening questionnaire models indicated bias, with the highest performance for White non-Hispanic patients. The performance of the ML EHR-based model also differed by race and ethnicity. CONCLUSION ML predictive models leveraging several robust EHR data sources outperformed models using screening questions alone. Nevertheless, all models indicated biases. Additional work is needed to design predictive models that effectively identify all patients with HRSNs.
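A sketch of the ML EHR modeling step follows: an XGBoost classifier on tabular features with a bootstrap 95% CI for the AUC. The synthetic data, hyperparameters, and bootstrap CI are illustrative assumptions (the study compared models with the DeLong test, which is not reproduced here); the sketch assumes the xgboost package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Synthetic stand-in for EHR-derived features and a 30-day HRSN-need label.
X, y = make_classification(n_samples=1101, n_features=40, weights=[0.59, 0.41],
                           random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                      eval_metric="logloss")
model.fit(X_tr, y_tr)
p = model.predict_proba(X_va)[:, 1]

# Bootstrap 95% CI for the AUC.
rng = np.random.default_rng(0)
aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_va), len(y_va))
    if len(np.unique(y_va[idx])) == 2:   # need both classes in the resample
        aucs.append(roc_auc_score(y_va[idx], p[idx]))
print(f"AUC {roc_auc_score(y_va, p):.3f} "
      f"(95% CI {np.percentile(aucs, 2.5):.3f}-{np.percentile(aucs, 97.5):.3f})")
```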
Affiliation(s)
- Olena Mazurenko
- Department of Health Policy & Management, Indiana University Richard M. Fairbanks School of Public Health–Indianapolis, Indianapolis, Indiana, United States of America
- Regenstrief Institute, Indianapolis, Indiana, United States of America
- Adam T. Hirsh
- School of Science, Indiana University–Indianapolis, Indianapolis, Indiana, United States of America
- Christopher A. Harle
- Department of Health Policy & Management, Indiana University Richard M. Fairbanks School of Public Health–Indianapolis, Indianapolis, Indiana, United States of America
- Regenstrief Institute, Indianapolis, Indiana, United States of America
- Joanna Shen
- Regenstrief Institute, Indianapolis, Indiana, United States of America
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Cassidy McNamee
- Department of Health Policy & Management, Indiana University Richard M. Fairbanks School of Public Health–Indianapolis, Indianapolis, Indiana, United States of America
- Joshua R. Vest
- Department of Health Policy & Management, Indiana University Richard M. Fairbanks School of Public Health–Indianapolis, Indianapolis, Indiana, United States of America
- Regenstrief Institute, Indianapolis, Indiana, United States of America
10. Chinni BK, Manlhiot C. Emerging Analytical Approaches for Personalized Medicine Using Machine Learning in Pediatric and Congenital Heart Disease. Can J Cardiol 2024; 40:1880-1896. PMID: 39097187; DOI: 10.1016/j.cjca.2024.07.026.
Abstract
Precision medicine and personalized medicine, the processes by which patient management is tailored to individual circumstances, are now familiar terms to cardiologists, even though the field is still emerging. Whereas precision medicine relies most often on the underlying biology and pathophysiology of a patient's condition, personalized medicine relies on digital biomarkers generated through algorithms. Given the complexity of the underlying data, these digital biomarkers are most often generated through machine-learning algorithms. This review discusses a number of analytic considerations regarding the creation of digital biomarkers, including data preprocessing, time dependency and gating, dimensionality reduction, and novel methods in both supervised and unsupervised machine learning. Some of these considerations, such as sample size requirements and measurement of model performance, are particularly challenging in small and heterogeneous populations with rare outcomes, such as children with congenital heart disease. Finally, we review analytic considerations for the deployment of digital biomarkers in clinical settings, including the emerging field of clinical artificial intelligence (AI) operations, computational needs for deployment, efforts to increase the explainability of AI, algorithmic drift, and the need for distributed surveillance and federated learning. We conclude by discussing a recent simulation study suggesting that, despite these analytic challenges and complications, the use of digital biomarkers in managing clinical care might have substantial benefits for individual patient outcomes.
Affiliation(s)
- Bhargava K Chinni
- The Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA
- Cedric Manlhiot
- The Blalock-Taussig-Thomas Pediatric and Congenital Heart Center, Department of Pediatrics, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, Maryland, USA; Research Institute, SickKids Hospital, Department of Pediatrics, University of Toronto, Toronto, Ontario, Canada
11. Rajagopal A, Ayanian S, Ryu AJ, Qian R, Legler SR, Peeler EA, Issa M, Coons TJ, Kawamoto K. Machine Learning Operations in Health Care: A Scoping Review. Mayo Clin Proc Digit Health 2024; 2:421-437. PMID: 40206123; PMCID: PMC11975983; DOI: 10.1016/j.mcpdig.2024.06.009.
Abstract
The use of machine learning tools in health care is rapidly expanding. However, the processes that support these tools in deployment, that is, machine learning operations, are still emerging. The purpose of this work was not only to provide a comprehensive synthesis of the existing literature in the field but also to identify gaps and offer insights for adoption in clinical practice. A scoping review was conducted using the MEDLINE, PubMed, Google Scholar, Embase, and Scopus databases. We used MeSH and non-MeSH search terms to identify pertinent articles, with the authors performing 2 screening phases and assigning relevance scores. Of the 148 English-language articles eligible for inclusion, the 98 offering the most unique information were supplemented by 50 additional sources, yielding 148 references. From these 148 references, we distilled 7 key topic areas based on a synthesis of the available literature and its alignment with practitioner needs: machine learning model monitoring; automated retraining systems; ethics, equity, and bias; clinical workflow integration; infrastructure, human resources, and technology stack; regulatory considerations; and financial considerations. This review provides an overview of best practices and knowledge gaps in this domain in health care and identifies the strengths and weaknesses of the literature, which may be useful to health care machine learning practitioners and consumers.
Affiliation(s)
- Anjali Rajagopal
- Department of Medicine, Artificial Intelligence and Innovation, Mayo Clinic, Rochester, MN
- Shant Ayanian
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Alexander J. Ryu
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Ray Qian
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Sean R. Legler
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Eric A. Peeler
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Meltiady Issa
- Division of Hospital Internal Medicine, Department of Medicine, Mayo Clinic, Rochester, MN
- Trevor J. Coons
- Heart, Vascular and Thoracic Institute, Cleveland Clinic Abu Dhabi, United Arab Emirates
- Kensaku Kawamoto
- Department of Biomedical Informatics, University of Utah, Salt Lake City, UT
12. Nicora G, Catalano M, Bortolotto C, Achilli MF, Messana G, Lo Tito A, Consonni A, Cutti S, Comotto F, Stella GM, Corsico A, Perlini S, Bellazzi R, Bruno R, Preda L. Bayesian Networks in the Management of Hospital Admissions: A Comparison between Explainable AI and Black Box AI during the Pandemic. J Imaging 2024; 10:117. PMID: 38786571; PMCID: PMC11122655; DOI: 10.3390/jimaging10050117.
Abstract
Artificial Intelligence (AI) and Machine Learning (ML) approaches that can learn from large data sources have been identified as useful tools to support clinicians in their decision-making process; AI and ML implementations accelerated rapidly during the recent COVID-19 pandemic. However, many ML classifiers are "black boxes" to the end user, since their underlying reasoning process is often obscure. Additionally, the performance of such models suffers from poor generalization in the presence of dataset shifts. Here, we present a comparison between an explainable-by-design ("white box") model, a Bayesian Network (BN), and a black box model, a Random Forest, both developed with the aim of supporting clinicians of Policlinico San Matteo University Hospital in Pavia (Italy) during the triage of COVID-19 patients. Our aim is to evaluate whether the BN's predictive performance is comparable to that of a widely used but less explainable ML model such as Random Forest, and to test the generalization ability of both models across different waves of the pandemic.
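The generalization test described (training on one pandemic wave and evaluating on a later, shifted one) can be sketched generically. The synthetic "waves" below are illustrative assumptions; scikit-learn's Random Forest stands in for the black box model, and the Bayesian Network side is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)

def make_wave(n: int, shift: float = 0.0):
    """Synthetic triage data; `shift` mimics a dataset shift between waves."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, 10))
    logits = X[:, 0] + 0.5 * X[:, 1] - shift   # the outcome relation also drifts
    y = (logits + rng.normal(size=n) > 0).astype(int)
    return X, y

X1, y1 = make_wave(2000)              # first wave: training data
X2, y2 = make_wave(1000, shift=0.8)   # later wave: shifted population

rf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X1, y1)
print("same-wave AUC:", roc_auc_score(y1, rf.predict_proba(X1)[:, 1]))  # optimistic (training data)
print("next-wave AUC:", roc_auc_score(y2, rf.predict_proba(X2)[:, 1]))  # cross-wave generalization
```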
Affiliation(s)
- Giovanna Nicora
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, 27100 Pavia, Italy
- Michele Catalano
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Chandra Bortolotto
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Radiology Institute, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Marina Francesca Achilli
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Gaia Messana
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Antonio Lo Tito
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Alessio Consonni
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Sara Cutti
- Medical Direction, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Giulia Maria Stella
- Department of Internal Medicine and Therapeutics, University of Pavia, 27100 Pavia, Italy
- Unit of Respiratory Diseases, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Angelo Corsico
- Department of Internal Medicine and Therapeutics, University of Pavia, 27100 Pavia, Italy
- Unit of Respiratory Diseases, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Stefano Perlini
- Department of Internal Medicine and Therapeutics, University of Pavia, 27100 Pavia, Italy
- Department of Emergency, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, 27100 Pavia, Italy
- Raffaele Bruno
- Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Unit of Infectious Diseases, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
- Lorenzo Preda
- Diagnostic Imaging and Radiotherapy Unit, Department of Clinical, Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy
- Radiology Institute, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
13. Adleberg J, Benitez CL, Primiano N, Patel A, Mogel D, Kalra R, Adhia A, Berns M, Chin C, Tanghe S, Yi P, Zech J, Kohli A, Martin-Carreras T, Corcuera-Solano I, Huang M, Ngeow J. Fully Automated Measurement of the Insall-Salvati Ratio with Artificial Intelligence. J Imaging Inform Med 2024; 37:601-610. PMID: 38343226; PMCID: PMC11031523; DOI: 10.1007/s10278-023-00955-1.
Abstract
Patella alta (PA) and patella baja (PB) affect 1-2% of the world population but are often underreported, leading to potential complications such as osteoarthritis. The Insall-Salvati ratio (ISR) is commonly used to diagnose patellar height abnormalities, and artificial intelligence (AI) keypoint models show promising accuracy in measuring and detecting these abnormalities. Here, an AI keypoint model was developed and validated to study the Insall-Salvati ratio on a random population sample of lateral knee radiographs. After IRB approval, a keypoint model was trained and internally validated with 689 lateral knee radiographs from five sites in a multi-hospital urban healthcare system; a further 116 lateral knee radiographs from a sixth site were used for external validation. Distance error (mm), Pearson correlation, and Bland-Altman plots were used to evaluate model performance. On a random sample of 2647 different lateral knee radiographs, the mean and standard deviation were used to calculate the normal distribution of the ISR. The keypoint detection model had a mean distance error of 2.57 ± 2.44 mm on internal validation data and 2.73 ± 2.86 mm on external validation data. The Pearson correlation between labeled and predicted Insall-Salvati ratios was 0.82 [95% CI 0.76-0.86] on internal validation and 0.75 [0.66-0.82] on external validation. In the population sample of 2647 patients, the mean ISR was 1.11 ± 0.21, and patellar height abnormalities were underreported in radiology reports. AI keypoint models can consistently measure the ISR on knee radiographs. Future models may enable radiologists to study musculoskeletal measurements in larger population samples and enhance our understanding of normal and abnormal ranges.
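The ISR itself is a simple ratio of two keypoint distances (patellar tendon length over patellar length), and the reported distance error is the mean Euclidean distance between labeled and predicted keypoints. The sketch below shows both computations; the coordinates and keypoint names are hypothetical.

```python
import numpy as np

def euclid(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

def insall_salvati(tendon_origin, tendon_insertion, patella_sup, patella_inf) -> float:
    """ISR = patellar tendon length / patellar length (keypoints in mm)."""
    return euclid(tendon_origin, tendon_insertion) / euclid(patella_sup, patella_inf)

# Hypothetical predicted keypoints (x, y) in mm on a lateral knee radiograph.
pred = {"t0": np.array([102.0, 40.0]), "t1": np.array([98.0, 88.0]),
        "p0": np.array([100.0, 38.0]), "p1": np.array([104.0, 82.0])}
print("ISR:", insall_salvati(pred["t0"], pred["t1"], pred["p0"], pred["p1"]))

# Mean distance error between labeled and predicted keypoints (validation metric).
labeled = np.array([[101.0, 41.5], [97.0, 86.0], [99.5, 39.0], [103.0, 83.5]])
predicted = np.stack([pred["t0"], pred["t1"], pred["p0"], pred["p1"]])
print("mean distance error (mm):", np.linalg.norm(labeled - predicted, axis=1).mean())
```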
Affiliation(s)
- J Adleberg
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- C L Benitez
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- N Primiano
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- A Patel
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- D Mogel
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- R Kalra
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- A Adhia
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- M Berns
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- C Chin
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- S Tanghe
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- P Yi
- University of Maryland, Baltimore, MD, USA
- J Zech
- Columbia University Medical Center, New York, NY, USA
- A Kohli
- UT Southwestern, Dallas, TX, USA
- I Corcuera-Solano
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- M Huang
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- J Ngeow
- Department of Radiology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
14. Kore A, Abbasi Bavil E, Subasri V, Abdalla M, Fine B, Dolatabadi E, Abdalla M. Empirical data drift detection experiments on real-world medical imaging data. Nat Commun 2024; 15:1887. PMID: 38424096; PMCID: PMC10904813; DOI: 10.1038/s41467-024-46142-w.
Abstract
While it is common to monitor deployed clinical artificial intelligence (AI) models for performance degradation, it is less common for the input data to be monitored for data drift: systemic changes to input distributions. However, when real-time evaluation is impractical (e.g., because of labeling costs) or when gold labels are automatically generated, we argue that tracking data drift becomes a vital addition to AI deployments. In this work, we perform empirical experiments on real-world medical imaging to evaluate the ability of three data drift detection methods to detect data drift caused (a) naturally (the emergence of COVID-19 in X-rays) and (b) synthetically. We find that monitoring performance alone is not a good proxy for detecting data drift and that drift detection depends heavily on sample size and patient features. Our work discusses the need for and utility of data drift detection in various scenarios and highlights gaps in knowledge for the practical application of existing methods.
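One common drift-detection pattern consistent with this setting is a two-sample test comparing a reference window of an input feature against incoming data; the sketch below uses a Kolmogorov-Smirnov test. The feature and windows are simulated, and the specific detectors evaluated in the paper are not reproduced here. The final loop illustrates the sample-size dependence the abstract highlights.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference window: a model input feature (e.g., a learned image embedding
# dimension or mean pixel intensity) from the pre-deployment period.
reference = rng.normal(loc=0.0, scale=1.0, size=2000)

# Incoming window: the same feature after a (simulated) drift event.
incoming = rng.normal(loc=0.4, scale=1.1, size=300)

stat, p_value = ks_2samp(reference, incoming)
if p_value < 0.01:
    print(f"drift flagged (KS={stat:.3f}, p={p_value:.2g})")

# Detection power depends strongly on how many incoming samples are available.
for n in (10, 50, 300):
    print(n, ks_2samp(reference, incoming[:n]).pvalue)
```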
Affiliation(s)
- Ali Kore
- Vector Institute, Toronto, Canada
- Vallijah Subasri
- Peter Munk Cardiac Center, University Health Network, Toronto, ON, Canada
- Moustafa Abdalla
- Department of Surgery, Harvard Medical School, Massachusetts General Hospital, Boston, USA
- Benjamin Fine
- Institute for Better Health, Trillium Health Partners, Mississauga, Canada
- Department of Medical Imaging, University of Toronto, Toronto, Canada
- Elham Dolatabadi
- Vector Institute, Toronto, Canada
- School of Health Policy and Management, Faculty of Health, York University, Toronto, Canada
- Mohamed Abdalla
- Institute for Better Health, Trillium Health Partners, Mississauga, Canada
15. Kagerbauer SM, Ulm B, Podtschaske AH, Andonov DI, Blobner M, Jungwirth B, Graessner M. Susceptibility of AutoML mortality prediction algorithms to model drift caused by the COVID pandemic. BMC Med Inform Decis Mak 2024; 24:34. PMID: 38308256; PMCID: PMC10837894; DOI: 10.1186/s12911-024-02428-z.
Abstract
BACKGROUND Concept drift and covariate shift lead to degradation of machine learning (ML) models. The objective of our study was to characterize the sudden data drift caused by the COVID pandemic and to investigate the suitability of certain training methods for preventing model degradation caused by data drift. METHODS We trained different ML models with the H2O AutoML method on a dataset comprising 102,666 cases of surgical patients collected in the years 2014-2019 to predict postoperative mortality using preoperatively available data. The models applied were a Generalized Linear Model with regularization, Default Random Forest, Gradient Boosting Machine, eXtreme Gradient Boosting, Deep Learning, and Stacked Ensembles comprising all base models. Further, we modified the original models by applying three different methods when training on the original pre-pandemic dataset: (1) weighting older data less strongly, (2) using only the most recent data for model training, and (3) performing a z-transformation of the numerical input parameters. Afterwards, we tested model performance on pre-pandemic and in-pandemic datasets not used in the training process and analysed common features. RESULTS The models produced showed excellent areas under the receiver-operating characteristic curve and acceptable precision-recall curves when tested on a dataset from January-March 2020, but significant degradation when tested on a dataset collected in the first wave of the COVID pandemic from April-May 2020. Comparing the probability distributions of the input parameters revealed significant differences between pre-pandemic and in-pandemic data. The endpoint of our models, in-hospital mortality after surgery, did not differ significantly between pre- and in-pandemic data and was about 1% in each case. However, the models varied considerably in the composition of their input parameters. None of the applied modifications prevented a loss of performance, although very different models emerged from them, using a large variety of parameters. CONCLUSIONS Our results show that none of the tested, easy-to-implement measures in model training can prevent deterioration in the case of sudden external events. Therefore, we conclude that, in the presence of concept drift and covariate shift, close monitoring and critical review of model predictions are necessary.
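The three training modifications can be sketched generically; the sketch below uses scikit-learn with a logistic regression standing in for the H2O AutoML models, and synthetic data with a roughly 1% outcome rate. The decay factor, year cutoff, and feature relation are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 8))                               # preoperative features
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - 4.6))))   # ~1% mortality rate
years = rng.integers(2014, 2020, size=n)                  # admission year per case

# (1) Weight older data less strongly: exponential decay toward the newest year.
sample_weight = 0.8 ** (years.max() - years)

# (2) Alternative: train only on the most recent data, e.g., fit on X[recent].
recent = years >= 2018

# (3) z-transform the numerical input parameters.
Xz = StandardScaler().fit_transform(X)

model = LogisticRegression(max_iter=1000).fit(Xz, y, sample_weight=sample_weight)
print("predicted mean risk:", model.predict_proba(Xz)[:, 1].mean())
```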
Collapse
Affiliation(s)
- Simone Maria Kagerbauer
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany.
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, University of Ulm, Albert-Einstein-Allee 23, Ulm, 89081, Germany.
| | - Bernhard Ulm
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, University of Ulm, Albert-Einstein-Allee 23, Ulm, 89081, Germany
| | - Armin Horst Podtschaske
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany
| | - Dimislav Ivanov Andonov
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany
| | - Manfred Blobner
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, University of Ulm, Albert-Einstein-Allee 23, Ulm, 89081, Germany
| | - Bettina Jungwirth
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, University of Ulm, Albert-Einstein-Allee 23, Ulm, 89081, Germany
| | - Martin Graessner
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, Technical University of Munich, Munich, Germany
- Department of Anaesthesiology and Intensive Care Medicine, School of Medicine, University of Ulm, Albert-Einstein-Allee 23, Ulm, 89081, Germany
| |
Collapse
|
16
|
Schinkel M, Boerman AW, Paranjape K, Wiersinga WJ, Nanayakkara PWB. Detecting changes in the performance of a clinical machine learning tool over time. EBioMedicine 2023; 97:104823. [PMID: 37793210 PMCID: PMC10550508 DOI: 10.1016/j.ebiom.2023.104823] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2023] [Revised: 09/21/2023] [Accepted: 09/21/2023] [Indexed: 10/06/2023] Open
Abstract
BACKGROUND Excessive use of blood cultures (BCs) in Emergency Departments (EDs) results in low yields and high contamination rates, associated with increased antibiotic use and unnecessary diagnostics. Our team previously developed and validated a machine learning model to predict BC outcomes and enhance diagnostic stewardship. While the model showed promising initial results, concerns over performance drift due to evolving patient demographics, clinical practices, and outcome rates warrant continual monitoring and evaluation of such models. METHODS A real-time evaluation of the model's performance was conducted between October 2021 and September 2022. The model was integrated into Amsterdam UMC's Electronic Health Record system, predicting BC outcomes for all adult patients with BC draws in real time. The model's performance was assessed monthly using metrics including the Area Under the Curve (AUC), Area Under the Precision-Recall Curve (AUPRC), and Brier scores. Statistical Process Control (SPC) charts were used to monitor variation over time. FINDINGS Across 3035 unique adult patient visits, the model achieved an average AUC of 0.78, AUPRC of 0.41, and a Brier score of 0.10 for predicting the outcome of BCs drawn in the ED. While specific population characteristics changed over time, no points outside the statistical control limits were detected for the AUC, AUPRC, or Brier scores, indicating stable model performance. The average BC positivity rate during the study period was 13.4%. INTERPRETATION Despite significant changes in clinical practice, our BC stewardship tool exhibited stable performance, suggesting its robustness to changing environments. Using SPC charts for various metrics enables simple and effective monitoring of potential performance drift. Assessing variation in outcome rates and population changes may guide the specific interventions, such as intercept correction or recalibration, needed to maintain stable model performance over time. This study suggested no need to recalibrate or correct our BC stewardship tool. FUNDING No funding to disclose.
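As a sketch of the monitoring approach described above (not the authors' exact pipeline), monthly AUC, AUPRC, and Brier scores can be tracked against 3-sigma Shewhart control limits. The DataFrame layout with `month`, `y_true`, and `y_prob` columns, and estimating the limits from all observed months rather than a fixed baseline period, are assumptions.

```python
# Hedged sketch: monthly performance metrics with 3-sigma Shewhart control
# limits, in the spirit of the SPC monitoring described in the abstract.
# Input columns 'month', 'y_true', 'y_prob' are assumed.
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def monthly_spc(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for month, g in df.groupby("month"):
        rows.append({
            "month": month,
            "auc": roc_auc_score(g["y_true"], g["y_prob"]),
            "auprc": average_precision_score(g["y_true"], g["y_prob"]),
            "brier": brier_score_loss(g["y_true"], g["y_prob"]),
        })
    spc = pd.DataFrame(rows).set_index("month")
    for m in ["auc", "auprc", "brier"]:
        mu, sigma = spc[m].mean(), spc[m].std()
        # Flag months outside the 3-sigma control limits as possible drift.
        spc[f"{m}_out_of_control"] = (spc[m] - mu).abs() > 3 * sigma
    return spc
```

A month flagged here would prompt the kind of critical review and possible recalibration the authors recommend, rather than automatic retraining.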
Collapse
Affiliation(s)
- Michiel Schinkel
- Center for Experimental and Molecular Medicine (CEMM), Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands; Division of Acute Medicine, Department of Internal Medicine, Amsterdam UMC, VU University, Amsterdam, the Netherlands.
| | - Anneroos W Boerman
- Division of Acute Medicine, Department of Internal Medicine, Amsterdam UMC, VU University, Amsterdam, the Netherlands; Department of Clinical Chemistry, Amsterdam UMC, VU University, Amsterdam, the Netherlands
| | - Ketan Paranjape
- Division of Acute Medicine, Department of Internal Medicine, Amsterdam UMC, VU University, Amsterdam, the Netherlands
| | - W Joost Wiersinga
- Center for Experimental and Molecular Medicine (CEMM), Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands; Division of Infectious Diseases, Department of Internal Medicine, Amsterdam UMC, University of Amsterdam, Amsterdam, the Netherlands
| | - Prabath W B Nanayakkara
- Division of Acute Medicine, Department of Internal Medicine, Amsterdam UMC, VU University, Amsterdam, the Netherlands
| |
Collapse
|
17
|
Li Y, Wang Y. Temporal convolution attention model for sepsis clinical assistant diagnosis prediction. Math Biosci Eng 2023; 20:13356-13378. [PMID: 37501491 DOI: 10.3934/mbe.2023595] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/29/2023]
Abstract
Sepsis is a life-threatening organ dysfunction caused by infection that frequently affects patients in the intensive care unit (ICU) and leads to a high mortality rate. Developing intelligent monitoring and early warning systems for sepsis is a key research area in the field of smart healthcare. Early and accurate identification of patients at high risk of sepsis can help doctors make the best clinical decisions and reduce the mortality rate of patients with sepsis. However, the scientific understanding of sepsis remains inadequate, leading to slow progress in sepsis research. With the accumulation of electronic medical records (EMRs) in hospitals, data mining technologies that can identify patient risk patterns from the vast amount of sepsis-related EMRs, together with the development of smart surveillance and early warning models, show promise in reducing mortality. Based on the Medical Information Mart for Intensive Care III, a massive dataset of ICU EMRs published by MIT and Beth Israel Deaconess Medical Center, we propose a Temporal Convolution Attention Model for Sepsis Clinical Assistant Diagnosis Prediction (TCASP) to predict the incidence of sepsis infection in ICU patients. First, sepsis patient data are extracted from the EMRs. Then, the incidence of sepsis is predicted based on various physiological features of sepsis patients in the ICU. Finally, the TCASP model is utilized to predict the time of the first sepsis infection in ICU patients. The experiments show that the proposed model achieves an area under the receiver operating characteristic curve (AUROC) score of 86.9% (an improvement of 6.4%) and an area under the precision-recall curve (AUPRC) score of 63.9% (an improvement of 3.9%) compared to five state-of-the-art models.
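The sketch below illustrates, in PyTorch, the general shape of a temporal-convolution-plus-attention classifier in the spirit of TCASP. It is not the authors' architecture: the layer sizes, dilation schedule, padding/trimming scheme, and mean pooling are all assumptions for illustration.

```python
# Hedged sketch of a temporal-convolution + attention risk classifier
# (not the published TCASP architecture; sizes and pooling are assumed).
import torch
import torch.nn as nn

class TemporalConvAttention(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64, n_heads: int = 4):
        super().__init__()
        # Dilated 1-D convolutions capture local temporal patterns in vitals.
        self.tcn = nn.Sequential(
            nn.Conv1d(n_features, hidden, kernel_size=3, padding=2, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=2),
            nn.ReLU(),
        )
        # Self-attention re-weights time steps by relevance to the outcome.
        self.attn = nn.MultiheadAttention(hidden, n_heads, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, features); Conv1d expects (batch, features, time).
        h = self.tcn(x.transpose(1, 2)).transpose(1, 2)
        h = h[:, : x.size(1), :]          # trim the padding overhang back to T steps
        h, _ = self.attn(h, h, h)
        return torch.sigmoid(self.head(h.mean(dim=1)))  # sepsis risk in [0, 1]

# Example: 8 ICU time steps of 12 physiological features for 2 patients.
risk = TemporalConvAttention(n_features=12)(torch.randn(2, 8, 12))
```

Attention over the convolutional features lets the model emphasize the time steps most predictive of impending sepsis, which is the core idea the abstract attributes to the TCASP design.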
Collapse
Affiliation(s)
- Yong Li
- College of Computer Science and Engineering, Northwest Normal University, 967 Anning East Road, Lanzhou 730070, China
| | - Yang Wang
- College of Computer Science and Engineering, Northwest Normal University, 967 Anning East Road, Lanzhou 730070, China
| |
Collapse
|