1
Ferri P, Lomonaco V, Passaro LC, Félix-De Castro A, Sánchez-Cuesta P, Sáez C, García-Gómez JM. Deep continual learning for medical call incidents text classification under the presence of dataset shifts. Comput Biol Med 2024; 175:108548. [PMID: 38718666] [DOI: 10.1016/j.compbiomed.2024.108548]
Abstract
The aim of this work is to develop and evaluate a deep classifier that can effectively prioritize Emergency Medical Call Incidents (EMCI) according to their life-threatening level under the presence of dataset shifts. We utilized a dataset consisting of 1,982,746 independent EMCI instances obtained from the Health Services Department of the Region of Valencia (Spain), with a time span from 2009 to 2019 (excluding 2013). The dataset includes free-text dispatcher observations recorded during the call, as well as a binary variable indicating whether the event was life-threatening. To evaluate the presence of dataset shifts, we examined prior probability shifts, covariate shifts, and concept shifts. Subsequently, we designed and implemented four deep Continual Learning (CL) strategies (cumulative learning, continual fine-tuning, experience replay, and synaptic intelligence) alongside three deep CL baselines (joint training, static approach, and single fine-tuning), all based on DistilBERT models. Our results demonstrated evidence of prior probability shifts, covariate shifts, and concept shifts in the data. Applying CL techniques had a statistically significant (α=0.05) positive impact on both backward and forward knowledge transfer, as measured by the F1-score, compared to non-continual approaches. These results indicate that CL techniques are effective in adapting deep learning classifiers for EMCI to changes in data distributions, thereby maintaining the stability of model performance over time. To our knowledge, this study represents the first exploration of a CL approach using real EMCI data.
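Of the CL strategies named above, experience replay is the most compact to illustrate. The sketch below mixes stored examples from earlier periods into each fine-tuning batch of a DistilBERT classifier; the checkpoint name, buffer size, and hyperparameters are illustrative assumptions, not the configuration reported in the paper.

    import random
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    MODEL = "distilbert-base-multilingual-cased"  # assumed checkpoint; the paper's may differ
    tok = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
    opt = torch.optim.AdamW(model.parameters(), lr=2e-5)

    replay_buffer, BUFFER_SIZE = [], 5000  # (text, label) pairs kept from earlier periods

    def train_period(texts, labels, replay_fraction=0.5, batch_size=16):
        """One CL step: fine-tune on the new period while replaying old examples."""
        model.train()
        data = list(zip(texts, labels))
        random.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            k = int(len(batch) * replay_fraction)  # how many old examples to mix in
            if replay_buffer and k:
                batch = batch + random.sample(replay_buffer, min(k, len(replay_buffer)))
            xs, ys = zip(*batch)
            enc = tok(list(xs), padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
            loss = model(**enc, labels=torch.tensor(ys)).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
        # remember a random slice of the period just seen, bounding total memory
        replay_buffer.extend(random.sample(data, min(len(data), BUFFER_SIZE // 10)))
        del replay_buffer[:max(0, len(replay_buffer) - BUFFER_SIZE)]

Calling train_period once per data period approximates continual fine-tuning; the replay buffer is what counters forgetting of earlier incident distributions.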
Affiliation(s)
- Pablo Ferri: Biomedical Data Science Laboratory (BDSLab), Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València (UPV), Valencia, Spain.
- Vincenzo Lomonaco: Department of Computer Science, University of Pisa (Unipi), Pisa, Italy.
- Lucia C Passaro: Department of Computer Science, University of Pisa (Unipi), Pisa, Italy.
- Antonio Félix-De Castro: Conselleria de Sanitat Universal i Salut Pública, Generalitat Valenciana (GVA), Valencia, Spain.
- Carlos Sáez: Biomedical Data Science Laboratory (BDSLab), Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València (UPV), Valencia, Spain.
- Juan M García-Gómez: Biomedical Data Science Laboratory (BDSLab), Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València (UPV), Valencia, Spain.
2
Steinfeldt J, Wild B, Buergel T, Pietzner M, Upmeier Zu Belzen J, Vauvelle A, Hegselmann S, Denaxas S, Hemingway H, Langenberg C, Landmesser U, Deanfield J, Eils R. Medical history predicts phenome-wide disease onset and enables the rapid response to emerging health threats. Nat Commun 2024; 15:4257. [PMID: 38763986] [PMCID: PMC11102902] [DOI: 10.1038/s41467-024-48568-8]
Abstract
The COVID-19 pandemic exposed a global deficiency of systematic, data-driven guidance to identify high-risk individuals. Here, we illustrate the utility of routinely recorded medical history to predict the risk for 1883 diseases across clinical specialties and support the rapid response to emerging health threats such as COVID-19. We developed a neural network to learn from the health records of 502,460 UK Biobank participants. Importantly, we observed discriminative improvements over basic demographic predictors for 1774 (94.3%) endpoints. After transferring the unmodified risk models to the All of Us cohort, we replicated these improvements for 1347 (89.8%) of 1500 investigated endpoints, demonstrating generalizability across healthcare systems and historically underrepresented groups. Ultimately, we showed how this approach could have been used to identify individuals vulnerable to severe COVID-19. Our study demonstrates the potential of medical history to support guidance for emerging pandemics by systematically estimating risk for thousands of diseases at once at minimal cost.
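To make the evaluation design concrete, the sketch below contrasts a risk model built on prior-diagnosis indicators with a demographics-only baseline for a single endpoint, using AUROC as the discrimination metric. The data, feature encoding, and logistic model are simplified stand-ins for the paper's neural network over 1883 endpoints.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 5000
    demog = rng.normal(size=(n, 2))                        # stand-ins for age and sex
    history = (rng.random((n, 100)) < 0.05).astype(float)  # binary prior-diagnosis indicators

    def compare_endpoint(y):
        """AUROC of demographics alone vs demographics plus medical history."""
        Xd_tr, Xd_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
            demog, np.hstack([demog, history]), y, test_size=0.3, random_state=0)
        base = LogisticRegression(max_iter=1000).fit(Xd_tr, y_tr)
        full = LogisticRegression(max_iter=1000).fit(Xf_tr, y_tr)
        return (roc_auc_score(y_te, base.predict_proba(Xd_te)[:, 1]),
                roc_auc_score(y_te, full.predict_proba(Xf_te)[:, 1]))

    # one synthetic endpoint whose risk depends on a few history codes plus age
    logit = -3 + history[:, :3].sum(axis=1) + 0.5 * demog[:, 0]
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    print(compare_endpoint(y))  # (baseline AUROC, history-augmented AUROC)

Repeating this comparison across every endpoint, and then on an external cohort with the model frozen, mirrors the UK Biobank-to-All of Us transfer described above.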
Affiliation(s)
- Jakob Steinfeldt: Department of Cardiology, Angiology and Intensive Care Medicine, Deutsches Herzzentrum der Charité (DHZC), Berlin, Germany; Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; Computational Medicine, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany; Friede Springer Cardiovascular Prevention Center@Charité, Charité - Universitätsmedizin Berlin, Berlin, Germany; Institute of Cardiovascular Sciences, University College London, London, UK.
- Benjamin Wild: Center for Digital Health, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany.
- Thore Buergel: Institute of Cardiovascular Sciences, University College London, London, UK; Center for Digital Health, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany.
- Maik Pietzner: Computational Medicine, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany; MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, UK; Precision Health University Research Institute, Queen Mary University of London and Barts NHS Trust, London, UK.
- Julius Upmeier Zu Belzen: Center for Digital Health, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany.
- Andre Vauvelle: Institute of Health Informatics, University College London, London, UK.
- Stefan Hegselmann: Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA; Pattern Recognition and Image Analysis Lab, University of Münster, Münster, Germany.
- Spiros Denaxas: Institute of Health Informatics, University College London, London, UK; British Heart Foundation Data Science Centre, London, UK; Health Data Research UK, London, UK; National Institute for Health Research Biomedical Research Centre at University College London Hospitals, London, UK.
- Harry Hemingway: Institute of Health Informatics, University College London, London, UK; Health Data Research UK, London, UK; National Institute for Health Research Biomedical Research Centre at University College London Hospitals, London, UK.
- Claudia Langenberg: Computational Medicine, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany; MRC Epidemiology Unit, Institute of Metabolic Science, University of Cambridge, Cambridge, UK; Precision Health University Research Institute, Queen Mary University of London and Barts NHS Trust, London, UK.
- Ulf Landmesser: Department of Cardiology, Angiology and Intensive Care Medicine, Deutsches Herzzentrum der Charité (DHZC), Berlin, Germany; Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; Friede Springer Cardiovascular Prevention Center@Charité, Charité - Universitätsmedizin Berlin, Berlin, Germany; Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany; DZHK (German Centre for Cardiovascular Research), Partner Site Berlin, Berlin, Germany.
- John Deanfield: Institute of Cardiovascular Sciences, University College London, London, UK.
- Roland Eils: Center for Digital Health, Berlin Institute of Health (BIH), Charité - Universitätsmedizin Berlin, Berlin, Germany; Health Data Science Unit, Heidelberg University Hospital and BioQuant, Heidelberg, Germany.
3
Nernekli K, Persad AR, Hori YS, Yener U, Celtikci E, Sahin MC, Sozer A, Sozer B, Park DJ, Chang SD. Automatic Segmentation of Vestibular Schwannomas: A Systematic Review. World Neurosurg 2024; 188:35-44. [PMID: 38685346] [DOI: 10.1016/j.wneu.2024.04.145]
Abstract
BACKGROUND Vestibular schwannomas (VSs) are benign tumors often monitored over time, with measurement techniques for assessing growth rates subject to significant interobserver variability. Automatic segmentation of these tumors could provide a more reliable and efficient method for tracking their progression, especially given the irregular shape and growth patterns of VS. METHODS Various studies and segmentation techniques employing different Convolutional Neural Network architectures and models, such as U-Net and convolutional-attention transformer segmentation, were analyzed. Models were evaluated based on their performance across diverse datasets, and challenges, including domain shift and data sharing, were scrutinized. RESULTS Automatic segmentation methods offer a promising alternative to conventional measurement techniques, with potential benefits in precision and efficiency. However, these methods are not without challenges, notably the "domain shift" that occurs when models trained on specific datasets underperform when applied to different datasets. Techniques such as domain adaptation, domain generalization, and data diversity were discussed as potential solutions. CONCLUSIONS Accurate measurement of VS growth is a complex process, with volumetric analysis currently appearing more reliable than linear measurements. Automatic segmentation, despite its challenges, offers a promising avenue for future investigation. Robust, well-generalized models could potentially improve the efficiency of tracking tumor growth, thereby augmenting clinical decision-making. Further work needs to be done to develop more robust models, address the domain shift, and enable secure data sharing for wider applicability.
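Since U-Net is the architecture most often benchmarked in the reviewed studies, a deliberately small 2D variant is sketched below; channel widths, depth, and the single-modality input are assumptions chosen for brevity rather than any reviewed model's configuration.

    import torch
    import torch.nn as nn

    def block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    class TinyUNet(nn.Module):
        """Two-level encoder-decoder with skip connections for binary tumor masks."""
        def __init__(self, c=16):
            super().__init__()
            self.enc1, self.enc2 = block(1, c), block(c, 2 * c)
            self.pool = nn.MaxPool2d(2)
            self.mid = block(2 * c, 4 * c)
            self.up2 = nn.ConvTranspose2d(4 * c, 2 * c, 2, stride=2)
            self.dec2 = block(4 * c, 2 * c)
            self.up1 = nn.ConvTranspose2d(2 * c, c, 2, stride=2)
            self.dec1 = block(2 * c, c)
            self.head = nn.Conv2d(c, 1, 1)  # per-pixel tumor logit

        def forward(self, x):
            e1 = self.enc1(x)
            e2 = self.enc2(self.pool(e1))
            m = self.mid(self.pool(e2))
            d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))  # skip connection
            d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
            return self.head(d1)

    x = torch.randn(1, 1, 128, 128)   # one MRI slice
    print(TinyUNet()(x).shape)        # torch.Size([1, 1, 128, 128])

Summing predicted mask voxels across slices is what turns such a segmenter into the volumetric growth measurements the review favors over linear ones.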
Affiliation(s)
- Kerem Nernekli: Department of Radiology, Stanford University School of Medicine, Stanford, California, USA.
- Amit R Persad: Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA.
- Yusuke S Hori: Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA.
- Ulas Yener: Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA.
- Emrah Celtikci: Department of Neurosurgery, Gazi University, Ankara, Turkey.
- Alperen Sozer: Department of Neurosurgery, Gazi University, Ankara, Turkey.
- Batuhan Sozer: Department of Neurosurgery, Gazi University, Ankara, Turkey.
- David J Park: Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA.
- Steven D Chang: Department of Neurosurgery, Stanford University School of Medicine, Stanford, California, USA.
4
Guo LL, Morse KE, Aftandilian C, Steinberg E, Fries J, Posada J, Fleming SL, Lemmon J, Jessa K, Shah N, Sung L. Characterizing the limitations of using diagnosis codes in the context of machine learning for healthcare. BMC Med Inform Decis Mak 2024; 24:51. [PMID: 38355486] [PMCID: PMC10868117] [DOI: 10.1186/s12911-024-02449-8]
Abstract
BACKGROUND Diagnostic codes are commonly used as inputs for clinical prediction models, to create labels for prediction tasks, and to identify cohorts for multicenter network studies. However, the coverage rates of diagnostic codes and their variability across institutions are underexplored. The primary objective was to describe lab- and diagnosis-based labels for seven selected outcomes at three institutions. Secondary objectives were to describe the agreement, sensitivity, and specificity of diagnosis-based labels against lab-based labels. METHODS This study included three cohorts: SickKids from The Hospital for Sick Children, and StanfordPeds and StanfordAdults from Stanford Medicine. We included seven clinical outcomes with lab-based definitions: acute kidney injury, hyperkalemia, hypoglycemia, hyponatremia, anemia, neutropenia, and thrombocytopenia. For each outcome, we created four lab-based labels (abnormal, mild, moderate, and severe) based on test results and one diagnosis-based label. The proportion of admissions with a positive label was reported for each outcome stratified by cohort. Using lab-based labels as the gold standard, agreement (Cohen's kappa), sensitivity, and specificity were calculated for each lab-based severity level. RESULTS The number of admissions included were: SickKids (n = 59,298), StanfordPeds (n = 24,639) and StanfordAdults (n = 159,985). The proportion of admissions with a positive diagnosis-based label was significantly higher for StanfordPeds compared to SickKids across all outcomes, with odds ratios (99.9% confidence interval) for the abnormal diagnosis-based label ranging from 2.2 (1.7-2.7) for neutropenia to 18.4 (10.1-33.4) for hyperkalemia. Lab-based labels were more similar by institution. When using lab-based labels as the gold standard, Cohen's kappa and sensitivity were lower at SickKids for all severity levels compared to StanfordPeds. CONCLUSIONS Across multiple outcomes, diagnosis codes were consistently different between the two pediatric institutions. This difference was not explained by differences in test results. These results may have implications for machine learning model development and deployment.
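The agreement analysis described above reduces to a small computation per outcome and severity level; a sketch with sklearn, using toy labels in place of the cohorts' admission data:

    import numpy as np
    from sklearn.metrics import cohen_kappa_score, confusion_matrix

    def label_agreement(lab_label, dx_label):
        """Kappa, sensitivity, and specificity of a diagnosis-based label
        against a lab-based gold standard."""
        kappa = cohen_kappa_score(lab_label, dx_label)
        tn, fp, fn, tp = confusion_matrix(lab_label, dx_label, labels=[0, 1]).ravel()
        return kappa, tp / (tp + fn), tn / (tn + fp)

    # e.g. hyperkalemia: lab-based "abnormal" label vs coded diagnosis per admission
    lab = np.array([1, 1, 0, 0, 1, 0, 0, 1])
    dx = np.array([1, 0, 0, 0, 1, 0, 1, 0])
    print(label_agreement(lab, dx))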
Affiliation(s)
- Lin Lawrence Guo: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada.
- Keith E Morse: Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA, USA.
- Catherine Aftandilian: Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA, USA.
- Ethan Steinberg: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Jason Fries: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Jose Posada: Universidad del Norte, Barranquilla, Colombia.
- Scott Lanyon Fleming: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Joshua Lemmon: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada.
- Karim Jessa: Information Services, The Hospital for Sick Children, Toronto, ON, Canada.
- Nigam Shah: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Lillian Sung: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada; Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON M5G 1X8, Canada.
5
Bhaskhar N, Ip W, Chen JH, Rubin DL. Clinical outcome prediction using observational supervision with electronic health records and audit logs. J Biomed Inform 2023; 147:104522. [PMID: 37827476] [DOI: 10.1016/j.jbi.2023.104522]
Abstract
OBJECTIVE Audit logs in electronic health record (EHR) systems capture interactions of providers with clinical data. We determine if machine learning (ML) models trained using audit logs in conjunction with clinical data ("observational supervision") outperform ML models trained using clinical data alone in clinical outcome prediction tasks, and whether they are more robust to temporal distribution shifts in the data. MATERIALS AND METHODS Using clinical and audit log data from Stanford Healthcare, we trained and evaluated various ML models including logistic regression, support vector machine (SVM) classifiers, neural networks, random forests, and gradient boosted machines (GBMs) on clinical EHR data, with and without audit logs for two clinical outcome prediction tasks: major adverse kidney events within 120 days of ICU admission (MAKE-120) in acute kidney injury (AKI) patients and 30-day readmission in acute stroke patients. We further tested the best performing models using patient data acquired during different time-intervals to evaluate the impact of temporal distribution shifts on model performance. RESULTS Performance generally improved for all models when trained with clinical EHR data and audit log data compared with those trained with only clinical EHR data, with GBMs tending to have the overall best performance. GBMs trained with clinical EHR data and audit logs outperformed GBMs trained without audit logs in both clinical outcome prediction tasks: AUROC 0.88 (95% CI: 0.85-0.91) vs. 0.79 (95% CI: 0.77-0.81), respectively, for MAKE-120 prediction in AKI patients, and AUROC 0.74 (95% CI: 0.71-0.77) vs. 0.63 (95% CI: 0.62-0.64), respectively, for 30-day readmission prediction in acute stroke patients. The performance of GBM models trained using audit log and clinical data degraded less in later time-intervals than models trained using only clinical data. CONCLUSION Observational supervision with audit logs improved the performance of ML models trained to predict important clinical outcomes in patients with AKI and acute stroke, and improved robustness to temporal distribution shifts.
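A minimal version of the with/without-audit-logs comparison can be expressed as two gradient-boosting fits over concatenated feature blocks; the synthetic features below (chart-access counts as audit-log aggregates) are assumptions standing in for the Stanford data.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(1)
    n = 4000
    clinical = rng.normal(size=(n, 20))                 # labs, vitals, demographics
    audit = rng.poisson(3, size=(n, 10)).astype(float)  # e.g. chart-access counts by provider role
    logit = clinical[:, 0] + 0.8 * (audit[:, 0] > 4) - 1.0
    y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

    for name, X in [("clinical only", clinical),
                    ("clinical + audit", np.hstack([clinical, audit]))]:
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        gbm = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
        print(name, roc_auc_score(y_te, gbm.predict_proba(X_te)[:, 1]))

Evaluating the same fitted models on patient batches from successive time intervals would reproduce the paper's temporal-robustness check.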
Affiliation(s)
- Nandita Bhaskhar: Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
- Wui Ip: Department of Pediatrics, Stanford School of Medicine, Palo Alto, CA 94305, USA.
- Jonathan H Chen: Center for Biomedical Informatics Research, Stanford University, Stanford, CA 94305, USA; Division of Hospital Medicine, Stanford School of Medicine, Palo Alto, CA 94305, USA; Clinical Excellence Research Center, Stanford School of Medicine, Palo Alto, CA 94305, USA.
- Daniel L Rubin: Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA; Department of Radiology, Stanford University, Stanford, CA 94305, USA; Department of Medicine, Stanford School of Medicine, Palo Alto, CA 94305, USA.
6
Xie K, Terman SW, Gallagher RS, Hill CE, Davis KA, Litt B, Roth D, Ellis CA. Generalization of finetuned transformer language models to new clinical contexts. JAMIA Open 2023; 6:ooad070. [PMID: 37600072] [PMCID: PMC10432353] [DOI: 10.1093/jamiaopen/ooad070]
Abstract
Objective We have previously developed a natural language processing pipeline using clinical notes written by epilepsy specialists to extract seizure freedom, seizure frequency text, and date of last seizure text for patients with epilepsy. It is important to understand how our methods generalize to new care contexts. Materials and methods We evaluated our pipeline on unseen notes from nonepilepsy-specialist neurologists and non-neurologists without any additional algorithm training. We tested the pipeline out-of-institution using epilepsy specialist notes from an outside medical center with only minor preprocessing adaptations. We examined reasons for discrepancies in performance in new contexts by measuring physical and semantic similarities between documents. Results Our ability to classify patient seizure freedom decreased by at least 0.12 agreement when moving from epilepsy specialists to nonspecialists or other institutions. On notes from our institution, textual overlap between the extracted outcomes and the gold standard annotations attained from manual chart review decreased by at least 0.11 F1 when an answer existed but did not change when no answer existed; here our models generalized on notes from the outside institution, losing at most 0.02 agreement. We analyzed textual differences and found that syntactic and semantic differences in both clinically relevant sentences and surrounding contexts significantly influenced model performance. Discussion and conclusion Model generalization performance decreased on notes from nonspecialists; out-of-institution generalization on epilepsy specialist notes required small changes to preprocessing but was especially good for seizure frequency text and date of last seizure text, opening opportunities for multicenter collaborations using these outcomes.
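One of the diagnostics mentioned above, measuring how similar new notes are to the training distribution, can be approximated with document similarity; the sketch below uses TF-IDF cosine similarity on invented note fragments (the paper measured both physical and semantic similarity, so this particular representation is a simplification).

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    specialist = ["Seizure free since last visit, continued on levetiracetam.",
                  "Last seizure two months ago; frequency reduced to monthly."]
    nonspecialist = ["Followed for epilepsy, no events reported recently.",
                     "Admitted for chest pain; history of seizures, none witnessed."]

    vec = TfidfVectorizer().fit(specialist + nonspecialist)
    sim = cosine_similarity(vec.transform(specialist), vec.transform(nonspecialist))
    print(sim)  # low similarity flags note types where performance may degrade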
Affiliation(s)
- Kevin Xie: Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Samuel W Terman: Department of Neurology, University of Michigan, Ann Arbor, Michigan 48109, USA.
- Ryan S Gallagher: Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Chloe E Hill: Department of Neurology, University of Michigan, Ann Arbor, Michigan 48109, USA.
- Kathryn A Davis: Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Department of Neurology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Brian Litt: Department of Bioengineering, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Department of Neurology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Dan Roth: Department of Computer and Information Science, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
- Colin A Ellis: Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA; Department of Neurology, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA.
7
Chen RJ, Wang JJ, Williamson DFK, Chen TY, Lipkova J, Lu MY, Sahai S, Mahmood F. Algorithmic fairness in artificial intelligence for medicine and healthcare. Nat Biomed Eng 2023; 7:719-742. [PMID: 37380750] [PMCID: PMC10632090] [DOI: 10.1038/s41551-023-01056-8]
Abstract
In healthcare, the development and deployment of insufficiently fair systems of artificial intelligence (AI) can undermine the delivery of equitable care. Assessments of AI models stratified across subpopulations have revealed inequalities in how patients are diagnosed, treated and billed. In this Perspective, we outline fairness in machine learning through the lens of healthcare, and discuss how algorithmic biases (in data acquisition, genetic variation and intra-observer labelling variability, in particular) arise in clinical workflows and the resulting healthcare disparities. We also review emerging technology for mitigating biases via disentanglement, federated learning and model explainability, and their role in the development of AI-based software as a medical device.
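A first step toward the stratified assessments described above is simply computing performance within each subpopulation; a sketch with synthetic scores and group labels:

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def subgroup_audit(y_true, y_score, groups):
        """AUROC per subgroup; large gaps are a first signal of inequitable performance."""
        return {g: roc_auc_score(y_true[groups == g], y_score[groups == g])
                for g in np.unique(groups)}

    rng = np.random.default_rng(2)
    y = rng.integers(0, 2, 1000)
    score = np.clip(y * 0.6 + rng.normal(0.2, 0.3, 1000), 0, 1)
    groups = rng.choice(["A", "B"], 1000)
    print(subgroup_audit(y, score, groups))

Gaps surfaced this way motivate the mitigation techniques the Perspective reviews, such as disentanglement and federated learning.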
Affiliation(s)
- Richard J Chen: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA; Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA.
- Judy J Wang: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Boston University School of Medicine, Boston, MA, USA.
- Drew F K Williamson: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA.
- Tiffany Y Chen: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA.
- Jana Lipkova: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA.
- Ming Y Lu: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA; Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA; Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
- Sharifa Sahai: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
- Faisal Mahmood: Department of Pathology, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA; Cancer Program, Broad Institute of Harvard and Massachusetts Institute of Technology, Cambridge, MA, USA; Cancer Data Science Program, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Harvard Data Science Initiative, Harvard University, Cambridge, MA, USA.
8
Guo LL, Steinberg E, Fleming SL, Posada J, Lemmon J, Pfohl SR, Shah N, Fries J, Sung L. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci Rep 2023; 13:3767. [PMID: 36882576] [PMCID: PMC9992466] [DOI: 10.1038/s41598-023-30820-8]
Abstract
Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on the EHRs of up to 1.8 million patients (382 million coded events) collected within predetermined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve, and absolute calibration error. Both transformer- and recurrent-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there is observable degradation of discrimination performance (average AUROC decay of 3% for the transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
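The decay measurement underlying these results can be sketched independently of any foundation model: train once on an early year group, then track discrimination in later groups. The synthetic drift below is an assumption; in the paper the inputs would be count-based vectors or pretrained-encoder representations.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(3)

    def year_group(n, drift):
        """Synthetic cohort whose feature-outcome relationship drifts over time."""
        X = rng.normal(size=(n, 30))
        w = np.zeros(30)
        w[:5] = 1.0
        w[0] -= drift  # coefficient drift across year groups
        y = (rng.random(n) < 1 / (1 + np.exp(-(X @ w)))).astype(int)
        return X, y

    X_id, y_id = year_group(3000, drift=0.0)  # e.g. the 2009-2012 training group
    clf = LogisticRegression(max_iter=1000).fit(X_id, y_id)
    for label, drift in [("ID", 0.0), ("OOD, mild shift", 0.5), ("OOD, strong shift", 1.0)]:
        X, y = year_group(2000, drift)
        print(label, round(roc_auc_score(y, clf.predict_proba(X)[:, 1]), 3))

The paper's claim is that representations from pretrained encoders flatten this decay curve relative to the count-based inputs used here.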
Affiliation(s)
- Lin Lawrence Guo: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada.
- Ethan Steinberg: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Scott Lanyon Fleming: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Jose Posada: Universidad del Norte, Barranquilla, Colombia.
- Joshua Lemmon: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada.
- Stephen R Pfohl: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Nigam Shah: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Jason Fries: Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA.
- Lillian Sung: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada; Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON M5G 1X8, Canada.
9
Lemmon J, Guo LL, Posada J, Pfohl SR, Fries J, Fleming SL, Aftandilian C, Shah N, Sung L. Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Methods Inf Med 2023; 62:60-70. [PMID: 36812932] [DOI: 10.1055/s-0043-1762904]
Abstract
BACKGROUND Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. CONCLUSIONS While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.
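The L1 route described above (select features on old data, retrain the parsimonious model on new data, and compare against an all-feature oracle) fits in a few lines; the synthetic year groups below stand in for MIMIC-IV under a simple drift assumption.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(4)

    def cohort(n, drift):
        X = rng.normal(size=(n, 200))
        w = np.zeros(200)
        w[:8] = 1.0
        w[0] -= drift
        y = (rng.random(n) < 1 / (1 + np.exp(-(X @ w)))).astype(int)
        return X, y

    X_old, y_old = cohort(3000, 0.0)   # stands in for 2008-2010
    X_new, y_new = cohort(3000, 1.0)   # stands in for 2017-2019

    # L1 feature selection on the old year group
    sel = np.flatnonzero(
        LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
        .fit(X_old, y_old).coef_[0])
    X_tr, X_te = X_new[:2000], X_new[2000:]
    y_tr, y_te = y_new[:2000], y_new[2000:]
    parsimonious = LogisticRegression(max_iter=1000).fit(X_tr[:, sel], y_tr)
    oracle = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # all features, OOD-trained
    print(len(sel), "features kept;",
          round(roc_auc_score(y_te, parsimonious.predict_proba(X_te[:, sel])[:, 1]), 3),
          "vs oracle",
          round(roc_auc_score(y_te, oracle.predict_proba(X_te)[:, 1]), 3))

Parity between the last two numbers is the "retraining reaches the oracle" finding reported above.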
Affiliation(s)
- Joshua Lemmon: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada.
- Lin Lawrence Guo: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada.
- Jose Posada: Biomedical Informatics Research, Stanford University, Palo Alto, California, United States; Department of Systems Engineering, Universidad del Norte, Barranquilla, Atlantico, Colombia.
- Stephen R Pfohl: Biomedical Informatics Research, Stanford University, Palo Alto, California, United States.
- Jason Fries: Biomedical Informatics Research, Stanford University, Palo Alto, California, United States.
- Scott Lanyon Fleming: Biomedical Informatics Research, Stanford University, Palo Alto, California, United States.
- Catherine Aftandilian: Division of Pediatric Hematology/Oncology, Stanford University, Palo Alto, California, United States.
- Nigam Shah: Biomedical Informatics Research, Stanford University, Palo Alto, California, United States.
- Lillian Sung: Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada; Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, Ontario, Canada.
10
Thakur A, Armstrong J, Youssef A, Eyre D, Clifton DA. Self-Aware SGD: Reliable Incremental Adaptation Framework For Clinical AI Models. IEEE J Biomed Health Inform 2023; 27:1624-1634. [PMID: 37022415] [DOI: 10.1109/jbhi.2023.3237592]
Abstract
Healthcare is dynamic as demographics, diseases, and therapeutics constantly evolve. This dynamic nature induces inevitable distribution shifts in populations targeted by clinical AI models, often rendering them ineffective. Incremental learning provides an effective method of adapting deployed clinical models to accommodate these contemporary distribution shifts. However, since incremental learning involves modifying a deployed or in-use model, it can be considered unreliable as any adverse modification due to maliciously compromised or incorrectly labelled data can make the model unsuitable for the targeted application. This paper introduces self-aware stochastic gradient descent (SGD), an incremental deep learning algorithm that utilises a contextual bandit-like sanity check to only allow reliable modifications to a model. The contextual bandit analyses incremental gradient updates to isolate and filter unreliable gradients. This behaviour allows self-aware SGD to balance incremental training and integrity of a deployed model. Experimental evaluations on the Oxford University Hospital datasets highlight that self-aware SGD can provide reliable incremental updates for overcoming distribution shifts in challenging conditions induced by label noise.
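The paper's contextual-bandit filter is more elaborate than can be reproduced here, but its core idea, accepting an incremental update only when it does not harm performance on trusted data, can be sketched for logistic regression; the validation-gated rule below is a simplification under that assumption, not the published algorithm.

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def guarded_sgd_step(w, X_batch, y_batch, X_val, y_val, lr=0.1):
        """Apply a gradient step only if it does not worsen loss on a trusted
        validation set; otherwise reject the (possibly corrupted) batch."""
        def loss(w_, X, y):
            p = np.clip(sigmoid(X @ w_), 1e-7, 1 - 1e-7)
            return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

        grad = X_batch.T @ (sigmoid(X_batch @ w) - y_batch) / len(y_batch)
        w_new = w - lr * grad
        accepted = loss(w_new, X_val, y_val) <= loss(w, X_val, y_val)
        return (w_new, True) if accepted else (w, False)

    # usage: w, ok = guarded_sgd_step(w, Xb, yb, X_val, y_val) for each incoming batch;
    # rejected batches leave the deployed model untouched.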
11
Xu W, Huo J, Cheng G, Fu J, Huang X, Feng J, Jiang J. Association between different concentrations of human serum albumin and 28-day mortality in intensive care patients with sepsis: A propensity score matching analysis. Front Pharmacol 2022; 13:1037893. [PMID: 36578542] [PMCID: PMC9792095] [DOI: 10.3389/fphar.2022.1037893]
Abstract
Background: Human serum albumin (HSA) is a commonly used medication for the treatment of sepsis. However, there is no conclusive evidence as to whether different concentrations of HSA are associated with patient prognosis. This study aimed to evaluate the association between different concentrations of HSA and 28-day mortality in patients with sepsis. Methods: The data for this retrospective study were collected from the Medical Information Mart for Intensive Care IV database. Patients with sepsis were divided into two groups according to the concentration of HSA received: 25% and 5% HSA. The primary outcome of this study was the 28-day mortality in patients with sepsis. To ensure the robustness of our findings, we used multivariate Cox regression, propensity score matching, double-robust estimation, and inverse probability weighting models. Results: A total of 76,943 patients were screened, of whom 5,009 were enrolled; 1,258 and 3,751 patients received 25% and 5% HSA, respectively. The 28-day mortality rate was 38.2% (481/1,258) for patients in the 25% HSA group and 8.7% (325/3,751) for patients in the 5% HSA group. After propensity score matching, 1,648 patients were identified. The inverse probability weighting model suggested that receipt of 5% HSA was associated with lower 28-day mortality (hazard ratio [HR]: 0.63, 95% confidence interval [CI]: 0.54-0.73, p < 0.001). Subgroup and sensitivity analyses confirmed the robustness of the results. Conclusion: In patients with sepsis, receipt of 5% HSA may be associated with a lower risk of 28-day mortality than 25% HSA. Further randomized controlled trials are required to confirm this association.
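Of the adjustment methods listed, inverse probability weighting is the most self-contained to sketch; below, a propensity model reweights a synthetic cohort in which sicker patients preferentially receive 25% HSA. The data-generating assumptions are invented, and the paper's outcome model is a weighted Cox fit rather than this simple risk difference.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def iptw_risk_difference(X, treated, died):
        """IPTW-weighted 28-day mortality difference; treated=1 for 25% HSA."""
        ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
        ps = np.clip(ps, 0.01, 0.99)                      # trim extreme propensities
        w = np.where(treated == 1, 1 / ps, 1 / (1 - ps))  # inverse-probability weights
        risk1 = np.average(died[treated == 1], weights=w[treated == 1])
        risk0 = np.average(died[treated == 0], weights=w[treated == 0])
        return risk1 - risk0

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2000, 5))                        # baseline confounders
    p_treat = 1 / (1 + np.exp(-1.5 * X[:, 0]))            # severity drives 25% HSA use
    treated = (rng.random(2000) < p_treat).astype(int)
    p_die = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * treated - 2)))
    died = (rng.random(2000) < p_die).astype(int)
    print(iptw_risk_difference(X, treated, died))         # confounding-adjusted difference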
Affiliation(s)
- Weigan Xu: Department of Emergency, First People's Hospital of Foshan, Foshan, China; The Poison Treatment Centre of Foshan, First People's Hospital of Foshan, Foshan, China.
- Jianyang Huo: Department of Emergency, First People's Hospital of Foshan, Foshan, China.
- Guojun Cheng: Department of Emergency, First People's Hospital of Foshan, Foshan, China.
- Juan Fu: Department of Emergency, First People's Hospital of Foshan, Foshan, China.
- Xiangqing Huang: Department of Emergency, First People's Hospital of Foshan, Foshan, China.
- Jinxia Feng: Department of Emergency, First People's Hospital of Foshan, Foshan, China.
- Jun Jiang: The Poison Treatment Centre of Foshan, First People's Hospital of Foshan, Foshan, China.
12
Galuzio PP, Cherif A. Recent Advances and Future Perspectives in the Use of Machine Learning and Mathematical Models in Nephrology. Adv Chronic Kidney Dis 2022; 29:472-479. [PMID: 36253031] [DOI: 10.1053/j.ackd.2022.07.002]
Abstract
We reviewed some of the latest advancements in the use of mathematical models in nephrology. We examined two distinct categories of mathematical models that are widely used in biological research and pointed out some of their strengths and weaknesses when applied to health care, especially in the context of nephrology. A mechanistic dynamical system allows the representation of causal relations among the system variables, but with a more complex and longer development/implementation phase. Artificial intelligence/machine learning provides predictive tools that allow identifying correlative patterns in large data sets, but they are usually harder-to-interpret black boxes. Chronic kidney disease (CKD), a major worldwide health problem, generates copious quantities of data that can be leveraged by choice of the appropriate model; likewise, the large number of dialysis parameters that must be determined at every treatment session can benefit from predictive mechanistic models. The next important steps in the use of mathematical methods in medical science may lie at the intersection of these seemingly antagonistic frameworks, leveraging the strengths of each to provide better care.
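As a concrete example of the mechanistic category, single-compartment urea kinetics during a dialysis session is among the simplest dynamical systems used in nephrology; the parameter values below are illustrative assumptions.

    from scipy.integrate import solve_ivp

    V = 40.0   # urea distribution volume (L)
    K = 0.25   # dialyzer clearance (L/min)
    G = 5.0    # urea generation rate (mg/min)

    def urea(t, C):
        # single-compartment mass balance: V * dC/dt = G - K * C
        return [(G - K * C[0]) / V]

    sol = solve_ivp(urea, (0, 240), [1000.0])  # 4-hour session, C0 = 1000 mg/L
    print(f"post-dialysis urea: {sol.y[0, -1]:.0f} mg/L, Kt/V = {K * 240 / V:.2f}")

Unlike a fitted black-box predictor, every parameter here has a physiological meaning, which is the interpretability trade-off the review highlights.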
Affiliation(s)
| | - Alhaji Cherif
- Research Division, Renal Research Institute, New York, NY.
| |
13
Monteith S, Glenn T, Geddes J, Whybrow PC, Achtyes E, Bauer M. Expectations for Artificial Intelligence (AI) in Psychiatry. Curr Psychiatry Rep 2022; 24:709-721. [PMID: 36214931] [PMCID: PMC9549456] [DOI: 10.1007/s11920-022-01378-5]
Abstract
PURPOSE OF REVIEW Artificial intelligence (AI) is often presented as a transformative technology for clinical medicine even though the current technology maturity of AI is low. The purpose of this narrative review is to describe the complex reasons for the low technology maturity and set realistic expectations for the safe, routine use of AI in clinical medicine. RECENT FINDINGS For AI to be productive in clinical medicine, many diverse factors that contribute to the low maturity level need to be addressed. These include technical problems such as data quality, dataset shift, black-box opacity, validation and regulatory challenges, and human factors such as a lack of education in AI, workflow changes, automation bias, and deskilling. There will also be new and unanticipated safety risks with the introduction of AI. The solutions to these issues are complex and will take time to discover, develop, validate, and implement. However, addressing the many problems in a methodical manner will expedite the safe and beneficial use of AI to augment medical decision making in psychiatry.
Affiliation(s)
- Scott Monteith: Michigan State University College of Human Medicine, Traverse City Campus, Traverse City, MI 49684, USA.
- John Geddes: Department of Psychiatry, University of Oxford, Warneford Hospital, Oxford, UK.
- Peter C Whybrow: Department of Psychiatry and Biobehavioral Sciences, Semel Institute for Neuroscience and Human Behavior, University of California Los Angeles (UCLA), Los Angeles, CA, USA.
- Eric Achtyes: Michigan State University College of Human Medicine, Grand Rapids, MI, USA; Network180, Grand Rapids, MI, USA.
- Michael Bauer: Department of Psychiatry and Psychotherapy, University Hospital Carl Gustav Carus, Medical Faculty, Technische Universität Dresden, Dresden, Germany.
14
Wang HE, Landers M, Adams R, Subbaswamy A, Kharrazi H, Gaskin DJ, Saria S. A bias evaluation checklist for predictive models and its pilot application for 30-day hospital readmission models. J Am Med Inform Assoc 2022; 29:1323-1333. [PMID: 35579328] [PMCID: PMC9277650] [DOI: 10.1093/jamia/ocac065]
Abstract
Objective Health care providers increasingly rely upon predictive algorithms when making important treatment decisions; however, evidence indicates that these tools can lead to inequitable outcomes across racial and socio-economic groups. In this study, we introduce a bias evaluation checklist that allows model developers and health care providers a means to systematically appraise a model's potential to introduce bias. Materials and Methods Our methods include developing a bias evaluation checklist, a scoping literature review to identify 30-day hospital readmission prediction models, and assessing the selected models using the checklist. Results We selected 4 models for evaluation: LACE, HOSPITAL, Johns Hopkins ACG, and HATRIX. Our assessment identified critical ways in which these algorithms can perpetuate health care inequalities. We found that LACE and HOSPITAL have the greatest potential for introducing bias, Johns Hopkins ACG has the most areas of uncertainty, and HATRIX has the fewest causes for concern. Discussion Our approach gives model developers and health care providers a practical and systematic method for evaluating bias in predictive models. Traditional bias identification methods do not elucidate sources of bias and are thus insufficient for mitigation efforts. With our checklist, bias can be addressed and eliminated before a model is fully developed or deployed. Conclusion The potential for algorithms to perpetuate biased outcomes is not isolated to readmission prediction models; rather, we believe our results have implications for predictive models across health care. We offer a systematic method for evaluating potential bias with sufficient flexibility to be utilized across models and applications.
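The checklist itself is qualitative, but its bookkeeping is easy to make machine-readable; the items and domains below are invented placeholders, not the published checklist content.

    from dataclasses import dataclass

    @dataclass
    class ChecklistItem:
        question: str   # e.g. "Is prior utilization used as a proxy for need?"
        domain: str     # e.g. "label choice", "training data", "deployment"
        answer: str = "unknown"  # "low risk" / "high risk" / "unknown"

    def summarize(items):
        """Tally answers per domain; 'unknown' answers flag areas of uncertainty,
        analogous to the Johns Hopkins ACG finding above."""
        tally = {}
        for it in items:
            tally.setdefault(it.domain, []).append(it.answer)
        return tally

    checklist = [
        ChecklistItem("Does the training cohort underrepresent any group?",
                      "training data", "high risk"),
        ChecklistItem("Is prior utilization used as a proxy for need?",
                      "label choice"),
    ]
    print(summarize(checklist))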
Affiliation(s)
- Hadi Kharrazi: Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA.
- Darrell J Gaskin: Department of Health Policy and Management, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA.
- Suchi Saria: Department of Computer Science and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland, USA.