1
Schaekermann M, Spitz T, Pyles M, Cole-Lewis H, Wulczyn E, Pfohl SR, Martin D, Jaroensri R, Keeling G, Liu Y, Farquhar S, Xue Q, Lester J, Hughes C, Strachan P, Tan F, Bui P, Mermel CH, Peng LH, Matias Y, Corrado GS, Webster DR, Virmani S, Semturs C, Liu Y, Horn I, Cameron Chen PH. Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study. EClinicalMedicine 2024;70:102479. [PMID: 38685924; PMCID: PMC11056401; DOI: 10.1016/j.eclinm.2024.102479]
Abstract
Background Artificial intelligence (AI) has repeatedly been shown to encode historical inequities in healthcare. We aimed to develop a framework to quantitatively assess the performance equity of health AI technologies and to illustrate its utility via a case study. Methods Here, we propose a methodology, complementary to existing fairness metrics, to assess whether health AI technologies prioritise performance for patient populations experiencing worse outcomes. We developed the Health Equity Assessment of machine Learning performance (HEAL) framework, designed to quantitatively assess the performance equity of health AI technologies via a four-step interdisciplinary process to understand and quantify domain-specific criteria, and the resulting HEAL metric. As an illustrative case study (analysis conducted between October 2022 and January 2023), we applied the HEAL framework to a dermatology AI model. A set of 5420 teledermatology cases (store-and-forward cases from patients aged 20 years or older, submitted by primary care providers in the USA and skin cancer clinics in Australia), enriched for diversity in age, sex and race/ethnicity, was used to retrospectively evaluate the AI model's HEAL metric, defined as the likelihood that the AI model performs better for subpopulations with worse average health outcomes than for others. The likelihood that AI performance was anticorrelated with pre-existing health outcomes was estimated using bootstrap methods as the probability that the negated Spearman's rank correlation coefficient ("R") was greater than zero. Positive values of R suggest that subpopulations with poorer health outcomes have better AI model performance. Thus, the HEAL metric, defined as p(R > 0), measures how likely the AI technology is to prioritise performance for subpopulations with worse average health outcomes (presented as a percentage below).
Health outcomes were quantified as disability-adjusted life years (DALYs) when grouping by sex and age, and years of life lost (YLLs) when grouping by race/ethnicity. AI performance was measured as top-3 agreement with the reference diagnosis from a panel of three dermatologists per case. Findings Across all dermatologic conditions, the HEAL metric was 80.5% for prioritising AI performance of racial/ethnic subpopulations based on YLLs, and 92.1% and 0.0%, respectively, for prioritising AI performance of sex and age subpopulations based on DALYs. Certain dermatologic conditions were significantly associated with greater AI model performance compared with a reference category of less common conditions. For skin cancer conditions, the HEAL metric was 73.8% for prioritising AI performance of age subpopulations based on DALYs. Interpretation Analysis using the proposed HEAL framework showed that the dermatology AI model prioritised performance for race/ethnicity, sex (all conditions) and age (cancer conditions) subpopulations with respect to pre-existing health disparities. More work is needed to investigate ways of promoting equitable AI performance across age for non-cancer conditions and to better understand how AI models can contribute towards improving equity in health outcomes. Funding Google LLC.
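The bootstrap estimate of p(R > 0) described in the abstract can be sketched as follows. All subgroup labels, burden values, case counts and agreement rates below are invented for illustration, and the health-outcome axis is oriented so that higher values mean better health (the negative of DALY/YLL burden):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical subgroups: DALY/YLL burden (higher = worse pre-existing
# outcomes) and simulated per-case top-3 agreement indicators.
burden = {"A": 1200.0, "B": 950.0, "C": 700.0, "D": 400.0}
top3_rate = {"A": 0.85, "B": 0.78, "C": 0.72, "D": 0.65}
cases = {g: rng.binomial(1, top3_rate[g], size=300) for g in burden}

def heal_metric(burden, cases, n_boot=2000):
    """Estimate p(R > 0), where R is the negated Spearman correlation between
    health outcomes (oriented so higher = better, i.e. -burden) and per-group
    AI performance; equivalently R = +Spearman(burden, performance)."""
    groups = list(burden)
    b = np.array([burden[g] for g in groups])
    hits = 0
    for _ in range(n_boot):
        # Resample cases with replacement within each subgroup, then
        # recompute that subgroup's top-3 agreement.
        perf = np.array([rng.choice(cases[g], size=len(cases[g])).mean()
                         for g in groups])
        rho, _ = spearmanr(-b, perf)  # correlation with health outcomes
        if -rho > 0:
            hits += 1
    return hits / n_boot

print(f"HEAL metric ~ {heal_metric(burden, cases):.1%}")
```

Because the invented agreement rates rise with burden, this toy HEAL metric comes out high; the paper's case study computes the analogous quantity over real teledermatology subgroups.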
Affiliation(s)
- Malcolm Pyles
- Advanced Clinical, Deerfield, IL, USA
- Department of Dermatology, Cleveland Clinic, Cleveland, OH, USA
- Yuan Liu
- Google Health, Mountain View, CA, USA
- Jenna Lester
- Advanced Clinical, Deerfield, IL, USA
- Department of Dermatology, University of California, San Francisco, CA, USA
- Peggy Bui
- Google Health, Mountain View, CA, USA
- Yun Liu
- Google Health, Mountain View, CA, USA
- Ivor Horn
- Google Health, Mountain View, CA, USA
2
Lemmon J, Guo LL, Steinberg E, Morse KE, Fleming SL, Aftandilian C, Pfohl SR, Posada JD, Shah N, Fries J, Sung L. Self-supervised machine learning using adult inpatient data produces effective models for pediatric clinical prediction tasks. J Am Med Inform Assoc 2023;30:2004-2011. [PMID: 37639620; PMCID: PMC10654865; DOI: 10.1093/jamia/ocad175]
Abstract
OBJECTIVE Development of electronic health records (EHR)-based machine learning models for pediatric inpatients is challenged by limited training data. Self-supervised learning using adult data may be a promising approach to creating robust pediatric prediction models. The primary objective was to determine whether a self-supervised model trained in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients for pediatric inpatient clinical prediction tasks. MATERIALS AND METHODS This retrospective cohort study used EHR data and included patients with at least one admission to an inpatient unit. One admission per patient was randomly selected. Adult inpatients were 18 years or older, while pediatric inpatients were older than 28 days and younger than 18 years. Admissions were temporally split into training (January 1, 2008 to December 31, 2019), validation (January 1, 2020 to December 31, 2020), and test (January 1, 2021 to August 1, 2022) sets. The primary comparison was a self-supervised model trained in adult inpatients versus count-based logistic regression models trained in pediatric inpatients. The primary outcome was mean area under the receiver operating characteristic curve (AUROC) across 11 distinct clinical outcomes. Models were evaluated in pediatric inpatients. RESULTS When evaluated in pediatric inpatients, the mean AUROC of the self-supervised model trained in adult inpatients (0.902) was noninferior to that of count-based logistic regression models trained in pediatric inpatients (0.868) (mean difference = 0.034; 95% CI, 0.014-0.057; P < .001 for noninferiority and P = .006 for superiority). CONCLUSIONS Self-supervised learning in adult inpatients was noninferior to logistic regression models trained in pediatric inpatients. This finding suggests transferability of self-supervised models trained in adult patients to pediatric patients, without requiring costly model retraining.
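A bootstrap noninferiority comparison of paired AUROCs like the one reported above can be sketched as follows. The labels, score distributions and the 0.05 margin are invented stand-ins, not the study's data or its prespecified margin:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for a pediatric test set scored by two models.
n = 2000
y = rng.binomial(1, 0.2, size=n)
s_adult_ssl = y * 0.8 + rng.normal(0, 0.3, n)  # adult-pretrained model scores
s_ped_lr = y * 0.5 + rng.normal(0, 0.3, n)     # pediatric count-based LR scores

def auroc(y, s):
    """Rank-based AUROC (normalized Mann-Whitney U statistic)."""
    ranks = np.empty(len(s))
    ranks[np.argsort(s)] = np.arange(1, len(s) + 1)
    n1 = y.sum()
    n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

def noninferior(y, s_new, s_base, margin=0.05, n_boot=1000):
    """Bootstrap the paired AUROC difference (new - baseline); declare
    noninferiority if the lower 2.5% bound exceeds -margin."""
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))
        if y[idx].sum() in (0, len(y)):  # AUROC needs both classes present
            continue
        diffs.append(auroc(y[idx], s_new[idx]) - auroc(y[idx], s_base[idx]))
    lower = np.quantile(diffs, 0.025)
    return lower > -margin, lower

ok, lower = noninferior(y, s_adult_ssl, s_ped_lr)
print(ok, round(lower, 3))
```

Resampling admissions jointly for both models preserves the pairing, which is what makes the difference-of-AUROC interval meaningful.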
Affiliation(s)
- Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada
- Ethan Steinberg
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States
- Keith E Morse
- Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States
- Scott Lanyon Fleming
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States
- Catherine Aftandilian
- Division of Hematology/Oncology, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States
- Jose D Posada
- Universidad del Norte, Barranquilla 081007, Colombia
- Nigam Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States
- Jason Fries
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94305, United States
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, ON M5G1X8, Canada
3
Arora A, Alderman JE, Palmer J, Ganapathi S, Laws E, McCradden MD, Oakden-Rayner L, Pfohl SR, Ghassemi M, McKay F, Treanor D, Rostamzadeh N, Mateen B, Gath J, Adebajo AO, Kuku S, Matin R, Heller K, Sapey E, Sebire NJ, Cole-Lewis H, Calvert M, Denniston A, Liu X. The value of standards for health datasets in artificial intelligence-based applications. Nat Med 2023;29:2929-2938. [PMID: 37884627; PMCID: PMC10667100; DOI: 10.1038/s41591-023-02608-w]
Abstract
Artificial intelligence as a medical device is increasingly being applied to healthcare for diagnosis, risk stratification and resource allocation. However, a growing body of evidence has highlighted the risk of algorithmic bias, which may perpetuate existing health inequity. This problem arises in part because of systemic inequalities in dataset curation, unequal opportunity to participate in research and inequalities of access. This study aims to explore existing standards, frameworks and best practices for ensuring adequate data diversity in health datasets. Exploring the body of existing literature and expert views is an important step towards the development of consensus-based guidelines. The study comprises two parts: a systematic review of existing standards, frameworks and best practices for healthcare datasets; and a survey and thematic analysis of stakeholder views of bias, health equity and best practices for artificial intelligence as a medical device. We found that the need for dataset diversity was well described in the literature, and experts generally favored the development of a robust set of guidelines, but there were mixed views about how these could be implemented practically. The outputs of this study will be used to inform the development of standards for transparency of data diversity in health datasets (the STANDING Together initiative).
Affiliation(s)
- Anmol Arora
- School of Clinical Medicine, University of Cambridge, Cambridge, UK
- Joseph E Alderman
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Joanne Palmer
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Elinor Laws
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Melissa D McCradden
- Department of Bioethics, The Hospital for Sick Children, Toronto, Ontario, Canada
- Genetics and Genome Biology, Peter Gilgan Centre for Research and Learning, Toronto, Ontario, Canada
- Dalla Lana School of Public Health, Toronto, Ontario, Canada
- Lauren Oakden-Rayner
- The Australian Institute for Machine Learning, University of Adelaide, Adelaide, South Australia, Australia
- Marzyeh Ghassemi
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Institute for Medical Engineering & Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Vector Institute, Toronto, Ontario, Canada
- Francis McKay
- The Ethox Centre and the Wellcome Centre for Ethics and Humanities, Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Darren Treanor
- Leeds Teaching Hospitals NHS Trust, Leeds, UK
- University of Leeds, Leeds, UK
- Department of Clinical Pathology and Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- Center for Medical Image Science and Visualization, Linköping University, Linköping, Sweden
- Bilal Mateen
- Institute for Health Informatics, University College London, London, UK
- Wellcome Trust, London, UK
- Jacqui Gath
- Patient and Public Involvement and Engagement (PPIE) Group, STANDING Together, Birmingham, UK
- Adewole O Adebajo
- Patient and Public Involvement and Engagement (PPIE) Group, STANDING Together, Birmingham, UK
- Rubeta Matin
- Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- Elizabeth Sapey
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- PIONEER, HDR UK Hub in Acute Care, Institute of Inflammation and Ageing, University of Birmingham, Birmingham, UK
- Neil J Sebire
- National Institute for Health and Care Research, Great Ormond Street Hospital Biomedical Research Centre, London, UK
- Great Ormond Street Institute of Child Health, University Hospital London, London, UK
- Melanie Calvert
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
- Centre for Patient Reported Outcomes Research, Institute of Applied Health Research, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- National Institute for Health and Care Research Applied Research Collaboration West Midlands, University of Birmingham, Birmingham, UK
- National Institute for Health and Care Research Birmingham-Oxford Blood and Transplant Research Unit in Precision Transplant and Cellular Therapeutics, University of Birmingham, Birmingham, UK
- DEMAND Hub, University of Birmingham, Birmingham, UK
- UK SPINE, University of Birmingham, Birmingham, UK
- Alastair Denniston
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
- National Institute for Health and Care Research Biomedical Research Centre, Moorfields Eye Hospital/University College London, London, UK
- Xiaoxuan Liu
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- National Institute for Health and Care Research Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
4
Guo LL, Steinberg E, Fleming SL, Posada J, Lemmon J, Pfohl SR, Shah N, Fries J, Sung L. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci Rep 2023; 13:3767. [PMID: 36882576; PMCID: PMC9992466; DOI: 10.1038/s41598-023-30820-8]
Abstract
Temporal distribution shift negatively impacts the performance of clinical prediction models over time. Pretraining foundation models using self-supervised learning on electronic health records (EHR) may be effective in acquiring informative global patterns that can improve the robustness of task-specific models. The objective was to evaluate the utility of EHR foundation models in improving the in-distribution (ID) and out-of-distribution (OOD) performance of clinical prediction models. Transformer- and gated recurrent unit-based foundation models were pretrained on the EHR of up to 1.8 million patients (382 million coded events) collected within predetermined year groups (e.g., 2009-2012) and were subsequently used to construct patient representations for patients admitted to inpatient units. These representations were used to train logistic regression models to predict hospital mortality, long length of stay, 30-day readmission, and ICU admission. We compared our EHR foundation models with baseline logistic regression models learned on count-based representations (count-LR) in ID and OOD year groups. Performance was measured using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve, and absolute calibration error. Both transformer- and recurrent unit-based foundation models generally showed better ID and OOD discrimination relative to count-LR and often exhibited less decay in tasks where there was observable degradation of discrimination performance (average AUROC decay of 3% for the transformer-based foundation model vs. 7% for count-LR after 5-9 years). In addition, the performance and robustness of transformer-based foundation models continued to improve as pretraining set size increased. These results suggest that pretraining EHR foundation models at scale is a useful approach for developing clinical prediction models that perform well in the presence of temporal distribution shift.
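The count-based representations used by the count-LR baseline above can be illustrated with a minimal sketch; the clinical codes and patient timelines below are invented:

```python
import numpy as np
from collections import Counter

# Hypothetical coded-event timelines per patient. A count-based
# representation simply tallies how often each code occurs, producing the
# feature matrix consumed by the logistic regression baseline.
patients = [
    ["ICD:E11", "RX:metformin", "ICD:E11"],
    ["ICD:I10", "RX:lisinopril"],
    ["ICD:E11", "ICD:I10", "RX:metformin"],
]
vocab = sorted({code for timeline in patients for code in timeline})
col = {code: j for j, code in enumerate(vocab)}

X = np.zeros((len(patients), len(vocab)))
for i, timeline in enumerate(patients):
    for code, count in Counter(timeline).items():
        X[i, col[code]] = count

print(vocab)
print(X)
```

A foundation-model pipeline would instead pretrain a sequence encoder on such timelines and feed its learned patient representations, rather than raw counts, to the same logistic regression head.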
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Ethan Steinberg
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Scott Lanyon Fleming
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Jose Posada
- Universidad del Norte, Barranquilla, Colombia
- Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Nigam Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Jason Fries
- Stanford Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA, USA
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON, M5G1X8, Canada
5
Lemmon J, Guo LL, Posada J, Pfohl SR, Fries J, Fleming SL, Aftandilian C, Shah N, Sung L. Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Methods Inf Med 2023;62:60-70. [PMID: 36812932; DOI: 10.1055/s-0043-1762904]
Abstract
BACKGROUND Temporal dataset shift can cause degradation in model performance as discrepancies between training and deployment data grow over time. The primary objective was to determine whether parsimonious models produced by specific feature selection methods are more robust to temporal dataset shift as measured by out-of-distribution (OOD) performance, while maintaining in-distribution (ID) performance. METHODS Our dataset consisted of intensive care unit patients from MIMIC-IV categorized by year groups (2008-2010, 2011-2013, 2014-2016, and 2017-2019). We trained baseline models using L2-regularized logistic regression on 2008-2010 to predict in-hospital mortality, long length of stay (LOS), sepsis, and invasive ventilation in all year groups. We evaluated three feature selection methods: L1-regularized logistic regression (L1), Remove and Retrain (ROAR), and causal feature selection. We assessed whether a feature selection method could maintain ID performance (2008-2010) and improve OOD performance (2017-2019). We also assessed whether parsimonious models retrained on OOD data performed as well as oracle models trained on all features in the OOD year group. RESULTS The baseline model showed significantly worse OOD performance with the long LOS and sepsis tasks when compared with the ID performance. L1 and ROAR retained 3.7 to 12.6% of all features, whereas causal feature selection generally retained fewer features. Models produced by L1 and ROAR exhibited similar ID and OOD performance as the baseline models. The retraining of these models on 2017-2019 data using features selected from training on 2008-2010 data generally reached parity with oracle models trained directly on 2017-2019 data using all available features. Causal feature selection led to heterogeneous results with the superset maintaining ID performance while improving OOD calibration only on the long LOS task. 
CONCLUSIONS While model retraining can mitigate the impact of temporal dataset shift on parsimonious models produced by L1 and ROAR, new methods are required to proactively improve temporal robustness.
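The L1-based selection-then-retrain workflow evaluated above can be sketched on synthetic data. The feature count, regularization strength and signal structure are invented, and the ROAR and causal feature selection methods are not shown:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for an early training window: 20 features, of which
# only the first three actually drive the binary outcome.
X = rng.normal(size=(1000, 20))
logit = X[:, 0] - 1.5 * X[:, 1] + 0.8 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Step 1: an L1-regularized fit acts as the feature selector; features with
# zero coefficients are dropped.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(l1.coef_[0])

# Step 2: retrain a parsimonious L2 model on the selected features only.
# The same columns would then be reused when retraining on a later window.
l2 = LogisticRegression(penalty="l2", C=1.0).fit(X[:, selected], y)
print("kept", len(selected), "of", X.shape[1], "features:", selected)
```

Freezing the selected columns is what makes the later-window retraining comparable to an oracle model refit on all features.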
Affiliation(s)
- Joshua Lemmon
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
- Jose Posada
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Department of Systems Engineering, Universidad del Norte, Barranquilla, Atlantico, Colombia
- Stephen R Pfohl
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Jason Fries
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Scott Lanyon Fleming
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Catherine Aftandilian
- Division of Pediatric Hematology/Oncology, Stanford University, Palo Alto, California, United States
- Nigam Shah
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Ontario, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, Ontario, Canada
6
Ganapathi S, Palmer J, Alderman JE, Calvert M, Espinoza C, Gath J, Ghassemi M, Heller K, Mckay F, Karthikesalingam A, Kuku S, Mackintosh M, Manohar S, Mateen BA, Matin R, McCradden M, Oakden-Rayner L, Ordish J, Pearson R, Pfohl SR, Rostamzadeh N, Sapey E, Sebire N, Sounderajah V, Summers C, Treanor D, Denniston AK, Liu X. Tackling bias in AI health datasets through the STANDING Together initiative. Nat Med 2022;28:2232-2233. [PMID: 36163296; DOI: 10.1038/s41591-022-01987-w]
Affiliation(s)
- Shaswath Ganapathi
- College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Jo Palmer
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Joseph E Alderman
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Melanie Calvert
- Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
- Centre for Patient Reported Outcome Research, Institute of Applied Health Research, University of Birmingham, Birmingham, UK
- NIHR Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- NIHR Surgical Reconstruction and Microbiology Research Centre, University of Birmingham, Birmingham, UK
- NIHR Applied Research Collaborative West Midlands, University of Birmingham, Birmingham, UK
- Jacqui Gath
- Patient Partner, Birmingham, UK
- Patient Partner, Sheffield, UK
- Marzyeh Ghassemi
- Department of Electrical Engineering and Computer Science; Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Francis Mckay
- The Ethox Centre and the Wellcome Centre for Ethics and Humanities, Nuffield Department of Population Health, University of Oxford, Oxford, UK
- Stephanie Kuku
- Institute of Women's Health, University College London, London, UK
- Hardian Health, London, UK
- Bilal A Mateen
- Institute of Health Informatics, University College London, London, UK
- The Wellcome Trust, London, UK
- Rubeta Matin
- Oxford University Hospitals NHS Foundation Trust, Oxford, UK
- Melissa McCradden
- Department of Bioethics, Hospital for Sick Children, Toronto, Ontario, Canada
- Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
- Lauren Oakden-Rayner
- Australian Institute for Machine Learning, University of Adelaide, Adelaide, South Australia, Australia
- Johan Ordish
- Medicines and Healthcare Products Regulatory Agency, London, UK
- Russell Pearson
- Medicines and Healthcare Products Regulatory Agency, London, UK
- Elizabeth Sapey
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Neil Sebire
- Health Data Research, London, UK
- Great Ormond Street Hospital for Children, London, UK
- Viknesh Sounderajah
- Institute of Global Health Innovation, Imperial College London, London, UK
- Department of Surgery and Cancer, Imperial College London, London, UK
- Charlotte Summers
- Wolfson Lung Injury Unit, Heart and Lung Research Institute, University of Cambridge, Cambridge, UK
- Cambridge University Hospitals NHS Foundation Trust, Cambridge, UK
- Darren Treanor
- Leeds Teaching Hospitals NHS Trust, Leeds, UK
- University of Leeds, Leeds, UK
- Department of Clinical Pathology, and Department of Clinical and Experimental Medicine, Linköping University, Linköping, Sweden
- Center for Medical Image Science and Visualization (CMIV), Linköping University, Linköping, Sweden
- Alastair K Denniston
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
- NIHR Birmingham Biomedical Research Centre, University of Birmingham, Birmingham, UK
- Health Data Research, London, UK
- Xiaoxuan Liu
- University Hospitals Birmingham NHS Foundation Trust, Birmingham, UK
- Institute of Inflammation and Ageing, College of Medical and Dental Sciences, University of Birmingham, Birmingham, UK
- Birmingham Health Partners Centre for Regulatory Science and Innovation, University of Birmingham, Birmingham, UK
7
Foryciarz A, Pfohl SR, Patel B, Shah N. Evaluating algorithmic fairness in the presence of clinical guidelines: the case of atherosclerotic cardiovascular disease risk estimation. BMJ Health Care Inform 2022;29:bmjhci-2021-100460. [PMID: 35396247; PMCID: PMC8996004; DOI: 10.1136/bmjhci-2021-100460]
Abstract
Objectives The American College of Cardiology and the American Heart Association guidelines on primary prevention of atherosclerotic cardiovascular disease (ASCVD) recommend using 10-year ASCVD risk estimation models to initiate statin treatment. For guideline-concordant decision-making, risk estimates need to be calibrated. However, existing models are often miscalibrated for race, ethnicity and sex based subgroups. This study evaluates two algorithmic fairness approaches to adjust the risk estimators (group recalibration and equalised odds) for their compatibility with the assumptions underpinning the guidelines' decision rules. Methods Using an updated pooled cohorts data set, we derive unconstrained, group-recalibrated and equalised odds-constrained versions of the 10-year ASCVD risk estimators, and compare their calibration at guideline-concordant decision thresholds. Results We find that, compared with the unconstrained model, group recalibration improves calibration at one of the relevant thresholds for each group, but exacerbates differences in false positive and false negative rates between groups. An equalised odds constraint, meant to equalise error rates across groups, does so by miscalibrating the model overall and at relevant decision thresholds. Discussion Hence, because of induced miscalibration, decisions guided by risk estimators learned with an equalised odds fairness constraint are not concordant with existing guidelines. Conversely, recalibrating the model separately for each group can increase guideline compatibility, while increasing intergroup differences in error rates. As such, comparisons of error rates across groups can be misleading when guidelines recommend treating at fixed decision thresholds. Conclusion The illustrated tradeoffs between satisfying a fairness criterion and retaining guideline compatibility underscore the need to evaluate models in the context of downstream interventions.
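Group recalibration, one of the two approaches evaluated above, can be sketched with synthetic data. The groups, bias magnitudes and sample sizes are invented, the paper's equalised-odds constraint is not shown, and isotonic regression is just one common choice of recalibration map:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic 10-year-risk estimates that are miscalibrated in opposite
# directions in two hypothetical subgroups.
def make_group(n, bias):
    p_true = rng.uniform(0.01, 0.30, size=n)
    y = rng.binomial(1, p_true)              # observed events
    score = np.clip(p_true + bias, 0.0, 1.0)  # systematically biased estimate
    return score, y

for group, bias in [("group_1", +0.05), ("group_2", -0.03)]:
    score, y = make_group(5000, bias)
    # Group recalibration: fit a monotone map from score to observed event
    # rate separately within each group, leaving the risk ranking unchanged.
    iso = IsotonicRegression(out_of_bounds="clip").fit(score, y)
    recal = iso.predict(score)
    print(group, f"raw {score.mean():.3f}  recalibrated {recal.mean():.3f}  "
                 f"observed {y.mean():.3f}")
```

Recalibrating within each group aligns predicted and observed risk at fixed decision thresholds, but, as the abstract notes, it can simultaneously widen between-group differences in error rates at those thresholds.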
Affiliation(s)
- Agata Foryciarz
- Department of Computer Science, Stanford University School of Engineering, Stanford, California, USA
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- Birju Patel
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- Nigam Shah
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
8
Luo C, Islam MN, Sheils NE, Buresh J, Reps J, Schuemie MJ, Ryan PB, Edmondson M, Duan R, Tong J, Marks-Anglin A, Bian J, Chen Z, Duarte-Salles T, Fernández-Bertolín S, Falconer T, Kim C, Park RW, Pfohl SR, Shah NH, Williams AE, Xu H, Zhou Y, Lautenbach E, Doshi JA, Werner RM, Asch DA, Chen Y. DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models. Nat Commun 2022;13:1678. [PMID: 35354802; PMCID: PMC8967932; DOI: 10.1038/s41467-022-29160-4]
Abstract
Linear mixed models are commonly used in healthcare-based association analyses for analyzing multi-site data with heterogeneous site-specific random effects. Due to regulations for protecting patients’ privacy, sensitive individual patient data (IPD) typically cannot be shared across sites. We propose an algorithm for fitting distributed linear mixed models (DLMMs) without sharing IPD across sites. This algorithm achieves results identical to those achieved using pooled IPD from multiple sites (i.e., the same effect size and standard error estimates), hence demonstrating the lossless property. The algorithm requires each site to contribute minimal aggregated data in only one round of communication. We demonstrate the lossless property of the proposed DLMM algorithm by investigating the associations between demographic and clinical characteristics and length of hospital stay in COVID-19 patients using administrative claims from the UnitedHealth Group Clinical Discovery Database. We extend this association study by incorporating 120,609 COVID-19 patients from 11 collaborative data sources worldwide. In summary, this work presents a lossless, one-shot, and privacy-preserving distributed algorithm for fitting linear mixed models on multi-site data, applied to a study of 120,609 COVID-19 patients using only minimal aggregated data from each of 14 sites.
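For intuition about the lossless one-shot property, here is a minimal sketch for the simpler fixed-effects case (ordinary least squares, not the authors' DLMM, which additionally handles site-specific random effects): each site shares only the aggregates X'X and X'y, and summing them reproduces the pooled-data solution exactly. The function names and toy data are illustrative assumptions.

```python
def site_summaries(X, y):
    # each site shares only aggregates: X'X and X'y, never patient-level rows
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return xtx, xty

def gauss_solve(A, b):
    # solve A x = b by Gaussian elimination with partial pivoting
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def one_shot_fit(summaries):
    # one round of communication: sum the site aggregates, then solve the
    # normal equations (sum X'X) beta = (sum X'y)
    p = len(summaries[0][1])
    XtX = [[sum(s[0][i][j] for s in summaries) for j in range(p)] for i in range(p)]
    Xty = [sum(s[1][i] for s in summaries) for i in range(p)]
    return gauss_solve(XtX, Xty)

# two "sites" generated from y = 2 + 3x (intercept column plus one covariate)
site_a = ([[1, 1], [1, 2], [1, 3]], [5, 8, 11])
site_b = ([[1, 4], [1, 5]], [14, 17])
beta = one_shot_fit([site_summaries(*site_a), site_summaries(*site_b)])
```

The distributed estimate equals the pooled-IPD estimate to floating-point precision, which is the sense in which such algorithms are "lossless" rather than approximate.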
Affiliation(s)
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA; Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
- Jenna Reps
- Janssen Research and Development LLC, Titusville, NJ, USA
- Patrick B Ryan
- Janssen Research and Development LLC, Titusville, NJ, USA
- Mackenzie Edmondson
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Rui Duan
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jiayi Tong
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Arielle Marks-Anglin
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Zhaoyi Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Talita Duarte-Salles
- Fundacio Institut Universitari per a la recerca a l'Atencio Primaria de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Sergio Fernández-Bertolín
- Fundacio Institut Universitari per a la recerca a l'Atencio Primaria de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Thomas Falconer
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Chungsoo Kim
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Rae Woong Park
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Andrew E Williams
- Institute for Clinical Research and Health Policy Studies, Tufts University School of Medicine, Boston, MA, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Yujia Zhou
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ebbing Lautenbach
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA; Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Jalpa A Doshi
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA
- Rachel M Werner
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA; Cpl Michael J Crescenz VA Medical Center, Philadelphia, PA, USA
- David A Asch
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
9
Guo LL, Pfohl SR, Fries J, Johnson AEW, Posada J, Aftandilian C, Shah N, Sung L. Evaluation of domain generalization and adaptation on improving model robustness to temporal dataset shift in clinical medicine. Sci Rep 2022; 12:2726. [PMID: 35177653 PMCID: PMC8854561 DOI: 10.1038/s41598-022-06484-1] [Received: 08/26/2021] [Accepted: 01/31/2022] [Indexed: 11/24/2022]
Abstract
Temporal dataset shift associated with changes in healthcare over time is a barrier to deploying machine learning-based clinical decision support systems. Algorithms that learn robust models by estimating invariant properties across time periods for domain generalization (DG) and unsupervised domain adaptation (UDA) might be suitable to proactively mitigate dataset shift. The objective was to characterize the impact of temporal dataset shift on clinical prediction models and benchmark DG and UDA algorithms on improving model robustness. In this cohort study, intensive care unit patients from the MIMIC-IV database were categorized by year groups (2008–2010, 2011–2013, 2014–2016 and 2017–2019).
Tasks were predicting mortality, long length of stay, sepsis and invasive ventilation. Feedforward neural networks were used as prediction models. The baseline experiment trained models using empirical risk minimization (ERM) on 2008–2010 (ERM[08–10]) and evaluated them on subsequent year groups. The DG experiment trained models using algorithms that estimated invariant properties across 2008–2016 and evaluated them on 2017–2019. The UDA experiment additionally leveraged unlabelled samples from 2017–2019 for unsupervised distribution matching. DG and UDA models were compared to ERM[08–16] models trained on 2008–2016. Main performance measures were the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve, and absolute calibration error. Threshold-based metrics, including false positives and false negatives, were used to assess the clinical impact of temporal dataset shift and its mitigation strategies. In the baseline experiments, dataset shift was most evident for sepsis prediction (maximum AUROC drop, 0.090; 95% confidence interval (CI), 0.080–0.101). Considering a scenario of 100 consecutively admitted patients showed that ERM[08–10] applied to 2017–2019 was associated with one additional false negative among 11 patients with sepsis, when compared to the model applied to 2008–2010. When compared with ERM[08–16], DG and UDA experiments failed to produce more robust models (range of AUROC difference, −0.003 to 0.050). In conclusion, DG and UDA failed to produce more robust models compared to ERM in the setting of temporal dataset shift. Alternate approaches are required to preserve model performance over time in clinical medicine.
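The study's headline quantity, an AUROC drop between year groups, can be sketched with the standard rank-based (Mann-Whitney) AUROC estimator; the scores and cohorts below are made-up assumptions, not MIMIC-IV data:

```python
def auroc(scores, labels):
    # Mann-Whitney estimate: probability a positive case outranks a negative,
    # counting ties as half a win
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical risk scores from a model trained on the 2008-2010 year group
scores_0810, labels_0810 = [0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]
# the same model applied to a later cohort, where its ranking has degraded
scores_1719, labels_1719 = [0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]
drop = auroc(scores_0810, labels_0810) - auroc(scores_1719, labels_1719)
```

A positive `drop` is the discrimination deterioration the baseline ERM[08-10] experiment measures; the DG and UDA comparisons then ask whether that drop shrinks relative to ERM[08-16].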
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Stephen R Pfohl
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
- Jason Fries
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
- Alistair E W Johnson
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada
- Jose Posada
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
- Nigam Shah
- Biomedical Informatics Research, Stanford University, Palo Alto, USA
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, ON, Canada; Division of Haematology/Oncology, The Hospital for Sick Children, 555 University Avenue, Toronto, ON, M5G1X8, Canada
10
Nestsiarovich A, Reps JM, Matheny ME, DuVall SL, Lynch KE, Beaton M, Jiang X, Spotnitz M, Pfohl SR, Shah NH, Torre CO, Reich CG, Lee DY, Son SJ, You SC, Park RW, Ryan PB, Lambert CG. Predictors of diagnostic transition from major depressive disorder to bipolar disorder: a retrospective observational network study. Transl Psychiatry 2021; 11:642. [PMID: 34930903 PMCID: PMC8688463 DOI: 10.1038/s41398-021-01760-6] [Received: 03/12/2021] [Revised: 11/25/2021] [Accepted: 12/01/2021] [Indexed: 12/02/2022]
Abstract
Many patients with bipolar disorder (BD) are initially misdiagnosed with major depressive disorder (MDD) and are treated with antidepressants, whose potential iatrogenic effects are widely discussed. It is unknown whether MDD is a comorbidity of BD or its earlier stage, and no consensus exists on individual conversion predictors, delaying BD's timely recognition and treatment. We aimed to build a predictive model of MDD-to-BD conversion and to validate it across a multi-national network of patient databases using the standardization afforded by the Observational Medical Outcomes Partnership (OMOP) common data model. Five "training" US databases were retrospectively analyzed: IBM MarketScan CCAE, MDCR, MDCD, Optum EHR, and Optum Claims. Cyclops regularized logistic regression models were developed for one-year MDD-to-BD conversion with all standard covariates from the HADES PatientLevelPrediction package. Time-to-conversion Kaplan-Meier analysis was performed up to a decade after MDD, stratified by model-estimated risk. External validation of the final prediction model was performed across nine patient record databases within the Observational Health Data Sciences and Informatics (OHDSI) network internationally. The model's area under the curve (AUC) ranged from 0.633 to 0.745 (µ = 0.689) across the five US training databases. Nine variables predicted one-year MDD-to-BD transition. Factors that increased risk were: younger age, severe depression, psychosis, anxiety, substance misuse, self-harm thoughts/actions, and prior mental disorder. AUCs of the validation datasets ranged from 0.570 to 0.785 (µ = 0.664). An assessment algorithm for MDD-to-BD conversion was built that distinguishes up to 100-fold risk differences among patients and validates well across multiple international data sources.
Affiliation(s)
- Anastasiya Nestsiarovich
- University of New Mexico Health Sciences Center, Department of Internal Medicine, Center for Global Health, Albuquerque, NM, USA
- Jenna M Reps
- Janssen Research and Development, Raritan, NJ, USA
- Michael E Matheny
- Vanderbilt University, Department of Biomedical Informatics, Department of Medicine, Department of Biostatistics, Nashville, TN, USA
- Tennessee Valley Healthcare System VA, Nashville, TN, USA
- Scott L DuVall
- Veterans Affairs Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT, USA
- University of Utah, Department of Internal Medicine, Salt Lake City, UT, USA
- Kristine E Lynch
- Veterans Affairs Informatics and Computing Infrastructure, VA Salt Lake City Health Care System, Salt Lake City, UT, USA
- University of Utah, Department of Internal Medicine, Salt Lake City, UT, USA
- Maura Beaton
- Columbia University Irving Medical Center, Department of Biomedical Informatics, New York, NY, USA
- Xinzhuo Jiang
- Columbia University Irving Medical Center, Department of Biomedical Informatics, New York, NY, USA
- Matthew Spotnitz
- Columbia University Irving Medical Center, Department of Biomedical Informatics, New York, NY, USA
- Stephen R Pfohl
- Stanford University, Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Nigam H Shah
- Stanford University, Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Dong Yun Lee
- Ajou University School of Medicine, Department of Psychiatry, Suwon, Republic of Korea
- Sang Joon Son
- Ajou University School of Medicine, Department of Psychiatry, Suwon, Republic of Korea
- Seng Chan You
- Ajou University School of Medicine, Department of Biomedical Informatics, Suwon, Republic of Korea
- Rae Woong Park
- Ajou University School of Medicine, Department of Biomedical Informatics, Suwon, Republic of Korea
- Patrick B Ryan
- Janssen Research and Development, Raritan, NJ, USA
- Columbia University Irving Medical Center, Department of Biomedical Informatics, New York, NY, USA
- Christophe G Lambert
- University of New Mexico Health Sciences Center, Department of Internal Medicine, Center for Global Health, Albuquerque, NM, USA
- University of New Mexico Health Sciences Center, Department of Internal Medicine, Center for Global Health, Division of Translational Informatics, Albuquerque, NM, USA
11
Patel BS, Steinberg E, Pfohl SR, Shah NH. Learning decision thresholds for risk stratification models from aggregate clinician behavior. J Am Med Inform Assoc 2021; 28:2258-2264. [PMID: 34350942 PMCID: PMC8449610 DOI: 10.1093/jamia/ocab159] [Received: 03/03/2021] [Revised: 06/26/2021] [Accepted: 07/13/2021] [Indexed: 11/22/2022]
Abstract
Using a risk stratification model to guide clinical practice often requires the choice of a cutoff—called the decision threshold—on the model’s output to trigger a subsequent action such as an electronic alert. Choosing this cutoff is not always straightforward. We propose a flexible approach that leverages the collective information in treatment decisions made in real life to learn reference decision thresholds from physician practice. Using the example of prescribing a statin for primary prevention of cardiovascular disease based on 10-year risk calculated by the 2013 pooled cohort equations, we demonstrate the feasibility of using real-world data to learn the implicit decision threshold that reflects existing physician behavior. Learning a decision threshold in this manner allows for evaluation of a proposed operating point against the threshold reflective of the community standard of care. Furthermore, this approach can be used to monitor and audit model-guided clinical decision making following model deployment.
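A heavily simplified sketch of the idea (an assumption for illustration, not necessarily the authors' estimator): given each patient's calculated risk and whether a clinician actually treated them, pick the cutoff that maximises agreement with the observed decisions.

```python
def learn_threshold(risks, treated):
    # predict "treat" whenever risk >= threshold, and choose the threshold
    # that best agrees with what clinicians actually did
    best_t, best_acc = None, -1.0
    for t in sorted(set(risks)):  # candidate cutoffs: the observed risk values
        acc = sum((r >= t) == bool(a) for r, a in zip(risks, treated)) / len(risks)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# hypothetical 10-year ASCVD risks and observed statin decisions
risks = [0.02, 0.05, 0.06, 0.08, 0.10, 0.20]
treated = [0, 0, 0, 1, 1, 1]
threshold, agreement = learn_threshold(risks, treated)
```

In this toy example the learned implicit threshold is 0.08, close to the guideline's 7.5% cutoff; with real prescribing data, decisions are noisy and the agreement-maximising cutoff summarises the community standard of care rather than reproducing any single clinician.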
Affiliation(s)
- Birju S Patel
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Ethan Steinberg
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA
- Corresponding Author: Nigam H. Shah, MBBS, PhD, Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, USA
12
Guo LL, Pfohl SR, Fries J, Posada J, Fleming SL, Aftandilian C, Shah N, Sung L. Systematic Review of Approaches to Preserve Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Appl Clin Inform 2021; 12:808-815. [PMID: 34470057 PMCID: PMC8410238 DOI: 10.1055/s-0041-1735184] [Received: 04/28/2021] [Accepted: 07/12/2021] [Indexed: 10/20/2022]
Abstract
OBJECTIVE The change in performance of machine learning models over time as a result of temporal dataset shift is a barrier to machine learning-derived models facilitating decision-making in clinical practice. Our aim was to describe technical procedures used to preserve the performance of machine learning models in the presence of temporal dataset shifts. METHODS Studies were included if they were fully published articles that used machine learning and implemented a procedure to mitigate the effects of temporal dataset shift in a clinical setting. We described how dataset shift was measured, the procedures used to preserve model performance, and their effects. RESULTS Of 4,457 potentially relevant publications identified, 15 were included. The impact of temporal dataset shift was primarily quantified using changes, usually deterioration, in calibration or discrimination. Calibration deterioration was more common (n = 11) than discrimination deterioration (n = 3). Mitigation strategies were categorized as model level or feature level. Model-level approaches (n = 15) were more common than feature-level approaches (n = 2), with the most common approaches being model refitting (n = 12), probability calibration (n = 7), model updating (n = 6), and model selection (n = 6). In general, all mitigation strategies were successful at preserving calibration but not uniformly successful in preserving discrimination. CONCLUSION There was limited research on preserving the performance of machine learning models in the presence of temporal dataset shift in clinical medicine. Future research could focus on the impact of dataset shift on clinical decision-making, benchmark the mitigation strategies on a wider range of datasets and tasks, and identify optimal strategies for specific settings.
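Probability calibration, one of the commonly tallied mitigation strategies, can be sketched as Platt-style recalibration: refit a slope and intercept on the logit scale of the old model's outputs using a recent labelled window. This is an assumed minimal sketch (gradient descent on the log loss, synthetic data), not code from any reviewed study.

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

def logit(p):
    return math.log(p / (1 - p))

def platt_update(old_preds, outcomes, lr=1.0, steps=5000):
    # refit p_new = expit(a * logit(p_old) + b) by gradient descent on the
    # log loss over a recent labelled window; a=1, b=0 leaves preds unchanged
    a, b = 1.0, 0.0
    z = [logit(p) for p in old_preds]
    n = len(z)
    for _ in range(steps):
        ga = gb = 0.0
        for zi, yi in zip(z, outcomes):
            err = expit(a * zi + b) - yi
            ga += err * zi / n
            gb += err / n
        a -= lr * ga
        b -= lr * gb
    return a, b

# a drifted model now overpredicts: mean prediction 0.45 vs event rate 0.20
old_preds = [0.6] * 10 + [0.3] * 10
outcomes = [1, 1, 1] + [0] * 7 + [1] + [0] * 9
a, b = platt_update(old_preds, outcomes)
new_preds = [expit(a * logit(p) + b) for p in old_preds]
```

Consistent with the review's findings, an update like this repairs calibration-in-the-large but cannot improve discrimination, since a monotone transform of the scores leaves their ranking unchanged.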
Affiliation(s)
- Lin Lawrence Guo
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Canada
- Stephen R. Pfohl
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Jason Fries
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Jose Posada
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Scott Lanyon Fleming
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Catherine Aftandilian
- Division of Pediatric Hematology/Oncology, Stanford University, Palo Alto, United States
- Nigam Shah
- Biomedical Informatics Research, Stanford University, Palo Alto, California, United States
- Lillian Sung
- Program in Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Canada
- Division of Haematology/Oncology, The Hospital for Sick Children, Toronto, Canada
13
Steinberg E, Jung K, Fries JA, Corbin CK, Pfohl SR, Shah NH. Language models are an effective representation learning technique for electronic health record data. J Biomed Inform 2021; 113:103637. [PMID: 33290879 PMCID: PMC7863633 DOI: 10.1016/j.jbi.2020.103637] [Received: 03/31/2020] [Revised: 10/10/2020] [Accepted: 11/26/2020] [Indexed: 11/17/2022]
Abstract
Widespread adoption of electronic health records (EHRs) has fueled the use of machine learning to build prediction models for various clinical outcomes. However, this process is often constrained by the relatively small number of patient records available for training a given model. We demonstrate that patient representation schemes inspired by techniques in natural language processing can increase the accuracy of clinical prediction models by transferring information learned from the entire patient population to the task of training a specific model, for which only a subset of the population is relevant. Such patient representation schemes enable a 3.5% mean improvement in AUROC on five prediction tasks compared to standard baselines, with the average improvement rising to 19% when only a small number of patient records are available for training the clinical prediction model.
Affiliation(s)
- Ethan Steinberg
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA.
- Ken Jung
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
- Jason A Fries
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
- Conor K Corbin
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
- Stephen R Pfohl
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
- Nigam H Shah
- Stanford University, 450 Serra Mall, Stanford, CA 94305, USA
14
Pfohl SR, Foryciarz A, Shah NH. An empirical characterization of fair machine learning for clinical risk prediction. J Biomed Inform 2020; 113:103621. [PMID: 33220494 DOI: 10.1016/j.jbi.2020.103621] [Received: 07/21/2020] [Revised: 10/06/2020] [Accepted: 11/05/2020] [Indexed: 11/19/2022]
Abstract
The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical as well as technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well-understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analysis across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly-universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity of the effect of these procedures on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
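To make the penalization procedure concrete, here is a toy sketch (an assumption for illustration; the paper's experiments use real clinical databases and a broader family of penalties): logistic regression trained with an added term lam * (gap in mean predictions between groups)^2, so that increasing lam shrinks the between-group prediction gap, typically at the cost of within-group fit.

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

def fit(xs, ys, groups, lam, lr=0.05, steps=8000):
    # gradient descent on log loss + lam * (mean_pred_g0 - mean_pred_g1)^2
    w = b = 0.0
    n = len(xs)
    idx0 = [i for i, g in enumerate(groups) if g == 0]
    idx1 = [i for i, g in enumerate(groups) if g == 1]
    for _ in range(steps):
        p = [expit(w * x + b) for x in xs]
        gw = sum((p[i] - ys[i]) * xs[i] for i in range(n)) / n
        gb = sum(p[i] - ys[i] for i in range(n)) / n
        gap = sum(p[i] for i in idx0) / len(idx0) - sum(p[i] for i in idx1) / len(idx1)
        # chain rule through the sigmoid: d mean_pred_g / dw = mean_g p(1-p)x
        d0w = sum(p[i] * (1 - p[i]) * xs[i] for i in idx0) / len(idx0)
        d1w = sum(p[i] * (1 - p[i]) * xs[i] for i in idx1) / len(idx1)
        d0b = sum(p[i] * (1 - p[i]) for i in idx0) / len(idx0)
        d1b = sum(p[i] * (1 - p[i]) for i in idx1) / len(idx1)
        gw += 2 * lam * gap * (d0w - d1w)
        gb += 2 * lam * gap * (d0b - d1b)
        w -= lr * gw
        b -= lr * gb
    return w, b

def mean_gap(w, b, xs, groups):
    p = [expit(w * x + b) for x in xs]
    m0 = sum(pi for pi, g in zip(p, groups) if g == 0) / groups.count(0)
    m1 = sum(pi for pi, g in zip(p, groups) if g == 1) / groups.count(1)
    return abs(m0 - m1)
```

The mechanism behind the abstract's finding is visible even here: the penalty closes the gap mainly by flattening the model, which degrades performance within both groups rather than improving either.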
Affiliation(s)
- Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, United States of America.
- Agata Foryciarz
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, United States of America; Computer Science Department, Stanford University, 353 Jane Stanford Way, Stanford, CA 94305, United States of America
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University, 1265 Welch Road, Stanford, CA 94305, United States of America
15
Wang Q, Reps JM, Kostka KF, Ryan PB, Zou Y, Voss EA, Rijnbeek PR, Chen R, Rao GA, Morgan Stewart H, Williams AE, Williams RD, Van Zandt M, Falconer T, Fernandez-Chas M, Vashisht R, Pfohl SR, Shah NH, Kasthurirathne SN, You SC, Jiang Q, Reich C, Zhou Y. Development and validation of a prognostic model predicting symptomatic hemorrhagic transformation in acute ischemic stroke at scale in the OHDSI network. PLoS One 2020; 15:e0226718. [PMID: 31910437 PMCID: PMC6946584 DOI: 10.1371/journal.pone.0226718] [Received: 09/04/2019] [Accepted: 12/02/2019] [Indexed: 12/26/2022]
Abstract
BACKGROUND AND PURPOSE Hemorrhagic transformation (HT) after cerebral infarction is a complex and multifactorial phenomenon in the acute stage of ischemic stroke, and often results in a poor prognosis. Thus, identifying risk factors and making an early prediction of HT in acute cerebral infarction contributes not only to the selection of a therapeutic regimen but also, more importantly, to the improvement of the prognosis of acute cerebral infarction. The purpose of this study was to develop and validate a model to predict a patient's risk of HT within 30 days of initial ischemic stroke. METHODS We utilized a retrospective multicenter observational cohort study design to develop a Lasso logistic regression prediction model on a large US electronic health record dataset structured according to the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). To examine clinical transportability, the model was externally validated across 10 additional real-world healthcare datasets, including EHR records for patients from America, Europe and Asia. RESULTS In the database on which the model was developed, the target population cohort contained 621,178 patients with ischemic stroke, of whom 5,624 had HT within 30 days following initial ischemic stroke. 612 risk predictors, including the distance a patient travels in an ambulance to get to care for a HT, were identified. An area under the receiver operating characteristic curve (AUC) of 0.75 was achieved in the internal validation of the risk model. External validation was performed across 10 databases totaling 5,515,508 patients with ischemic stroke, of whom 86,401 had HT within 30 days following initial ischemic stroke. The mean external AUC was 0.71 and ranged between 0.60 and 0.78. CONCLUSIONS An HT prognostic prediction model was developed with Lasso logistic regression based on routinely collected EMR data. This model can identify patients who have a higher risk of HT than the population average with an AUC of 0.78. It shows that the OMOP CDM is an appropriate data standard for EMR secondary use in clinical multicenter research for prognostic prediction model development and validation. In the future, combining this model with clinical information systems will assist clinicians in making the right therapy decisions for patients with acute ischemic stroke.
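The Lasso logistic regression used here can be sketched with proximal gradient descent (ISTA) and soft-thresholding, the mechanism that performs variable selection among many candidate predictors. The three-feature toy data, learning rate, and penalty strength below are assumptions for illustration, not the study's 612-predictor model.

```python
import math

def expit(z):
    return 1 / (1 + math.exp(-z))

def soft_threshold(v, t):
    # proximal operator of the L1 penalty: shrink toward zero, clip at zero
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def lasso_logistic(X, y, lam, lr=0.1, steps=4000):
    # ISTA: take a full log-loss gradient step, then soft-threshold each weight
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0  # intercept left unpenalised
    for _ in range(steps):
        preds = [expit(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]
        grad = [sum((preds[i] - y[i]) * X[i][j] for i in range(n)) / n for j in range(p)]
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
        b -= lr * sum(preds[i] - y[i] for i in range(n)) / n
    return w, b

# feature 0 tracks the outcome; features 1 and 2 are noise the penalty should drop
X = [[-1, 1, -1], [-1, -1, 1], [-1, 1, -1], [-1, -1, 1],
     [1, 1, -1], [1, -1, 1], [1, 1, -1], [1, -1, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = lasso_logistic(X, y, lam=0.2)
```

The soft-threshold sets uninformative coefficients exactly to zero, which is how a Lasso model whittles hundreds of candidate predictors down to a sparse risk score.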
Affiliation(s)
- Qiong Wang
- Biomedical Engineering School, Sun Yat-Sen University, Guangzhou, China
- The Third Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
| | - Jenna M. Reps
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
- Janssen Research and Development, Raritan, New Jersey, United States of America
| | - Kristin Feeney Kostka
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
- IQVIA, Durham, North Carolina, United States of America
| | - Patrick B. Ryan
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
- Janssen Research and Development, Raritan, New Jersey, United States of America
- Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
| | - Yuhui Zou
- Department of Neurosurgery, General Hospital of Southern Theatre Command, Guangzhou, China
| | - Erica A. Voss
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
- Janssen Research and Development, Raritan, New Jersey, United States of America
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - Peter R. Rijnbeek
- Observational Health Data Sciences and Informatics, New York, New York, United States of America
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- RuiJun Chen
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
  - Department of Medicine, Weill Cornell Medical College, New York, New York, United States of America
- Gowtham A. Rao
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Janssen Research and Development, Raritan, New Jersey, United States of America
- Henry Morgan Stewart
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - IQVIA, Durham, North Carolina, United States of America
- Andrew E. Williams
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Tufts Medical Center, Institute for Clinical Research and Health Policy Studies, Boston, Massachusetts, United States of America
- Ross D. Williams
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Mui Van Zandt
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - IQVIA, Durham, North Carolina, United States of America
- Thomas Falconer
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Department of Biomedical Informatics, Columbia University, New York, New York, United States of America
- Margarita Fernandez-Chas
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - IQVIA, Durham, North Carolina, United States of America
- Rohit Vashisht
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Stanford Center for Biomedical Informatics Research, Stanford, California, United States of America
- Stephen R. Pfohl
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Stanford Center for Biomedical Informatics Research, Stanford, California, United States of America
- Nigam H. Shah
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Stanford Center for Biomedical Informatics Research, Stanford, California, United States of America
- Suranga N. Kasthurirathne
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Center for Biomedical Informatics, Regenstrief Institute, Indianapolis, Indiana, United States of America
  - Department of Epidemiology, Indiana University Richard M. Fairbanks School of Public Health, Indianapolis, Indiana, United States of America
- Seng Chan You
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Korea
- Qing Jiang
  - Biomedical Engineering School, Sun Yat-Sen University, Guangzhou, China
- Christian Reich
  - Observational Health Data Sciences and Informatics, New York, New York, United States of America
  - IQVIA, Durham, North Carolina, United States of America
- Yi Zhou
  - Department of Biomedical Engineering, Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, China
16
Pfohl SR, Kim RB, Coan GS, Mitchell CS. Unraveling the Complexity of Amyotrophic Lateral Sclerosis Survival Prediction. Front Neuroinform 2018; 12:36. [PMID: 29962944] [PMCID: PMC6010549] [DOI: 10.3389/fninf.2018.00036] [Received: 01/01/2018] [Accepted: 05/28/2018] [Indexed: 12/12/2022] Open
Abstract
Objective: The heterogeneity of amyotrophic lateral sclerosis (ALS) survival duration, which varies from <1 year to >10 years, challenges clinical decisions and trials. Utilizing data from 801 deceased ALS patients, we: (1) assess the underlying complex relationships among common clinical ALS metrics; (2) identify which clinical ALS metrics are the "best" survival predictors and how their predictive ability changes as a function of disease progression. Methods: Analyses included examination of relationships within the raw data as well as the construction of interactive survival regression and classification models (generalized linear model and random forests model). Dimensionality reduction and feature clustering enabled decomposition of clinical variable contributions. Thirty-eight metrics were utilized, including Medical Research Council (MRC) muscle scores; respiratory function, including forced vital capacity (FVC) and FVC % predicted, oxygen saturation, negative inspiratory force (NIF); the Revised ALS Functional Rating Scale (ALSFRS-R) and its activities of daily living (ADL) and respiratory sub-scores; body weight; onset type, onset age, gender, and height. Prognostic random forest models confirm the dominance of decline in age-related patient parameters for classifying survival at thresholds of 30, 60, 90, and 180 days and 1, 2, 3, 4, and 5 years. Results: Collective prognostic insight derived from the overall investigation includes: multi-dimensionality of ALSFRS-R scores suggests cautious usage for survival forecasting; upper and lower extremities independently degenerate and are autonomous from respiratory decline, with the latter associating with nearer-to-death classifications; height and weight-based metrics are auxiliary predictors for farther-from-death classifications; sex and onset site (limb, bulbar) are not independent survival predictors due to age co-correlation.
Conclusion: The dimensionality and fluctuating predictors of ALS survival must be considered when developing predictive models for clinical trial development or in-clinic usage. Additional independent metrics and possible revisions to current metrics, like the ALSFRS-R, are needed to capture the underlying complexity needed for population and personalized forecasting of survival.
Affiliation(s)
- Stephen R Pfohl
  - Department of Biomedical Engineering, Georgia Institute of Technology, Emory University School of Medicine, Atlanta, GA, United States
  - Department of Biomedical Informatics, Stanford University, Stanford, CA, United States
- Renaid B Kim
  - Department of Biomedical Engineering, Georgia Institute of Technology, Emory University School of Medicine, Atlanta, GA, United States
  - Medical Scientist Training Program, University of Michigan Medical School, Ann Arbor, MI, United States
- Grant S Coan
  - Department of Biomedical Engineering, Georgia Institute of Technology, Emory University School of Medicine, Atlanta, GA, United States
  - School of Medicine, University of Texas Health Science Center at San Antonio, San Antonio, TX, United States
- Cassie S Mitchell
  - Department of Biomedical Engineering, Georgia Institute of Technology, Emory University School of Medicine, Atlanta, GA, United States
17
Pfohl SR, Halicek MT, Mitchell CS. Characterization of the Contribution of Genetic Background and Gender to Disease Progression in the SOD1 G93A Mouse Model of Amyotrophic Lateral Sclerosis: A Meta-Analysis. J Neuromuscul Dis 2015; 2:137-150. [PMID: 26594635] [PMCID: PMC4652798] [DOI: 10.3233/jnd-140068] [Indexed: 12/12/2022]
Abstract
Background: The SOD1 G93A mouse model of amyotrophic lateral sclerosis (ALS) is the most frequently used model to examine ALS pathophysiology. There is a lack of homogeneity in usage of the SOD1 G93A mouse, including differences in genetic background and gender, which could confound the field's results. Objective: In an analysis of 97 studies, we characterized the ALS progression for the high transgene copy control SOD1 G93A mouse on the basis of disease onset, overall lifespan, and disease duration for male and female mice on the B6SJL and C57BL/6J genetic backgrounds and quantified magnitudes of differences between groups. Methods: Mean age at onset, onset assessment measure, disease duration, and overall lifespan data from each study were extracted and statistically modeled as the response of a linear regression with sex and genetic background as predictors. Additional examination was performed on differing experimental onset and endpoint assessment measures. Results: C57BL/6 background mice show delayed onset of symptoms, increased lifespan, and an extended disease duration compared with their sex-matched B6SJL counterparts. Female B6SJL mice generally experience extended lifespan and delayed onset compared with their male counterparts, while female mice on the C57BL/6 background show delayed onset but no difference in survival compared with their male counterparts. Finally, different experimental protocols for onset determination (tremor, rotarod, etc.) result in notably different mean onset ages. Conclusions: Overall, the observed effect of sex on disease endpoints was smaller than that attributable to genetic background. The often-reported increase in lifespan for female mice was observed only for mice on the B6SJL background, implicating a strain-dependent effect of sex on disease progression that manifests despite identical mutant SOD1 expression.
Affiliation(s)
- Stephen R Pfohl
  - Georgia Institute of Technology and Emory University School of Medicine, Atlanta, GA, USA
- Martin T Halicek
  - Georgia Institute of Technology and Emory University School of Medicine, Atlanta, GA, USA
- Cassie S Mitchell
  - Georgia Institute of Technology and Emory University School of Medicine, Atlanta, GA, USA