1.
Xie K, Ojemann WKS, Gallagher RS, Shinohara RT, Lucas A, Hill CE, Hamilton RH, Johnson KB, Roth D, Litt B, Ellis CA. Disparities in seizure outcomes revealed by large language models. J Am Med Inform Assoc 2024;31:1348-1355. PMID: 38481027; PMCID: PMC11105138; DOI: 10.1093/jamia/ocae047.
Abstract
OBJECTIVE: Large language models (LLMs) could revolutionize health care delivery and research but risk propagating existing biases or introducing new ones. In epilepsy, social determinants of health are associated with disparities in access to care, but their impact on seizure outcomes among those with access remains unclear. Here we (1) evaluated our validated, epilepsy-specific LLM for intrinsic bias, and (2) used LLM-extracted seizure outcomes to determine whether different demographic groups have different seizure outcomes.
MATERIALS AND METHODS: We tested our LLM for differences and equivalences in prediction accuracy and confidence across demographic groups defined by race, ethnicity, sex, income, and health insurance, using manually annotated notes. Next, we used LLM-classified seizure freedom at each office visit to test for demographic outcome disparities, using univariable and multivariable analyses.
RESULTS: We analyzed 84,675 clinic visits from 25,612 unique patients seen at our epilepsy center. We found little evidence of bias in the prediction accuracy or confidence of outcome classifications across demographic groups. Multivariable analysis indicated worse seizure outcomes for female patients (OR 1.33, P ≤ .001), those with public insurance (OR 1.53, P ≤ .001), and those from lower-income zip codes (OR ≥ 1.22, P ≤ .007). Black patients had worse outcomes than White patients in univariable but not multivariable analysis (OR 1.03, P = .66).
CONCLUSION: We found little evidence that our LLM was intrinsically biased against any demographic group. LLM-extracted seizure freedom revealed disparities in seizure outcomes across several demographic groups. These findings quantify the critical need to reduce disparities in the care of people with epilepsy.
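The odds ratios reported in this abstract come from regression analyses of visit-level outcomes; in the univariable case, an odds ratio reduces to a ratio computed from a 2×2 table. A minimal sketch of that calculation, using illustrative counts rather than any data from the study:

```python
def odds_ratio(a, b, c, d):
    """Unadjusted odds ratio from a 2x2 contingency table:
    a = group 1, outcome present;  b = group 1, outcome absent
    c = group 2, outcome present;  d = group 2, outcome absent
    """
    return (a / b) / (c / d)

# Illustrative counts only (not from the study): 40 of 100 group-1 visits
# and 20 of 100 group-2 visits had the outcome of interest.
print(odds_ratio(40, 60, 20, 80))  # (40/60)/(20/80) = 2.666...
```

An OR above 1 indicates higher odds of the outcome in group 1; the multivariable ORs in the abstract are the adjusted analogue of this quantity.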
Affiliation(s)
- Kevin Xie
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
- William K S Ojemann
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Ryan S Gallagher
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, United States
- Russell T Shinohara
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Alfredo Lucas
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, United States
- Chloé E Hill
  - Department of Neurology, University of Michigan, Ann Arbor, MI 48109, United States
- Roy H Hamilton
  - Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, United States
- Kevin B Johnson
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Pediatrics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Dan Roth
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, United States
- Brian Litt
  - Department of Bioengineering, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, United States
- Colin A Ellis
  - Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA 19104, United States
  - Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, United States
2.
Liu Y, Joly R, Reading Turchioe M, Benda N, Hermann A, Beecy A, Pathak J, Zhang Y. Preparing for the bedside-optimizing a postpartum depression risk prediction model for clinical implementation in a health system. J Am Med Inform Assoc 2024;31:1258-1267. PMID: 38531676; PMCID: PMC11105144; DOI: 10.1093/jamia/ocae056.
Abstract
OBJECTIVE: We developed and externally validated a machine-learning model to predict postpartum depression (PPD) using data from electronic health records (EHRs). Work is under way to implement the PPD prediction model within the EHR system for clinical decision support. We describe the pre-implementation evaluation process, which considered model performance, fairness, and clinical appropriateness.
MATERIALS AND METHODS: We used EHR data from an academic medical center (AMC) and a clinical research network database from 2014 to 2020 to evaluate the predictive performance and net benefit of the PPD risk model. We used area under the curve and sensitivity as predictive performance metrics and conducted a decision curve analysis. In assessing model fairness, we employed metrics such as disparate impact, equal opportunity, and predictive parity, with White race as the privileged value. The model was also reviewed by multidisciplinary experts for clinical appropriateness. Lastly, we debiased the model by comparing 5 debiasing approaches, including fairness through blindness and reweighing.
RESULTS: We determined the classification threshold through a performance evaluation that prioritized sensitivity and through decision curve analysis. The baseline PPD model exhibited some unfairness in the AMC data but performed fairly in the clinical research network data. We revised the model using fairness through blindness, the debiasing approach that yielded the best overall performance and fairness, while accounting for the clinical appropriateness considerations raised by the expert reviewers.
DISCUSSION AND CONCLUSION: The findings emphasize the need for a thorough evaluation of intervention-specific models, considering predictive performance, fairness, and clinical appropriateness, before clinical implementation.
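The three fairness metrics named in this abstract have standard definitions over binary predictions and a binary protected attribute. A minimal sketch under the common formulations (the group coding, with 1 as the privileged group, and the variable names are illustrative assumptions, not taken from the paper):

```python
def disparate_impact(y_pred, group):
    """Ratio of positive-prediction rates: unprivileged (0) over privileged (1).
    Values near 1.0 indicate similar selection rates."""
    unpriv = [p for p, g in zip(y_pred, group) if g == 0]
    priv = [p for p, g in zip(y_pred, group) if g == 1]
    return (sum(unpriv) / len(unpriv)) / (sum(priv) / len(priv))

def equal_opportunity_gap(y_true, y_pred, group):
    """Difference in true-positive rates (sensitivity) between groups."""
    def tpr(g):
        tp = sum(1 for t, p, gg in zip(y_true, y_pred, group)
                 if gg == g and t == 1 and p == 1)
        pos = sum(1 for t, gg in zip(y_true, group) if gg == g and t == 1)
        return tp / pos
    return tpr(1) - tpr(0)

def predictive_parity_gap(y_true, y_pred, group):
    """Difference in positive predictive values (precision) between groups."""
    def ppv(g):
        tp = sum(1 for t, p, gg in zip(y_true, y_pred, group)
                 if gg == g and t == 1 and p == 1)
        pred_pos = sum(1 for p, gg in zip(y_pred, group) if gg == g and p == 1)
        return tp / pred_pos
    return ppv(1) - ppv(0)
```

A model can satisfy one of these criteria while violating the others, which is why the evaluation described above examines several metrics together.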
Affiliation(s)
- Yifan Liu
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Rochelle Joly
  - Department of Obstetrics and Gynecology, Weill Cornell Medicine, New York, NY 10065, United States
- Natalie Benda
  - Columbia University School of Nursing, New York, NY, United States
- Alison Hermann
  - Department of Psychiatry, Weill Cornell Medicine, New York, NY 10065, United States
- Ashley Beecy
  - Department of Medicine, Weill Cornell Medicine, New York, NY 10065, United States
  - NewYork-Presbyterian Hospital, New York, NY 10065, United States
- Jyotishman Pathak
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
  - Department of Psychiatry, Weill Cornell Medicine, New York, NY 10065, United States
- Yiye Zhang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
  - NewYork-Presbyterian Hospital, New York, NY 10065, United States
3.
Yu Z, Peng C, Yang X, Dang C, Adekkanattu P, Gopal Patra B, Peng Y, Pathak J, Wilson DL, Chang CY, Lo-Ciganic WH, George TJ, Hogan WR, Guo Y, Bian J, Wu Y. Identifying social determinants of health from clinical narratives: A study of performance, documentation ratio, and potential bias. J Biomed Inform 2024;153:104642. PMID: 38621641; PMCID: PMC11141428; DOI: 10.1016/j.jbi.2024.104642.
Abstract
OBJECTIVE: To develop a natural language processing (NLP) package to extract social determinants of health (SDoH) from clinical narratives, examine bias across race and gender groups, test the generalizability of SDoH extraction across disease groups, and examine population-level extraction ratios.
METHODS: We developed SDoH corpora using clinical notes identified at University of Florida (UF) Health. We systematically compared 7 transformer-based large language models (LLMs) and developed an open-source package, SODA (SOcial DeterminAnts), to facilitate SDoH extraction from clinical narratives. We examined the performance and potential bias of SODA for different race and gender groups, tested its generalizability across two disease domains (cancer and opioid use), and explored strategies for improvement. We applied SODA to extract 19 categories of SDoH from the breast (n = 7,971), lung (n = 11,804), and colorectal cancer (n = 6,240) cohorts to assess patient-level extraction ratios and examine differences among race and gender groups.
RESULTS: We developed an SDoH corpus using 629 clinical notes of cancer patients annotated with 13,193 SDoH concepts/attributes from 19 SDoH categories, and a cross-disease validation corpus using 200 notes from patients with opioid use annotated with 4,342 SDoH concepts/attributes. Among the 7 transformer models compared, GatorTron achieved the best mean average strict/lenient F1 scores of 0.9122 and 0.9367 for SDoH concept extraction and 0.9584 and 0.9593 for linking attributes to SDoH concepts. There was a small performance gap (~4%) between male and female groups but a large performance gap (>16%) among race groups. Performance dropped when we applied the cancer SDoH model to the opioid cohort; fine-tuning on a smaller opioid SDoH corpus improved it. The extraction ratio varied across the three cancer cohorts: 10 SDoH categories could be extracted from over 70% of cancer patients, while 9 could be extracted from fewer than 70%. Individuals from the White and Black groups had higher extraction ratios than other minority race groups.
CONCLUSIONS: Our SODA package achieved good performance in extracting 19 categories of SDoH from clinical narratives. The SODA package with pre-trained transformer models is available at https://github.com/uf-hobi-informatics-lab/SODA_Docker.
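The strict versus lenient F1 distinction reported above is standard in clinical concept extraction: strict scoring requires exact span boundaries, while lenient scoring credits any overlap with a gold span. A minimal sketch of that scoring logic (the span representation and matching rule are illustrative assumptions, not SODA's actual evaluation code):

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def span_f1(gold, pred, lenient=False):
    """gold, pred: lists of (start, end) character offsets of extracted concepts.
    strict: boundaries must match exactly; lenient: any character overlap counts."""
    def match(g, p):
        return (g[0] < p[1] and p[0] < g[1]) if lenient else g == p
    tp = sum(1 for p in pred if any(match(g, p) for g in gold))
    fn = sum(1 for g in gold if not any(match(g, p) for p in gold and [g] or pred for p in [p]))  # see below
    fn = sum(1 for g in gold if not any(match(g, p) for p in pred))
    fp = len(pred) - tp
    return f1_score(tp, fp, fn)

gold = [(0, 5), (10, 15)]
pred = [(0, 5), (11, 14), (20, 25)]
print(span_f1(gold, pred))                # strict: 0.4
print(span_f1(gold, pred, lenient=True))  # lenient: 0.8
```

Lenient F1 is always at least as high as strict F1, which matches the ordering of the scores reported in the abstract.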
Affiliation(s)
- Zehao Yu
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cheng Peng
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Xi Yang
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Chong Dang
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Prakash Adekkanattu
  - Information Technologies and Services, Weill Cornell Medicine, New York, NY, USA
- Braja Gopal Patra
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Jyotishman Pathak
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Debbie L Wilson
  - Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Ching-Yuan Chang
  - Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Wei-Hsuan Lo-Ciganic
  - Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL 32611, USA
- Thomas J George
  - Division of Hematology & Oncology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
- William R Hogan
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yi Guo
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yonghui Wu
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
  - Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
4.
Siddique SM, Tipton K, Leas B, Jepson C, Aysola J, Cohen JB, Flores E, Harhay MO, Schmidt H, Weissman GE, Fricke J, Treadwell JR, Mull NK. The Impact of Health Care Algorithms on Racial and Ethnic Disparities: A Systematic Review. Ann Intern Med 2024;177:484-496. PMID: 38467001; DOI: 10.7326/m23-2960.
Abstract
BACKGROUND: There is increasing concern about the potential impact of health care algorithms on racial and ethnic disparities.
PURPOSE: To examine the evidence on how health care algorithms and associated mitigation strategies affect racial and ethnic disparities.
DATA SOURCES: Several databases were searched for relevant studies published from 1 January 2011 to 30 September 2023.
STUDY SELECTION: Using predefined criteria and dual review, studies were screened and selected to determine: 1) the effect of algorithms on racial and ethnic disparities in health and health care outcomes, and 2) the effect of strategies or approaches to mitigate racial and ethnic bias in the development, validation, dissemination, and implementation of algorithms.
DATA EXTRACTION: Outcomes of interest (that is, access to health care, quality of care, and health outcomes) were extracted, with risk-of-bias assessment using the ROBINS-I (Risk Of Bias In Non-randomised Studies - of Interventions) tool and an adapted CARE-CPM (Critical Appraisal for Racial and Ethnic Equity in Clinical Prediction Models) equity extension.
DATA SYNTHESIS: Sixty-three studies (51 modeling, 4 retrospective, 2 prospective, 5 prepost studies, and 1 randomized controlled trial) were included. Algorithms were found, with heterogeneous evidence, to: a) reduce disparities (for example, the revised kidney allocation system), b) perpetuate or exacerbate disparities (for example, severity-of-illness scores applied to critical care resource allocation), and/or c) have no statistically significant effect on select outcomes (for example, the HEART Pathway [history, electrocardiogram, age, risk factors, and troponin]). To mitigate disparities, 7 strategies were identified: removing an input variable, replacing a variable, adding race, adding a non-race-based variable, changing the racial and ethnic composition of the population used in model development, creating separate thresholds for subpopulations, and modifying algorithmic analytic techniques.
LIMITATION: Results are mostly based on modeling studies and may be highly context-specific.
CONCLUSION: Algorithms can mitigate, perpetuate, or exacerbate racial and ethnic disparities, regardless of the explicit use of race and ethnicity, but the evidence is heterogeneous. The intentionality and implementation of an algorithm can affect its impact on disparities, and there may be tradeoffs in outcomes.
PRIMARY FUNDING SOURCE: Agency for Healthcare Research and Quality.
Affiliation(s)
- Shazia Mehmood Siddique
  - Division of Gastroenterology, University of Pennsylvania; Leonard Davis Institute of Health Economics, University of Pennsylvania; and Center for Evidence-Based Practice, Penn Medicine, Philadelphia, Pennsylvania (S.M.S.)
- Kelley Tipton
  - ECRI-Penn Medicine Evidence-based Practice Center, ECRI, Plymouth Meeting, Pennsylvania (K.T., C.J., J.R.T.)
- Brian Leas
  - Center for Evidence-Based Practice, Penn Medicine, Philadelphia, Pennsylvania (B.L., E.F., J.F.)
- Christopher Jepson
  - ECRI-Penn Medicine Evidence-based Practice Center, ECRI, Plymouth Meeting, Pennsylvania (K.T., C.J., J.R.T.)
- Jaya Aysola
  - Leonard Davis Institute of Health Economics, University of Pennsylvania; Division of General Internal Medicine, University of Pennsylvania; and Penn Medicine Center for Health Equity Advancement, Penn Medicine, Philadelphia, Pennsylvania (J.A.)
- Jordana B Cohen
  - Division of Renal-Electrolyte and Hypertension, University of Pennsylvania; and Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania (J.B.C.)
- Emilia Flores
  - Center for Evidence-Based Practice, Penn Medicine, Philadelphia, Pennsylvania (B.L., E.F., J.F.)
- Michael O Harhay
  - Leonard Davis Institute of Health Economics, University of Pennsylvania; Center for Evidence-Based Practice, Penn Medicine; Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania; and Division of Pulmonary and Critical Care, University of Pennsylvania, Philadelphia, Pennsylvania (M.O.H.)
- Harald Schmidt
  - Department of Medical Ethics & Health Policy, University of Pennsylvania, Philadelphia, Pennsylvania (H.S.)
- Gary E Weissman
  - Leonard Davis Institute of Health Economics, University of Pennsylvania; Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania; and Division of Pulmonary and Critical Care, University of Pennsylvania, Philadelphia, Pennsylvania (G.E.W.)
- Julie Fricke
  - Center for Evidence-Based Practice, Penn Medicine, Philadelphia, Pennsylvania (B.L., E.F., J.F.)
- Jonathan R Treadwell
  - ECRI-Penn Medicine Evidence-based Practice Center, ECRI, Plymouth Meeting, Pennsylvania (K.T., C.J., J.R.T.)
- Nikhil K Mull
  - Center for Evidence-Based Practice, Penn Medicine; and Division of Hospital Medicine, University of Pennsylvania, Philadelphia, Pennsylvania (N.K.M.)
5.
Mashima Y, Tanigawa M, Yokoi H. Information heterogeneity between progress notes by physicians and nurses for inpatients with digestive system diseases. Sci Rep 2024;14:7656. PMID: 38561333; PMCID: PMC10984979; DOI: 10.1038/s41598-024-56324-7.
Abstract
This study focused on the heterogeneity in progress notes written by physicians or nurses. A total of 806 days of progress notes written by physicians or nurses from 83 randomly selected patients hospitalized in the Gastroenterology Department at Kagawa University Hospital from January to December 2021 were analyzed. We extracted symptoms as the International Classification of Diseases (ICD) Chapter 18 (R00-R99, hereinafter R codes) from each progress note using MedNER-J natural language processing software and counted the days one or more symptoms were extracted to calculate the extraction rate. The R-code extraction rate was significantly higher from progress notes by nurses than by physicians (physicians 68.5% vs. nurses 75.2%; p = 0.00112), regardless of specialty. By contrast, the R-code subcategory R10-R19 for digestive system symptoms (44.2 vs. 37.5%, respectively; p = 0.00299) and many chapters of ICD codes for disease names, as represented by Chapter 11 K00-K93 (68.4 vs. 30.9%, respectively; p < 0.001), were frequently extracted from the progress notes by physicians, reflecting their specialty. We believe that understanding the information heterogeneity of medical documents, which can be the basis of medical artificial intelligence, is crucial, and this study is a pioneering step in that direction.
Affiliation(s)
- Yukinori Mashima
  - Clinical Research Support Center, Kagawa University Hospital, 1750-1 Ikenobe, Miki-cho, Kita-gun, Kagawa, 761-0793, Japan
  - Department of Medical Informatics, Faculty of Medicine, Kagawa University, Kagawa, Japan
- Masatoshi Tanigawa
  - Clinical Research Support Center, Kagawa University Hospital, 1750-1 Ikenobe, Miki-cho, Kita-gun, Kagawa, 761-0793, Japan
- Hideto Yokoi
  - Clinical Research Support Center, Kagawa University Hospital, 1750-1 Ikenobe, Miki-cho, Kita-gun, Kagawa, 761-0793, Japan
  - Department of Medical Informatics, Faculty of Medicine, Kagawa University, Kagawa, Japan
6.
Huang Y, Guo J, Chen WH, Lin HY, Tang H, Wang F, Xu H, Bian J. A scoping review of fair machine learning techniques when using real-world data. J Biomed Inform 2024;151:104622. PMID: 38452862; PMCID: PMC11146346; DOI: 10.1016/j.jbi.2024.104622.
Abstract
OBJECTIVE: The integration of artificial intelligence (AI) and machine learning (ML) into health care to aid clinical decisions is widespread. However, as AI and ML take on important roles in health care, there are concerns about associated fairness and bias. An AI tool may have a disparate impact, with its benefits and drawbacks unevenly distributed across societal strata and subpopulations, potentially exacerbating existing health inequities. The objectives of this scoping review were therefore to summarize existing literature and identify gaps in tackling algorithmic bias and optimizing fairness in AI/ML models using real-world data (RWD) in health care domains.
METHODS: We conducted a thorough review of techniques for assessing and optimizing AI/ML model fairness when using RWD in health care domains. The focus lies on appraising quantification metrics for assessing fairness, publicly accessible datasets for ML fairness research, and bias mitigation approaches.
RESULTS: We identified 11 papers focused on optimizing model fairness in health care applications. Current research on mitigating bias in RWD is limited, both in the variety of diseases and health care applications covered and in the accessibility of public datasets for ML fairness research. Existing studies often report positive outcomes when using pre-processing techniques to address algorithmic bias. Unresolved questions remain that require further research, including pinpointing the root causes of bias in ML models, broadening fairness research in AI/ML using RWD and exploring its implications in health care settings, and evaluating and addressing bias in multi-modal data.
CONCLUSION: This paper provides useful reference material and insights for researchers regarding AI/ML fairness in real-world health care data and reveals the gaps in the field. Fair AI/ML in health care is a burgeoning field that requires a heightened research focus to cover diverse applications and different types of RWD.
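Among the pre-processing techniques this kind of review covers, reweighing is one of the most commonly cited. A minimal sketch of the standard Kamiran-Calders formulation, in which each (group, label) cell is weighted so that group membership and outcome become statistically independent (this is the textbook method, not code from any reviewed study):

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Kamiran-Calders reweighing: per-instance weight
    w(g, y) = P(g) * P(y) / P(g, y), computed from empirical frequencies.
    Training on these weights removes the observed group-outcome association."""
    n = len(labels)
    count_g = Counter(groups)
    count_y = Counter(labels)
    count_gy = Counter(zip(groups, labels))
    return [
        (count_g[g] / n) * (count_y[y] / n) / (count_gy[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]

# If outcomes are already balanced across groups, every weight is 1.0:
print(reweighing_weights([0, 0, 1, 1], [1, 0, 1, 0]))  # [1.0, 1.0, 1.0, 1.0]
```

The resulting weights can be passed to any learner that accepts per-sample weights (for example, a `sample_weight` argument), which is what makes this a pre-processing rather than an in-processing approach.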
Affiliation(s)
- Yu Huang
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
- Jingchuan Guo
  - Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA
- Wei-Han Chen
  - Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA
- Hsin-Yueh Lin
  - Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA
- Huilin Tang
  - Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
  - Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Hua Xu
  - Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, USA
- Jiang Bian
  - Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
7.
Davenport MA, Sirrianni JW, Chisolm DJ. Machine learning data sources in pediatric sleep research: assessing racial/ethnic differences in electronic health record-based clinical notes prior to model training. Frontiers in Sleep 2024;3:1271167. PMID: 38817450; PMCID: PMC11138315; DOI: 10.3389/frsle.2024.1271167.
Abstract
Introduction: Pediatric sleep problems can be detected across racial/ethnic subpopulations in primary care settings. However, the electronic health record (EHR) documentation that describes patients' sleep problems may be inherently biased due to both historical biases and informed presence. This study assessed racial/ethnic differences in natural language processing (NLP) training data (e.g., pediatric sleep-related keywords in primary care clinical notes) prior to model training.
Methods: We used a predefined keyword feature set containing 178 Peds B-SATED keywords. We then queried all clinical notes from patients aged 5 to 18 seen in pediatric primary care from January 2018 to December 2021. A least absolute shrinkage and selection operator (LASSO) regression model was used to investigate whether there were racial/ethnic differences in the documentation of Peds B-SATED keywords. Mixed-effects logistic regression was then used to determine whether the odds of the presence of global Peds B-SATED dimensions also differed across racial/ethnic subpopulations.
Results: Using both the LASSO and multilevel modeling approaches, we found racial/ethnic differences in providers' documentation of Peds B-SATED keywords and global dimensions. In addition, the rankings of the most frequently documented Peds B-SATED keywords differed qualitatively across racial/ethnic subpopulations.
Conclusion: This study revealed differential patterns in providers' documentation of Peds B-SATED keywords and global dimensions that may account for the under-detection of pediatric sleep problems among racial/ethnic subpopulations. These findings have important implications for the equitable clinical documentation of sleep problems in pediatric primary care settings and extend prior retrospective work in pediatric sleep specialty settings.
Affiliation(s)
- Mattina A. Davenport
  - Abigail Wexner Research Institute, Center for Child Health Equity and Outcomes Research, Nationwide Children’s Hospital, Columbus, OH, United States
  - Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, OH, United States
- Joseph W. Sirrianni
  - Abigail Wexner Research Institute, IT Research and Innovation, Nationwide Children’s Hospital, Columbus, OH, United States
- Deena J. Chisolm
  - Abigail Wexner Research Institute, Center for Child Health Equity and Outcomes Research, Nationwide Children’s Hospital, Columbus, OH, United States
  - Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, OH, United States
8.
Cary MP, Zink A, Wei S, Olson A, Yan M, Senior R, Bessias S, Gadhoumi K, Jean-Pierre G, Wang D, Ledbetter LS, Economou-Zavlanos NJ, Obermeyer Z, Pencina MJ. Mitigating Racial And Ethnic Bias And Advancing Health Equity In Clinical Algorithms: A Scoping Review. Health Aff (Millwood) 2023;42:1359-1368. PMID: 37782868; PMCID: PMC10668606; DOI: 10.1377/hlthaff.2023.00553.
Abstract
In August 2022 the Department of Health and Human Services (HHS) issued a notice of proposed rulemaking prohibiting covered entities, which include health care providers and health plans, from discriminating against individuals when using clinical algorithms in decision making. However, HHS did not provide specific guidelines on how covered entities should prevent discrimination. We conducted a scoping review of literature published during the period 2011-22 to identify health care applications, frameworks, reviews and perspectives, and assessment tools that identify and mitigate bias in clinical algorithms, with a specific focus on racial and ethnic bias. Our scoping review encompassed 109 articles comprising 45 empirical health care applications that included tools tested in health care settings, 16 frameworks, and 48 reviews and perspectives. We identified a wide range of technical, operational, and systemwide bias mitigation strategies for clinical algorithms, but there was no consensus in the literature on a single best practice that covered entities could employ to meet the HHS requirements. Future research should identify optimal bias mitigation methods for various scenarios, depending on factors such as patient population, clinical setting, algorithm design, and types of bias to be addressed.
Affiliation(s)
- Michael P Cary
  - Michael P. Cary Jr., Duke University, Durham, North Carolina
- Anna Zink
  - Anna Zink, University of Chicago, Chicago, Illinois
- Sijia Wei
  - Sijia Wei, Northwestern University, Chicago, Illinois
- Ziad Obermeyer
  - Ziad Obermeyer, University of California Berkeley, Berkeley, California
9.
Xie K, Ojemann WKS, Gallagher RS, Lucas A, Hill CE, Hamilton RH, Johnson KB, Roth D, Litt B, Ellis CA. Disparities in seizure outcomes revealed by large language models. medRxiv [Preprint] 2023:2023.09.20.23295842. PMID: 37790442; PMCID: PMC10543059; DOI: 10.1101/2023.09.20.23295842.
Abstract
Objective: Large language models (LLMs) in health care have the potential to propagate existing biases or introduce new ones. For people with epilepsy, social determinants of health are associated with disparities in access to care, but their impact on seizure outcomes among those with access to specialty care remains unclear. Here we (1) evaluated our validated, epilepsy-specific LLM for intrinsic bias, and (2) used LLM-extracted seizure outcomes to test the hypothesis that different demographic groups have different seizure outcomes.
Methods: First, we tested our LLM for intrinsic bias in the form of differential performance across demographic groups defined by race, ethnicity, sex, income, and health insurance, using manually annotated notes. Next, we used LLM-classified seizure freedom at each office visit to test for outcome disparities in the same demographic groups, using univariable and multivariable analyses.
Results: We analyzed 84,675 clinic visits from 25,612 patients seen at our epilepsy center from 2005 to 2022. We found no differences in the accuracy, or positive or negative class balance, of outcome classifications across demographic groups. Multivariable analysis indicated worse seizure outcomes for female patients (OR 1.33, p = 3×10^-8), those with public insurance (OR 1.53, p = 2×10^-13), and those from lower-income zip codes (OR ≥ 1.22, p ≤ 6.6×10^-3). Black patients had worse outcomes than White patients in univariable but not multivariable analysis (OR 1.03, p = 0.66).
Significance: We found no evidence that our LLM was intrinsically biased against any demographic group. Seizure freedom extracted by the LLM revealed disparities in seizure outcomes across several demographic groups. These findings highlight the critical need to reduce disparities in the care of people with epilepsy.
Affiliation(s)
- Kevin Xie
- University of Pennsylvania, Dept. of Bioengineering, Philadelphia, PA, USA
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- William K S Ojemann
- University of Pennsylvania, Dept. of Bioengineering, Philadelphia, PA, USA
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- Ryan S Gallagher
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Neurology, Philadelphia, PA, USA
- Alfredo Lucas
- University of Pennsylvania, Dept. of Bioengineering, Philadelphia, PA, USA
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Neurology, Philadelphia, PA, USA
- Chloé E Hill
- University of Michigan, Dept. of Neurology, Ann Arbor, MI, USA
- Roy H Hamilton
- University of Pennsylvania, Dept. of Neurology, Philadelphia, PA, USA
- Kevin B Johnson
- University of Pennsylvania, Dept. of Bioengineering, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Biostatistics, Epidemiology and Informatics, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Computer and Information Science, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Pediatrics, Philadelphia, PA, USA
- Dan Roth
- University of Pennsylvania, Dept. of Computer and Information Science, Philadelphia, PA, USA
- Brian Litt
- University of Pennsylvania, Dept. of Bioengineering, Philadelphia, PA, USA
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Neurology, Philadelphia, PA, USA
- Colin A Ellis
- University of Pennsylvania, Center for Neuroengineering and Therapeutics, Philadelphia, PA, USA
- University of Pennsylvania, Dept. of Neurology, Philadelphia, PA, USA
10
Abstract
OBJECTIVES Through a scoping review, we examine the ways health equity has been promoted in clinical research informatics with patient implications, focusing on work published in 2021 (with some from 2022). METHOD A scoping review was conducted using the methods described in the Joanna Briggs Institute Manual. The review process consisted of five stages: 1) development of the aim and research question, 2) literature search, 3) literature screening and selection, 4) data extraction, and 5) accumulation and reporting of results. RESULTS Of the 478 papers identified from 2021 on the topic of clinical research informatics with a focus on health equity as a patient implication, 8 papers met our inclusion criteria. All included papers focused on artificial intelligence (AI) technology. The papers addressed health equity in clinical research informatics either by exposing inequity in AI-based solutions or by using AI as a tool for promoting health equity in the delivery of healthcare services. While algorithmic bias poses a risk to health equity within AI-based solutions, AI has also uncovered inequity in traditional treatment and demonstrated effective complements and alternatives that promote health equity. CONCLUSIONS Clinical research informatics with implications for patients still faces challenges of an ethical nature and of clinical value. However, used prudently, for the right purpose in the right context, clinical research informatics could provide powerful tools for advancing health equity in patient care.
11
Ge Y, Guo Y, Das S, Al-Garadi MA, Sarker A. Few-shot learning for medical text: A review of advances, trends, and opportunities. J Biomed Inform 2023; 144:104458. [PMID: 37488023 PMCID: PMC10940971 DOI: 10.1016/j.jbi.2023.104458] [Received: 03/21/2023] [Revised: 06/19/2023] [Accepted: 07/19/2023] [Indexed: 07/26/2023]
Abstract
BACKGROUND Few-shot learning (FSL) is a class of machine learning methods that require small numbers of labeled instances for training. With many medical topics having limited annotated text-based data in practical settings, FSL-based natural language processing (NLP) holds substantial promise. We aimed to conduct a review to explore the current state of FSL methods for medical NLP. METHODS We searched for articles published between January 2016 and October 2022 using PubMed/Medline, Embase, ACL Anthology, and IEEE Xplore Digital Library. We also searched preprint servers (e.g., arXiv, medRxiv, and bioRxiv) via Google Scholar to identify the latest relevant methods. We included all articles that involved FSL and any form of medical text. We abstracted articles based on the data source, target task, training set size, primary method(s)/approach(es), and evaluation metric(s). RESULTS Fifty-one articles met our inclusion criteria; all were published after 2018, and most since 2020 (42/51; 82%). Concept extraction/named entity recognition was the most frequently addressed task (21/51; 41%), followed by text classification (16/51; 31%). Thirty-two (61%) articles reconstructed existing datasets to fit few-shot scenarios, and MIMIC-III was the most frequently used dataset (10/51; 20%). Seventy-seven percent of the articles attempted to incorporate prior knowledge to augment the small datasets available for training. Common methods included FSL with attention mechanisms (20/51; 39%), prototypical networks (11/51; 22%), meta-learning (7/51; 14%), and prompt-based learning methods, the latter being particularly popular since 2021. Benchmarking experiments demonstrated relative underperformance of FSL methods on biomedical NLP tasks. CONCLUSION Despite the potential for FSL in biomedical NLP, progress has been limited. This may be attributed to the rarity of specialized data, the lack of standardized evaluation criteria, and the underperformance of FSL methods on biomedical topics. The creation of publicly available specialized datasets for biomedical FSL may aid method development by facilitating comparative analyses.
Affiliation(s)
- Yao Ge
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States of America
- Yuting Guo
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States of America
- Sudeshna Das
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States of America
- Mohammed Ali Al-Garadi
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Vanderbilt University, Nashville, TN, United States of America
- Abeed Sarker
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA, United States of America; Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, United States of America
12
Huang D, Cogill S, Hsia RY, Yang S, Kim D. Development and external validation of a pretrained deep learning model for the prediction of non-accidental trauma. NPJ Digit Med 2023; 6:131. [PMID: 37468526 DOI: 10.1038/s41746-023-00875-y] [Received: 11/03/2022] [Accepted: 07/07/2023] [Indexed: 07/21/2023]
Abstract
Non-accidental trauma (NAT) is deadly and difficult to predict. Transformer models pretrained on large datasets have recently produced state-of-the-art performance on diverse prediction tasks, but the optimal pretraining strategies for diagnostic predictions are not known. Here we report the development and external validation of Pretrained and Adapted BERT for Longitudinal Outcomes (PABLO), a transformer-based deep learning model with multitask clinical pretraining, to identify patients who will receive a diagnosis of NAT in the next year. We develop a clinical interface to visualize patient trajectories, model predictions, and individual risk factors. In two comprehensive statewide databases, approximately 1% of patients experience NAT within one year of prediction. PABLO predicts NAT events with an area under the receiver operating characteristic curve (AUROC) of 0.844 (95% CI 0.838-0.851) in the California test set, and 0.849 (95% CI 0.846-0.851) on external validation in Florida, outperforming comparator models. Multitask pretraining significantly improves model performance. Attribution analysis shows substance use, psychiatric, and injury diagnoses, in the context of age and racial demographics, as influential predictors of NAT. As a clinical decision support system, PABLO can identify high-risk patients and patient-specific risk factors, which can be used to target secondary screening and preventive interventions at the point of care.
Affiliation(s)
- David Huang
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Renee Y Hsia
- Department of Emergency Medicine, UCSF School of Medicine, San Francisco, CA, USA
- Samuel Yang
- Department of Emergency Medicine, Stanford University, Stanford, CA, USA
- David Kim
- Department of Emergency Medicine, Stanford University, Stanford, CA, USA
13
Banda JM, Shah NH, Periyakoil VS. Characterizing subgroup performance of probabilistic phenotype algorithms within older adults: a case study for dementia, mild cognitive impairment, and Alzheimer's and Parkinson's diseases. JAMIA Open 2023; 6:ooad043. [PMID: 37397506 PMCID: PMC10307941 DOI: 10.1093/jamiaopen/ooad043] [Received: 03/30/2023] [Revised: 06/06/2023] [Accepted: 06/22/2023] [Indexed: 07/04/2023]
Abstract
Objective Biases within probabilistic electronic phenotyping algorithms are largely unexplored. In this work, we characterize differences in subgroup performance of phenotyping algorithms for Alzheimer's disease and related dementias (ADRD) in older adults. Materials and methods We created an experimental framework to characterize the performance of probabilistic phenotyping algorithms under different racial distributions, allowing us to identify which algorithms may have differential performance, by how much, and under what conditions. We relied on rule-based phenotype definitions as a reference to evaluate probabilistic phenotype algorithms created using the Automated PHenotype Routine for Observational Definition, Identification, Training and Evaluation framework. Results We demonstrate that some algorithms have performance variations of anywhere from 3% to 30% across populations, even when race is not used as an input variable. We show that while performance differences in subgroups are not present for all phenotypes, they do affect some phenotypes and groups disproportionately more than others. Discussion Our analysis establishes the need for a robust evaluation framework for subgroup differences. The underlying patient populations for the algorithms showing subgroup performance differences have great variance in model features when compared with the phenotypes showing little to no differences. Conclusion We have created a framework to identify systematic differences in the performance of probabilistic phenotyping algorithms, using ADRD as a use case. Differences in subgroup performance of probabilistic phenotyping algorithms are neither widespread nor consistent. This highlights the great need for careful ongoing monitoring to evaluate, measure, and mitigate such differences.
Affiliation(s)
- Juan M Banda
- Corresponding Author: Juan M. Banda, PhD, Department of Computer Science, College of Arts and Sciences, Georgia State University, 25 Park Place, Suite 752, Atlanta, GA 30303, USA
- Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Stanford, California, USA
- Vyjeyanthi S Periyakoil
- Stanford Department of Medicine, Palo Alto, California, USA
- VA Palo Alto Health Care System, Palo Alto, California, USA
14
Malerbi FK, Nakayama LF, Gayle Dychiao R, Zago Ribeiro L, Villanueva C, Celi LA, Regatieri CV. Digital Education for the Deployment of Artificial Intelligence in Health Care. J Med Internet Res 2023; 25:e43333. [PMID: 37347537 PMCID: PMC10337407 DOI: 10.2196/43333] [Received: 10/08/2022] [Revised: 01/19/2023] [Accepted: 04/05/2023] [Indexed: 06/23/2023]
Abstract
Artificial intelligence (AI) represents a significant milestone in health care's digital transformation. However, traditional health care education and training often lack digital competencies. To promote safe and effective AI implementation, health care professionals must acquire basic knowledge of machine learning and neural networks, critical evaluation of data sets, integration within clinical workflows, bias control, and human-machine interaction in clinical settings. Additionally, they should understand the legal and ethical aspects of digital health care and the impact of AI adoption. Misconceptions and fears about AI systems could jeopardize their real-life implementation. However, there are multiple barriers to promoting electronic health literacy, including time constraints, overburdened curricula, and the shortage of qualified professionals. To overcome these challenges, partnerships among developers, professional societies, and academia are essential. Integrating specialists from different backgrounds, including data specialists, lawyers, and social scientists, can significantly contribute to combating digital illiteracy and promoting safe AI implementation in health care.
Affiliation(s)
- Luis Filipe Nakayama
- Ophthalmology Department, Sao Paulo Federal University, Sao Paulo, Brazil
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, United States
- Lucas Zago Ribeiro
- Ophthalmology Department, Sao Paulo Federal University, Sao Paulo, Brazil
- Cleva Villanueva
- Escuela Superior de Medicina, Instituto Politecnico Nacional, Mexico City, Mexico
- Leo Anthony Celi
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, United States
- Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, United States
15
Khor S, Heagerty PJ, Basu A, Haupt EC, Lyons LJL, Hahn EE, Bansal A. Racial Disparities in the Ascertainment of Cancer Recurrence in Electronic Health Records. JCO Clin Cancer Inform 2023; 7:e2300004. [PMID: 37267516 PMCID: PMC10530597 DOI: 10.1200/cci.23.00004] [Received: 01/13/2023] [Revised: 03/20/2023] [Accepted: 04/05/2023] [Indexed: 06/04/2023]
Abstract
PURPOSE There is growing interest in using computable phenotypes or proxies to identify important clinical outcomes, such as cancer recurrence, in rich electronic health records data. However, the race/ethnicity-specific accuracies of these proxies remain unclear. We examined whether the accuracy of a proxy for colorectal cancer (CRC) recurrence differed by race/ethnicity and the possible mechanisms that drove the differences. METHODS Using data from a large integrated health care system, we identified a stratified random sample of 282 Black/African American (AA), Hispanic, and non-Hispanic White (NHW) patients with CRC who received primary treatment. Patient 5-year recurrence status was estimated using a utilization-based proxy and evaluated, by race/ethnicity, against the true recurrence status obtained through detailed chart review. We used covariate-adjusted probit regression models to estimate the associations between race/ethnicity and misclassification. RESULTS The recurrence proxy had excellent overall accuracy (positive predictive value [PPV] 89.4%; negative predictive value 96.5%; mean difference in timing 1.96 months); however, accuracy varied by race/ethnicity. Compared with NHW patients, PPV was 14.9% lower (95% CI, 2.53 to 28.6) among Hispanic patients and 4.3% lower (95% CI, -4.8 to 14.8) among Black/AA patients. The proxy disproportionately inflated the 5-year recurrence incidence for Hispanic patients by 10.6% (95% CI, 4.2 to 18.2). Compared with NHW patients, proxy recurrences for Hispanic patients were almost three times as likely to have been misclassified as positive (adjusted risk ratio 2.91 [95% CI, 1.21 to 8.31]). Higher false positive rates among racial/ethnic minorities may be related to a higher prevalence of noncancerous lung-related problems and substantial delays in primary treatment because of insufficient patient-provider communication and abnormal treatment patterns. CONCLUSION Using a proxy with worse accuracy among racial/ethnic minority patients to estimate population health may misdirect resources and support erroneous conclusions about treatment benefit for these patients.
Affiliation(s)
- Sara Khor
- Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Anirban Basu
- Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, University of Washington, Seattle, WA
- Eric C. Haupt
- Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA
- Lindsay Joe L. Lyons
- Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA
- Erin E. Hahn
- Department of Research and Evaluation, Kaiser Permanente Southern California, Pasadena, CA
- Aasthaa Bansal
- Comparative Health Outcomes, Policy, and Economics (CHOICE) Institute, University of Washington, Seattle, WA
16
Booth GJ, Ross B, Cronin WA, McElrath A, Cyr KL, Hodgson JA, Sibley C, Ismawan JM, Zuehl A, Slotto JG, Higgs M, Haldeman M, Geiger P, Jardine D. Competency-Based Assessments: Leveraging Artificial Intelligence to Predict Subcompetency Content. Acad Med 2023; 98:497-504. [PMID: 36477379 DOI: 10.1097/acm.0000000000005115] [Indexed: 06/17/2023]
Abstract
PURPOSE Faculty feedback on trainees is critical to guiding trainee progress in a competency-based medical education framework. The authors aimed to develop and evaluate a natural language processing (NLP) algorithm that automatically categorizes narrative feedback into corresponding Accreditation Council for Graduate Medical Education Milestone 2.0 subcompetencies. METHOD Ten academic anesthesiologists analyzed 5,935 narrative evaluations of anesthesiology trainees at 4 graduate medical education (GME) programs between July 1, 2019, and June 30, 2021. Each sentence (n = 25,714) was labeled with the Milestone 2.0 subcompetency that best captured its content or was labeled as demographic or not useful. Inter-rater agreement was assessed by Fleiss' kappa. The authors trained an NLP model to predict feedback subcompetencies using data from 3 sites and evaluated its performance at a fourth site. Performance metrics included area under the receiver operating characteristic curve (AUC), positive predictive value, sensitivity, F1, and calibration curves. The model was implemented at 1 site in a self-assessment exercise. RESULTS Fleiss' kappa for subcompetency agreement was moderate (0.44). Model performance was good for professionalism, interpersonal and communication skills, and practice-based learning and improvement (AUC 0.79, 0.79, and 0.75, respectively). Subcompetencies within medical knowledge and patient care ranged from fair to excellent (AUC 0.66-0.84 and 0.63-0.88, respectively). Performance for systems-based practice was poor (AUC 0.59). Performance for the demographic and not useful categories was excellent (AUC 0.87 for both). In approximately 1 minute, the model interpreted several hundred evaluations and produced individual trainee reports with organized feedback to guide a self-assessment exercise. The model was built into a web-based application. CONCLUSIONS The authors developed an NLP model that recognized the feedback language of anesthesiologists across multiple GME programs. The model was operationalized in a self-assessment exercise. It is a powerful tool that rapidly organizes large amounts of narrative feedback.
Affiliation(s)
- Gregory J Booth
- G.J. Booth is assistant professor, Uniformed Services University of the Health Sciences, and residency program director, Department of Anesthesiology and Pain Medicine, Naval Medical Center Portsmouth, Portsmouth, Virginia
- Benjamin Ross
- William A Cronin
- Angela McElrath
- Kyle L Cyr
- John A Hodgson
- Charles Sibley
- J Martin Ismawan
- Alyssa Zuehl
- James G Slotto
- Maureen Higgs
- Matthew Haldeman
- Phillip Geiger
- Dink Jardine
17
Yang S, Varghese P, Stephenson E, Tu K, Gronsbell J. Machine learning approaches for electronic health records phenotyping: a methodical review. J Am Med Inform Assoc 2023; 30:367-381. [PMID: 36413056 PMCID: PMC9846699 DOI: 10.1093/jamia/ocac216] [Received: 07/28/2022] [Revised: 09/27/2022] [Accepted: 10/27/2022] [Indexed: 11/23/2022]
Abstract
OBJECTIVE Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. MATERIALS AND METHODS We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. RESULTS Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. DISCUSSION Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. CONCLUSION Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Affiliation(s)
- Siyue Yang
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Ellen Stephenson
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Karen Tu
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Jessica Gronsbell
- Department of Statistical Sciences, University of Toronto, Toronto, Ontario, Canada
- Department of Family & Community Medicine, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
18
Davoudi A, Sajdeya R, Ison R, Hagen J, Rashidi P, Price CC, Tighe PJ. Fairness in the prediction of acute postoperative pain using machine learning models. Front Digit Health 2023; 4:970281. [PMID: 36714611 PMCID: PMC9874861 DOI: 10.3389/fdgth.2022.970281] [Received: 06/15/2022] [Accepted: 10/24/2022] [Indexed: 01/12/2023]
Abstract
Introduction The overall performance of machine learning-based prediction models is promising; however, their generalizability and fairness must be vigorously investigated to ensure they perform sufficiently well for all patients. Objective This study aimed to evaluate prediction bias in machine learning models used for predicting acute postoperative pain. Methods We conducted a retrospective review of electronic health records for patients undergoing orthopedic surgery from June 1, 2011, to June 30, 2019, at the University of Florida Health system/Shands Hospital. CatBoost machine learning models were trained to predict the binary outcome of low (≤4) and high (>4) pain. Model biases were assessed against seven protected attributes: age, sex, race, area deprivation index (ADI), spoken language, health literacy, and insurance type. Reweighing of protected attributes was investigated for reducing model bias compared with base models. The fairness metrics of equal opportunity, predictive parity, predictive equality, statistical parity, and overall accuracy equality were examined. Results The final dataset included 14,263 patients [age: 60.72 (16.03) years, 53.87% female, 39.13% low acute postoperative pain]. The machine learning model (area under the curve, 0.71) was biased in terms of age, race, ADI, and insurance type, but not in terms of sex, language, and health literacy. Despite promising overall performance in predicting acute postoperative pain, machine learning-based prediction models may be biased with respect to protected attributes. Conclusion These findings show the need to evaluate fairness in machine learning models involved in perioperative pain before they are implemented as clinical decision support tools.
Affiliation(s)
- Anis Davoudi
- Department of Anesthesiology, University of Florida College of Medicine, Gainesville, FL, United States
- Ruba Sajdeya
- Department of Epidemiology, University of Florida College of Public Health and Health Professions, Gainesville, FL, United States
- Ron Ison
- Department of Anesthesiology, University of Florida College of Medicine, Gainesville, FL, United States
- Jennifer Hagen
- Department of Orthopedic Surgery, University of Florida College of Medicine, Gainesville, FL, United States
- Parisa Rashidi
- Department of Biomedical Engineering, University of Florida Herbert Wertheim College of Engineering, Gainesville, FL, United States
- Catherine C. Price
- Department of Anesthesiology, University of Florida College of Medicine, Gainesville, FL, United States
- Department of Clinical and Health Psychology, University of Florida College of Public Health and Health Professions, Gainesville, FL, United States
- Patrick J. Tighe
- Department of Anesthesiology, University of Florida College of Medicine, Gainesville, FL, United States
19
Thompson HM, Sharma B, Smith DL, Bhalla S, Erondu I, Hazra A, Ilyas Y, Pachwicewicz P, Sheth NK, Chhabra N, Karnik NS, Afshar M. Machine Learning Techniques to Explore Clinical Presentations of COVID-19 Severity and to Test the Association With Unhealthy Opioid Use: Retrospective Cross-sectional Cohort Study. JMIR Public Health Surveill 2022; 8:e38158. [PMID: 36265163 PMCID: PMC9746674 DOI: 10.2196/38158] [Received: 03/21/2022] [Revised: 05/23/2022] [Accepted: 10/18/2022] [Indexed: 11/07/2022]
Abstract
BACKGROUND The COVID-19 pandemic has exacerbated health inequities in the United States. People with unhealthy opioid use (UOU) may face disproportionate challenges with COVID-19 precautions, and the pandemic has disrupted access to opioids and UOU treatments. UOU impairs the immunological, cardiovascular, pulmonary, renal, and neurological systems and may increase severity of outcomes for COVID-19. OBJECTIVE We applied machine learning techniques to explore clinical presentations of hospitalized patients with UOU and COVID-19 and to test the association between UOU and COVID-19 disease severity. METHODS This retrospective, cross-sectional cohort study was conducted based on data from 4110 electronic health record patient encounters at an academic health center in Chicago between January 1, 2020, and December 31, 2020. The inclusion criterion was an unplanned admission of a patient aged ≥18 years; encounters were counted as COVID-19-positive if there was a positive test for COVID-19 or 2 COVID-19 International Classification of Disease, Tenth Revision codes. Using a predefined cutoff with optimal sensitivity and specificity to identify UOU, we ran a machine learning UOU classifier on the data for patients with COVID-19 to estimate the subcohort of patients with UOU. Topic modeling was used to explore and compare the clinical presentations documented for 2 subgroups: encounters with UOU and COVID-19 and those with no UOU and COVID-19. Mixed effects logistic regression accounted for multiple encounters for some patients and tested the association between UOU and COVID-19 outcome severity. Severity was measured with 3 utilization metrics: low-severity unplanned admission, medium-severity unplanned admission and receiving mechanical ventilation, and high-severity unplanned admission with in-hospital death. All models controlled for age, sex, race/ethnicity, insurance status, and BMI. 
RESULTS Topic modeling yielded 10 topics per subgroup and highlighted unique comorbidities associated with UOU and COVID-19 (eg, HIV) and no UOU and COVID-19 (eg, diabetes). In the regression analysis, each incremental increase in the classifier's predicted probability of UOU was associated with 1.16 higher odds of COVID-19 outcome severity (odds ratio 1.16, 95% CI 1.04-1.29; P=.009). CONCLUSIONS Among patients hospitalized with COVID-19, UOU is an independent risk factor associated with greater outcome severity, including in-hospital death. Social determinants of health and opioid-related overdose are unique comorbidities in the clinical presentation of the UOU patient subgroup. Additional research is needed on the role of COVID-19 therapeutics and inpatient management of acute COVID-19 pneumonia for patients with UOU. Further research is needed to test associations between expanded evidence-based harm reduction strategies for UOU and vaccination rates, hospitalizations, and risks for overdose and death among people with UOU and COVID-19. Machine learning techniques may offer more exhaustive means for cohort discovery and a novel mixed methods approach to population health.
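The effect sizes reported above are odds ratios with Wald confidence intervals from the fitted mixed-effects logistic model. As a minimal sketch of how an odds ratio and its interval come out of a fitted log-odds coefficient, the coefficient and standard error below are back-calculated from the reported OR 1.16 (95% CI 1.04-1.29) purely for illustration, not taken from the paper:

```python
import math

def odds_ratio_ci(beta: float, se: float, z: float = 1.96):
    """Exponentiate a fitted log-odds coefficient and its Wald confidence bounds."""
    return (math.exp(beta),          # point estimate (odds ratio)
            math.exp(beta - z * se), # lower 95% bound
            math.exp(beta + z * se)) # upper 95% bound

# Illustrative values back-calculated from the reported OR 1.16 (95% CI 1.04-1.29)
or_, lo, hi = odds_ratio_ci(beta=0.1484, se=0.0542)
```

The exponentiation is what turns an additive effect on the log-odds scale into the multiplicative "1.16 higher odds per incremental increase" interpretation used in the abstract.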
Affiliation(s)
- Hale M Thompson
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Center for Education, Research, and Advocacy, Department of Social and Behavioral Research, Howard Brown Health, Chicago, IL, United States
- Brihat Sharma
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Dale L Smith
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Sameer Bhalla
- Department of Internal Medicine, Rush University Medical Center, Chicago, IL, United States
- Ihuoma Erondu
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Aniruddha Hazra
- Section of Infectious Diseases and Global Health, Department of Medicine, University of Chicago, Chicago, IL, United States
- Yousaf Ilyas
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Paul Pachwicewicz
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Neeral K Sheth
- Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Neeraj Chhabra
- Department of Emergency Medicine, Rush University Medical College, Rush University Medical Center, Chicago, IL, United States
- Niranjan S Karnik
- Section of Community Behavioral Health, Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, United States
- Majid Afshar
- Division of Pulmonary and Critical Care, Department of Medicine, School of Medicine and Public Health, University of Wisconsin, Madison, WI, United States
20
Nelson AE, Arbeeva L. Narrative Review of Machine Learning in Rheumatic and Musculoskeletal Diseases for Clinicians and Researchers: Biases, Goals, and Future Directions. J Rheumatol 2022; 49:1191-1200. [PMID: 35840150 PMCID: PMC9633365 DOI: 10.3899/jrheum.220326] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 06/21/2022] [Indexed: 11/22/2022]
Abstract
There has been rapid growth in the use of artificial intelligence (AI) analytics in medicine in recent years, including in rheumatic and musculoskeletal diseases (RMDs). Such methods represent a challenge to clinicians, patients, and researchers, given the "black box" nature of most algorithms, the unfamiliarity of the terms, and the lack of awareness of potential issues around these analyses. Therefore, this review aims to introduce this subject area in a way that is relevant and meaningful to clinicians and researchers. We hope to provide some insights into relevant strengths and limitations, reporting guidelines, as well as recent examples of such analyses in key areas, with a focus on lessons learned and future directions in diagnosis, phenotyping, prognosis, and precision medicine in RMDs.
Affiliation(s)
- Amanda E Nelson
- A.E. Nelson, MD, MSCR, Department of Medicine, Division of Rheumatology, Allergy, and Immunology, University of North Carolina at Chapel Hill;
- Liubov Arbeeva
- L. Arbeeva, MS, Thurston Arthritis Research Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
21
Gao Y, Miller T, Xu D, Dligach D, Churpek MM, Afshar M. Summarizing Patients' Problems from Hospital Progress Notes Using Pre-trained Sequence-to-Sequence Models. PROCEEDINGS OF COLING. INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS 2022; 2022:2979-2991. [PMID: 36268128 PMCID: PMC9581107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Automatically summarizing patients' main problems from daily progress notes using natural language processing methods helps to battle against information and cognitive overload in hospital settings and potentially assists providers with computerized diagnostic decision support. Problem list summarization requires a model to understand, abstract, and generate clinical documentation. In this work, we propose a new NLP task that aims to generate a list of problems in a patient's daily care plan using input from the provider's progress notes during hospitalization. We investigate the performance of T5 and BART, two state-of-the-art seq2seq transformer architectures, in solving this problem. We provide a corpus built on top of progress notes from publicly available electronic health record progress notes in the Medical Information Mart for Intensive Care (MIMIC)-III. T5 and BART are trained on general domain text, and we experiment with a data augmentation method and a domain adaptation pre-training method to increase exposure to medical vocabulary and knowledge. Evaluation methods include ROUGE, BERTScore, cosine similarity on sentence embedding, and F-score on medical concepts. Results show that T5 with domain adaptive pre-training achieves significant performance gains compared to a rule-based system and general domain pre-trained language models, indicating a promising direction for tackling the problem summarization task.
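Among the evaluation metrics listed, the ROUGE family is n-gram overlap between generated and reference text. A minimal pure-Python ROUGE-1 F1 (a simplification for illustration — the study also uses BERTScore, sentence-embedding cosine similarity, and concept-level F-score) can be sketched as:

```python
from collections import Counter

def rouge1_f(candidate: str, reference: str) -> float:
    """Unigram-overlap ROUGE-1 F1 between a generated and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Perfect overlap scores 1.0 and disjoint vocabularies score 0.0; partial credit accrues per matched unigram, which is why the study pairs ROUGE with embedding- and concept-based metrics that tolerate paraphrase.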
Affiliation(s)
- Yanjun Gao
- ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison
- Dongfang Xu
- Boston Children's Hospital and Harvard Medical School
- Matthew M Churpek
- ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison
- Majid Afshar
- ICU Data Science Lab, School of Medicine and Public Health, University of Wisconsin-Madison
22
Hammouda N, Neyra JA. Can Artificial Intelligence Assist in Delivering Continuous Renal Replacement Therapy? Adv Chronic Kidney Dis 2022; 29:439-449. [PMID: 36253027 PMCID: PMC9586461 DOI: 10.1053/j.ackd.2022.08.001] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Revised: 08/02/2022] [Accepted: 08/11/2022] [Indexed: 01/25/2023]
Abstract
Continuous renal replacement therapy (CRRT) is widely utilized to support critically ill patients with acute kidney injury. Artificial intelligence (AI) has the potential to enhance CRRT delivery, but evidence is limited. We reviewed existing literature on the utilization of AI in CRRT with the objective of identifying current gaps in evidence and research considerations. We conducted a scoping review focusing on the development or use of AI-based tools in patients receiving CRRT. Ten papers were identified; 6 of 10 (60%) were published in 2021, and 6 of 10 (60%) focused on machine learning models to augment CRRT delivery. All innovations were in the design/early validation phase of development. Primary research interests focused on early indicators of CRRT need, prognostication of mortality and kidney recovery, and identification of risk factors for mortality. Secondary research priorities included dynamic CRRT monitoring, predicting CRRT-related complications, and automated data pooling for point-of-care analysis. Literature gaps included prospective validation and implementation, ascertainment of bias, and evaluation of AI-generated health care disparities. Research on AI applications to enhance CRRT delivery has grown exponentially in recent years, but the field remains immature. There is a need to evaluate how these applications could enhance bedside decision-making capacity and support the structure and processes of CRRT delivery.
Affiliation(s)
- Nada Hammouda
- Department of Applied Clinical Research, University of Texas Southwestern, Dallas, TX
- Javier A Neyra
- Department of Medicine, Division of Nephrology, University of Alabama at Birmingham, Birmingham, AL
23
Conditional generation of medical time series for extrapolation to underrepresented populations. PLOS DIGITAL HEALTH 2022; 1:e0000074. [PMID: 36812549 PMCID: PMC9931259 DOI: 10.1371/journal.pdig.0000074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/15/2022] [Accepted: 06/10/2022] [Indexed: 11/19/2022]
Abstract
The widespread adoption of electronic health records (EHRs) and subsequent increased availability of longitudinal healthcare data has led to significant advances in our understanding of health and disease with direct and immediate impact on the development of new diagnostics and therapeutic treatment options. However, access to EHRs is often restricted due to their perceived sensitive nature and associated legal concerns, and the cohorts therein typically are those seen at a specific hospital or network of hospitals and therefore not representative of the wider population of patients. Here, we present HealthGen, a new approach for the conditional generation of synthetic EHRs that maintains an accurate representation of real patient characteristics, temporal information and missingness patterns. We demonstrate experimentally that HealthGen generates synthetic cohorts that are significantly more faithful to real patient EHRs than the current state-of-the-art, and that augmenting real data sets with conditionally generated cohorts of underrepresented subpopulations of patients can significantly enhance the generalisability of models derived from these data sets to different patient populations. Synthetic conditionally generated EHRs could help increase the accessibility of longitudinal healthcare data sets and improve the generalisability of inferences made from these data sets to underrepresented populations.
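HealthGen itself is a learned conditional generative model; the downstream step it enables — padding an underrepresented subgroup with synthetic records before model training — can be sketched generically. In the sketch below the generator is a caller-supplied stand-in function (here just a record copy), not HealthGen's actual architecture:

```python
import random

def augment_underrepresented(cohort, is_minority, generate, target_share, seed=0):
    """Append synthetic records for an underrepresented subgroup until that
    subgroup makes up at least target_share of the cohort. `generate` stands
    in for a learned conditional generator such as HealthGen."""
    rng = random.Random(seed)
    out = list(cohort)
    minority = [r for r in out if is_minority(r)]
    if not minority:
        raise ValueError("no minority records to condition on")
    while len(minority) / len(out) < target_share:
        synthetic = generate(rng.choice(minority), rng)
        out.append(synthetic)
        minority.append(synthetic)
    return out

# Grow group "b" from 10% of the cohort to at least 30%,
# using a plain dict copy as the stand-in generator.
cohort = [{"group": "a"}] * 9 + [{"group": "b"}]
augmented = augment_underrepresented(cohort,
                                     lambda r: r["group"] == "b",
                                     lambda template, rng: dict(template),
                                     target_share=0.3)
```

The point of conditioning on real minority records, rather than naively duplicating them, is that a generative model can produce varied yet realistic trajectories, which is what drives the generalisability gains reported above.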
24
Natural language processing to identify substance misuse in the electronic health record. Lancet Digit Health 2022; 4:e401-e402. [DOI: 10.1016/s2589-7500(22)00096-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2022] [Accepted: 05/10/2022] [Indexed: 11/24/2022]
25
Afshar M. To err is machine: Considerations on the clinical impact of machine learning models in patients with unhealthy alcohol use. Alcohol Clin Exp Res 2022; 46:912-914. [PMID: 35429003 DOI: 10.1111/acer.14842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Revised: 04/07/2022] [Accepted: 04/09/2022] [Indexed: 11/28/2022]
Affiliation(s)
- Majid Afshar
- Department of Medicine, School of Medicine and Public Health, University of Wisconsin, Madison, Wisconsin, USA
26
Afshar M, Sharma B, Dligach D, Oguss M, Brown R, Chhabra N, Thompson HM, Markossian T, Joyce C, Churpek MM, Karnik NS. Development and multimodal validation of a substance misuse algorithm for referral to treatment using artificial intelligence (SMART-AI): a retrospective deep learning study. THE LANCET DIGITAL HEALTH 2022; 4:e426-e435. [PMID: 35623797 PMCID: PMC9159760 DOI: 10.1016/s2589-7500(22)00041-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/10/2021] [Revised: 02/12/2022] [Accepted: 02/16/2022] [Indexed: 01/02/2023]
Abstract
Background Substance misuse is a heterogeneous and complex set of behavioural conditions that are highly prevalent in hospital settings and frequently co-occur. Few hospital-wide solutions exist to comprehensively and reliably identify these conditions to prioritise care and guide treatment. The aim of this study was to apply natural language processing (NLP) to clinical notes collected in the electronic health record (EHR) to accurately screen for substance misuse. Methods The model was trained and developed on a reference dataset derived from a hospital-wide programme at Rush University Medical Center (RUMC), Chicago, IL, USA, that used structured diagnostic interviews to manually screen admitted patients over 27 months (between Oct 1, 2017, and Dec 31, 2019; n=54 915). The Alcohol Use Disorder Identification Test and Drug Abuse Screening Tool served as reference standards. The first 24 h of notes in the EHR were mapped to standardised medical vocabulary and fed into single-label, multilabel, and multilabel with auxiliary-task neural network models. Temporal validation of the model was done using data from the subsequent 12 months on a subset of RUMC patients (n=16 917). External validation was done using data from Loyola University Medical Center, Chicago, IL, USA between Jan 1, 2007, and Sept 30, 2017 (n=1991 adult patients). The primary outcome was discrimination for alcohol misuse, opioid misuse, or non-opioid drug misuse. Discrimination was assessed by the area under the receiver operating characteristic curve (AUROC). Calibration slope and intercept were measured with the unreliability index. Bias assessments were performed across demographic subgroups. Findings The model was trained on a cohort that had 3·5% misuse (n=1 921) with any type of substance. 220 (11%) of 1921 patients with substance misuse had more than one type of misuse.
The multilabel convolutional neural network classifier had a mean AUROC of 0·97 (95% CI 0·96–0·98) during temporal validation for all types of substance misuse. The model was well calibrated and showed good face validity with model features containing explicit mentions of aberrant drug-taking behaviour. A false-negative rate of 0·18–0·19 and a false-positive rate of 0·03 between non-Hispanic Black and non-Hispanic White groups occurred. In external validation, the AUROCs for alcohol and opioid misuse were 0·88 (95% CI 0·86–0·90) and 0·94 (0·92–0·95), respectively. Interpretation We developed a novel and accurate approach to leveraging the first 24 h of EHR notes for screening multiple types of substance misuse. Funding National Institute On Drug Abuse, National Institutes of Health.
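AUROC, the study's discrimination metric, is the probability that a randomly chosen positive case is scored above a randomly chosen negative one, and the subgroup bias check above compares error rates such as the false-negative rate across demographic groups. A small pure-Python sketch of both quantities (not the study's code):

```python
def auroc(scores, labels):
    """Probability a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def false_negative_rate(scores, labels, threshold):
    """Fraction of true positives the screen misses at a given operating threshold."""
    flagged = [s >= threshold for s, y in zip(scores, labels) if y == 1]
    return 1 - sum(flagged) / len(flagged)
```

Computing `false_negative_rate` separately per demographic subgroup, as the study does for non-Hispanic Black versus non-Hispanic White patients, turns a single aggregate metric into a bias assessment.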
27
Huang J, Galal G, Etemadi M, Vaidyanathan M. Evaluation and Mitigation of Racial Bias in Clinical Machine Learning Models: A Scoping Review. JMIR Med Inform 2022; 10:e36388. [PMID: 35639450 PMCID: PMC9198828 DOI: 10.2196/36388] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2022] [Revised: 02/17/2022] [Accepted: 03/27/2022] [Indexed: 01/12/2023] Open
Abstract
Background Racial bias is a key concern regarding the development, validation, and implementation of machine learning (ML) models in clinical settings. Despite the potential of bias to propagate health disparities, racial bias in clinical ML has yet to be thoroughly examined and best practices for bias mitigation remain unclear. Objective Our objective was to perform a scoping review to characterize the methods by which the racial bias of ML has been assessed and describe strategies that may be used to enhance algorithmic fairness in clinical ML. Methods A scoping review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-analyses (PRISMA) Extension for Scoping Reviews. A literature search using PubMed, Scopus, and Embase databases, as well as Google Scholar, identified 635 records, of which 12 studies were included. Results Applications of ML were varied and involved diagnosis, outcome prediction, and clinical score prediction performed on data sets including images, diagnostic studies, clinical text, and clinical variables. Of the 12 studies, 1 (8%) described a model in routine clinical use, 2 (17%) examined prospectively validated clinical models, and the remaining 9 (75%) described internally validated models. In addition, 8 (67%) studies concluded that racial bias was present, 2 (17%) concluded that it was not, and 2 (17%) assessed the implementation of bias mitigation strategies without comparison to a baseline model. Fairness metrics used to assess algorithmic racial bias were inconsistent. The most commonly observed metrics were equal opportunity difference (5/12, 42%), accuracy (4/12, 25%), and disparate impact (2/12, 17%). All 8 (67%) studies that implemented methods for mitigation of racial bias successfully increased fairness, as measured by the authors’ chosen metrics. Preprocessing methods of bias mitigation were most commonly used across all studies that implemented them. 
Conclusions The broad scope of medical ML applications and potential patient harms demand an increased emphasis on evaluation and mitigation of racial bias in clinical ML. However, the adoption of algorithmic fairness principles in medicine remains inconsistent and is limited by poor data availability and ML model reporting. We recommend that researchers and journal editors emphasize standardized reporting and data availability in medical ML studies to improve transparency and facilitate evaluation for racial bias.
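Two of the fairness metrics the review found most common — equal opportunity difference (a true-positive-rate gap) and disparate impact (a selection-rate ratio) — reduce to simple arithmetic on per-group counts. A hedged sketch, with group labels and counts chosen purely for illustration:

```python
def equal_opportunity_difference(tp_a, fn_a, tp_b, fn_b):
    """TPR(group A) - TPR(group B); 0 indicates equal opportunity."""
    return tp_a / (tp_a + fn_a) - tp_b / (tp_b + fn_b)

def disparate_impact(selected_a, total_a, selected_b, total_b):
    """Ratio of group A's positive-prediction rate to group B's; values
    below 0.8 are conventionally flagged under the 'four-fifths rule'."""
    return (selected_a / total_a) / (selected_b / total_b)
```

The inconsistency the review highlights is less about computing these numbers than about which metric a study chooses and which baseline model it compares against.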
Affiliation(s)
- Jonathan Huang
- Department of Anesthesiology, Northwestern University Feinberg School of Medicine, Chicago, IL, United States
- Galal Galal
- Department of Anesthesiology, Northwestern University Feinberg School of Medicine, Chicago, IL, United States
- Mozziyar Etemadi
- Department of Anesthesiology, Northwestern University Feinberg School of Medicine, Chicago, IL, United States
- Department of Biomedical Engineering, Northwestern University, Evanston, IL, United States
- Mahesh Vaidyanathan
- Department of Anesthesiology, Northwestern University Feinberg School of Medicine, Chicago, IL, United States
- Digital Health & Data Science Curricular Thread, Northwestern University Feinberg School of Medicine, Chicago, IL, United States
28
Aboalshamat K, Alhuzali R, Alalyani A, Alsharif S, Qadhi H, Almatrafi R, Ammash D, Alotaibi S. Medical and Dental Professionals Readiness for Artificial Intelligence for Saudi Arabia Vision 2030. INTERNATIONAL JOURNAL OF PHARMACEUTICAL RESEARCH AND ALLIED SCIENCES 2022. [DOI: 10.51847/nu8y6y6q1m] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]