1
Vithanage D, Deng C, Wang L, Yin M, Alkhalaf M, Zhang Z, Zhu Y, Yu P. Adapting Generative Large Language Models for Information Extraction from Unstructured Electronic Health Records in Residential Aged Care: A Comparative Analysis of Training Approaches. J Healthc Inform Res 2025; 9:191-219. PMID: 40309133; PMCID: PMC12037947; DOI: 10.1007/s41666-025-00190-z.
Abstract
Information extraction (IE) of unstructured electronic health records is challenging due to the semantic complexity of textual data. Generative large language models (LLMs) offer promising solutions to address this challenge. However, identifying the best training methods to adapt LLMs for IE in residential aged care settings remains underexplored. This research addresses this challenge by evaluating the effects of zero-shot and few-shot learning, both with and without parameter-efficient fine-tuning (PEFT) and retrieval-augmented generation (RAG), using Llama 3.1-8B. The study applied named entity recognition (NER) to nursing notes from Australian residential aged care facilities (RACFs), focusing on agitation in dementia and malnutrition risk factors. Performance evaluation included accuracy, macro-averaged precision, recall, and F1 score. We used non-parametric statistical methods to test whether the differences were statistically significant. Results show that zero-shot and few-shot learning, whether combined with PEFT or RAG, achieve comparable performance across the clinical domains when the same prompting template is used. Few-shot learning significantly outperforms zero-shot learning when neither PEFT nor RAG is applied. Notably, PEFT significantly improves model performance in both zero-shot and few-shot learning; however, RAG significantly improves performance only in few-shot learning. After PEFT, the performance of zero-shot learning reaches a comparable level with few-shot learning. However, few-shot learning with RAG significantly outperforms zero-shot learning with RAG. We also found a similar level of performance between few-shot learning with RAG and zero-shot learning with PEFT. These findings provide valuable insights for researchers, practitioners, and stakeholders to optimize the use of generative LLMs in clinical IE. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-025-00190-z.
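The macro-averaged metrics this abstract reports can be sketched as follows. This is a minimal illustration of the standard macro definition (per-label precision, recall, and F1, averaged with equal weight per label), not the authors' evaluation code; the labels are hypothetical.

```python
def macro_prf1(y_true, y_pred):
    # Macro averaging: compute precision, recall, and F1 for each
    # label separately, then take the unweighted mean over labels.
    labels = sorted(set(y_true) | set(y_pred))
    p_sum = r_sum = f_sum = 0.0
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        p_sum += prec
        r_sum += rec
        f_sum += f1
    n = len(labels)
    return p_sum / n, r_sum / n, f_sum / n
```

Because every label contributes equally regardless of frequency, macro averaging keeps rare entity types (e.g. an uncommon agitation behavior) from being drowned out by common ones.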
Affiliation(s)
- Dinithi Vithanage: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia
- Chao Deng: School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, Australia
- Lei Wang: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia
- Mohammad Alkhalaf: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia; School of Computer Science, Qassim University, Qassim, Saudi Arabia
- Zhenyu Zhang: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia
- Yunshu Zhu: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia
- Ping Yu: School of Computing and Information Technology, University of Wollongong, Wollongong, Australia
2
Poole S, Sisodia N, Koshal K, Henderson K, Wijangco J, Paredes D, Chen C, Rowles W, Akula A, Wuerfel J, Sharma V, Rauschecker AM, Henry RG, Bove R. Detecting New Lesions Using a Large Language Model: Applications in Real-World Multiple Sclerosis Datasets. Ann Neurol 2025. PMID: 40277428; DOI: 10.1002/ana.27251.
Abstract
OBJECTIVE Neuroimaging is routinely utilized to identify new inflammatory activity in multiple sclerosis (MS). A large language model to classify narrative magnetic resonance imaging reports in the electronic health record (EHR) as discrete data could provide significant benefits for MS research. The objectives of the current study were to develop such a prompt and to illustrate its research applications through a common clinical scenario: monitoring response to B-cell depleting therapy (BCDT). METHODS An institutional ecosystem that securely connects healthcare data with ChatGPT4 was applied to clinical MS magnetic resonance imaging reports in a single institutional EHR (2000-2022). A prompt (msLesionprompt) was developed and iteratively refined to classify the presence or absence of new T2-weighted lesions (newT2w) and contrast-enhancing lesions (CEL). The multistep validation included evaluating efficiency (time and cost), comparison with manually annotated reports using a standard confusion matrix, and application to identifying predictors of newT2w/CEL after BCDT start. RESULTS Accuracy of msLesionprompt was high for detection of newT2w (97%) and CEL (96.8%). All 14,888 available reports were categorized in 4.13 hours ($28); 79% showed no newT2w or CEL. Data extracted showed expected suppression of new activity by BCDT (>97% monitoring magnetic resonance images after an initial "rebaseline" scan). Neighborhood poverty (Area Deprivation Index) was identified as a predictor of inflammatory activity (newT2w: OR 1.69, 95% CI 1.10-2.59, p = 0.017; CEL: OR 1.54, 95% CI 1.01-2.34, p = 0.046). INTERPRETATION Extracting discrete information from narrative imaging reports using a large language model is feasible and efficient. This approach could augment many real-world analyses of MS disease evolution and treatment response. ANN NEUROL 2025.
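An odds ratio with a 95% confidence interval, like those reported for the Area Deprivation Index, can be illustrated from a 2x2 table using the Woolf log-scale method. The study's ORs come from its own regression analysis; the counts below are hypothetical, and this unadjusted sketch is only meant to show how such an interval is formed.

```python
import math

def odds_ratio_ci95(a, b, c, d):
    # 2x2 table: a = exposed with outcome, b = exposed without,
    #            c = unexposed with outcome, d = unexposed without.
    or_ = (a * d) / (b * c)
    # Woolf method: standard error of log(OR) from the cell counts,
    # then exponentiate the symmetric interval on the log scale.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, lo, hi
```

An interval whose lower bound exceeds 1 (as for the newT2w OR of 1.69, CI 1.10-2.59) corresponds to a statistically significant positive association at the 5% level.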
Affiliation(s)
- Shane Poole: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Nikki Sisodia: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Kanishka Koshal: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Kyra Henderson: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Jaeleene Wijangco: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Danelvis Paredes: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Chelsea Chen: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- William Rowles: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Amit Akula: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Vishakha Sharma: F. Hoffmann-La Roche, Basel, Switzerland; Roche Diagnostics, Santa Clara, USA
- Andreas M Rauschecker: UCSF Center for Intelligent Imaging (ci2), Department of Radiology & Biomedical Imaging, University of California, San Francisco, CA, USA
- Roland G Henry: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
- Riley Bove: UCSF Weill Institute for Neurosciences, University of California San Francisco, San Francisco, CA, USA
3
Soroush A, Giuffrè M, Chung S, Shung DL. Generative Artificial Intelligence in Clinical Medicine and Impact on Gastroenterology. Gastroenterology 2025:S0016-5085(25)00634-1. PMID: 40245953; DOI: 10.1053/j.gastro.2025.03.038.
Abstract
The pace of artificial intelligence (AI) integration into health care has accelerated with rapid advances in generative AI (genAI). Gastroenterology and hepatology in particular will be transformed due to the multimodal workflows that integrate endoscopic video, radiologic imaging, tabular data, and unstructured note text. GenAI will impact the entire spectrum of clinical experience, from administrative tasks to diagnostic guidance and treatment recommendations. Unlike traditional machine learning approaches, genAI is more flexible, with one platform able to be used across multiple tasks. Initial evidence suggests benefits in lower-level administrative tasks, such as clinical documentation, medical billing, and scheduling; and information tasks, such as patient education and summarization of the medical literature. No evidence exists for genAI solutions for more complex tasks relevant to clinical care, such as clinical reasoning for diagnostic and treatment decisions that may affect patient outcomes. Challenges of output reliability, data privacy, and useful integration remain; potential solutions include robust validation, regulatory oversight, and "human-AI teaming" strategies to ensure safe, effective deployment. We remain optimistic about the potential of genAI to augment clinical expertise, given its adaptability across multiple data modalities, its ability to obtain and focus relevant information flows, and the human-friendly interfaces that facilitate ease of use. We believe that the potential of genAI for dynamic human-algorithmic interactions may allow for a degree of clinician-directed customization to enhance human presence.
Affiliation(s)
- Ali Soroush: Division of Data-Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York; Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York, New York; Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
- Mauro Giuffrè: Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut
- Sunny Chung: Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut
- Dennis L Shung: Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut; Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut
4
Ziegeler K, Kreutzinger V, Tong MW, Chin CT, Bahroos E, Wu PH, Bonnheim N, Fields AJ, Lotz JC, Link TM, Majumdar S. Information Extraction from Lumbar Spine MRI Radiology Reports Using GPT4: Accuracy and Benchmarking Against Research-Grade Comprehensive Scoring. Diagnostics (Basel) 2025; 15:930. PMID: 40218280; PMCID: PMC11989208; DOI: 10.3390/diagnostics15070930.
Abstract
Background/Objectives: This study aimed to create a pipeline for standardized data extraction from lumbar spine MRI radiology reports using a large language model (LLM) and assess the agreement of the extracted data with research-grade semi-quantitative scoring. Methods: We included a subset of data from a multi-site NIH-funded cohort study of chronic low back pain (cLBP) participants. After initial prompt development, a secure application programming interface (API) deployment of OpenAI's GPT-4 was used to extract different classes of pathology from the clinical radiology report. Unsupervised UMAP and agglomerative clustering of the pathology terms' embeddings provided insight into model comprehension for optimized prompt design. Model extraction was benchmarked against human extraction (gold standard) with F1 scores and false-positive and false-negative rates (FPR/FNR). Then, an expert MSK radiologist provided comprehensive research-grade scores of the images, and agreement with report-extracted data was calculated using Cohen's kappa. Results: Data from 230 patients with cLBP were included (mean age 53.2 years, 54% women). The overall model performance for extracting data from clinical reports was excellent, with a mean F1 score of 0.96 across pathologies. The mean FPR was marginally higher than the FNR (5.1% vs. 3.0%). Agreement with comprehensive scoring was moderate (kappa 0.424), and the underreporting of lateral recess stenosis (FNR 63.6%) and overreporting of disc pathology (FPR 42.7%) were noted. Conclusions: LLMs can accurately extract highly detailed information on lumbar spine imaging pathologies from radiology reports. Moderate agreement between the LLM and comprehensive scores underscores the need for less subjective, machine-based data extraction from imaging.
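The Cohen's kappa statistic behind the reported agreement of 0.424 can be sketched as follows: observed agreement corrected for the agreement expected by chance given each rater's label frequencies. This is the textbook definition, not the study's analysis code, and the labels below are illustrative only.

```python
def cohens_kappa(rater1, rater2):
    # Kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    # agreement rate and p_e the chance agreement implied by each
    # rater's marginal label frequencies.
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    cats = set(rater1) | set(rater2)
    pe = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)
```

Unlike raw accuracy, kappa discounts agreement that two raters would reach by chance alone, which is why a 96% accurate extractor can still show only moderate (kappa approximately 0.4) agreement with a comprehensive scoring standard.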
Affiliation(s)
- Katharina Ziegeler: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
- Virginie Kreutzinger: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
- Michelle W. Tong: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA; Department of Bioengineering, University of California Berkeley, Berkeley, CA 94720, USA; Department of Bioengineering, University of California San Francisco, San Francisco, CA 94143, USA
- Cynthia T. Chin: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
- Emma Bahroos: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
- Po-Hung Wu: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA; The UCSF REACH Center, The Core Center for Patient-Centric Mechanistic Phenotyping in Chronic Low Back Pain, San Francisco, CA 94143, USA; Department of Orthopaedic Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Noah Bonnheim: The UCSF REACH Center, The Core Center for Patient-Centric Mechanistic Phenotyping in Chronic Low Back Pain, San Francisco, CA 94143, USA; Department of Orthopaedic Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Aaron J. Fields: The UCSF REACH Center, The Core Center for Patient-Centric Mechanistic Phenotyping in Chronic Low Back Pain, San Francisco, CA 94143, USA; Department of Orthopaedic Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Jeffrey C. Lotz: Department of Orthopaedic Surgery, University of California San Francisco, San Francisco, CA 94143, USA
- Thomas M. Link: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
- Sharmila Majumdar: Department of Radiology and Biomedical Imaging, University of California San Francisco, San Francisco, CA 94143, USA
5
Geevarghese R, Solomon SB, Alexander ES, Marinelli B, Chatterjee S, Jain P, Cadley J, Hollingsworth A, Chatterjee A, Ziv E. Utility of a Large Language Model for Extraction of Clinical Findings from Healthcare Data following Lung Ablation: A Feasibility Study. J Vasc Interv Radiol 2025; 36:704-708. PMID: 39662619; DOI: 10.1016/j.jvir.2024.11.029.
Abstract
This study assessed the feasibility of using a large language model (LLM) to extract clinically relevant information from healthcare data in patients who have undergone microwave ablation for lung tumors. In this single-center retrospective study, radiology reports and clinic notes of 20 patients were extracted, up to 12 months after treatment. Utilizing an LLM (generative pretrained transformer 3.5 Turbo 16k), a zero-shot prompt strategy was employed to identify 4 key outcomes from relevant healthcare data: (a) recurrence at ablation site, (b) pneumothorax, (c) hemoptysis, and (d) hemothorax following ablation. This was validated with ground-truth labels obtained through manual chart review. Analysis of 104 radiology reports and 37 clinic notes was undertaken. The LLM output demonstrated high accuracy (85%-100%) across the 4 outcomes. An LLM approach appears to have utility in extraction of clinically relevant information from healthcare data. This method may be beneficial in facilitating data analysis for future interventional radiology studies.
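A zero-shot prompt strategy of the kind described gives the model instructions and the document, with no worked examples. The sketch below assembles such a prompt for the four outcomes named in the abstract; the wording, function name, and output format are hypothetical (the paper's actual prompt is not reproduced here), only the four target outcomes come from the source.

```python
# The four outcomes the study extracted; everything else is assumed.
OUTCOMES = (
    "recurrence at the ablation site",
    "pneumothorax",
    "hemoptysis",
    "hemothorax",
)

def build_extraction_prompt(document_text):
    # Zero-shot: instructions plus the document, no in-context examples.
    questions = "\n".join(f"- {o}: yes or no" for o in OUTCOMES)
    return (
        "You are reviewing a clinical document for a patient who "
        "underwent microwave ablation of a lung tumor. Based only on "
        "the text below, answer each item:\n"
        f"{questions}\n\n"
        f"Document:\n{document_text}"
    )
```

The resulting string would be sent to the LLM once per report, and the yes/no answers compared against the manually chart-reviewed ground-truth labels.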
Affiliation(s)
- Ruben Geevarghese: Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- Stephen B Solomon: Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- Erica S Alexander: Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- Brett Marinelli: Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
- Subrata Chatterjee: Department of Artificial Intelligence & Machine Learning, DigITs, Memorial Sloan Kettering Cancer Center, New York, New York
- Pulkit Jain: Department of Artificial Intelligence & Machine Learning, DigITs, Memorial Sloan Kettering Cancer Center, New York, New York
- John Cadley: Department of Artificial Intelligence & Machine Learning, DigITs, Memorial Sloan Kettering Cancer Center, New York, New York
- Alex Hollingsworth: Department of Artificial Intelligence & Machine Learning, DigITs, Memorial Sloan Kettering Cancer Center, New York, New York
- Avijit Chatterjee: Department of Artificial Intelligence & Machine Learning, DigITs, Memorial Sloan Kettering Cancer Center, New York, New York
- Etay Ziv: Division of Interventional Radiology, Department of Radiology, Memorial Sloan Kettering Cancer Center, New York, New York
6
Flanagan CP, Trang K, Nacario J, Schneider PA, Gasper WJ, Conte MS, Wick EC, Conway AM. Large language models can accurately populate Vascular Quality Initiative procedural databases using narrative operative reports. J Vasc Surg 2025; 81:973-982. PMID: 39694151; DOI: 10.1016/j.jvs.2024.12.002.
Abstract
OBJECTIVE Participation in the Vascular Quality Initiative (VQI) provides important resources to surgeons, but the ability to do so is often limited by time and data entry personnel. Large language models (LLMs) such as ChatGPT (OpenAI) are examples of generative artificial intelligence products that may help bridge this gap. Trained on large volumes of data, the models are used for natural language processing and text generation. We evaluated the ability of LLMs to accurately populate VQI procedural databases using operative reports. METHODS A single-center, retrospective study was performed using institutional VQI data from 2021 to 2023. The most recent procedures for carotid endarterectomy (CEA), endovascular aneurysm repair (EVAR), and infrainguinal lower extremity bypass (LEB) were analyzed using Versa, a HIPAA (Health Insurance Portability and Accountability Act)-compliant institutional version of ChatGPT. We created an automated function to analyze operative reports and generate a shareable VQI file using two models: gpt-35-turbo and gpt-4. Application of the LLMs was accomplished with a cloud-based programming interface. The outputs of this model were compared with VQI data for accuracy. We defined a metric as "unavailable" to the LLM if it was discussed by surgeons in <20% of operative reports. RESULTS A total of 150 operative notes were analyzed, including 50 CEA, 50 EVAR, and 50 LEB. These procedural VQI databases included 25, 179, and 51 metrics, respectively. For all fields, gpt-35-turbo had a median accuracy of 84.0% for CEA (interquartile range [IQR]: 80.0%-88.0%), 92.2% for EVAR (IQR: 87.2%-94.0%), and 84.3% for LEB (IQR: 80.2%-88.1%). A total of 3 of 25, 6 of 179, and 7 of 51 VQI variables were unavailable in the operative reports, respectively. Excluding metric information routinely unavailable in operative reports, the median accuracy rate was 95.5% for each CEA procedure (IQR: 90.9%-100.0%), 94.8% for EVAR (IQR: 92.2%-98.5%), and 93.2% for LEB (IQR: 90.2%-96.4%). Across procedures, gpt-4 did not meaningfully improve performance compared with gpt-35 (P = .97, .85, and .95 for CEA, EVAR, and LEB overall performance, respectively). The cost for 150 operative reports analyzed with gpt-35-turbo and gpt-4 was $0.12 and $3.39, respectively. CONCLUSIONS LLMs can accurately populate VQI procedural databases with both structured and unstructured data, while incurring only minor processing costs. Increased workflow efficiency may improve centers' ability to successfully participate in the VQI. Further work examining other VQI databases and methods to increase accuracy is needed.
Affiliation(s)
- Colleen P Flanagan: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA; Division of Clinical Informatics and Digital Transformation, Department of Medicine, University of California San Francisco, San Francisco, CA
- Karen Trang: Division of Clinical Informatics and Digital Transformation, Department of Medicine, University of California San Francisco, San Francisco, CA; Division of General Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Joyce Nacario: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Peter A Schneider: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Warren J Gasper: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Michael S Conte: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Elizabeth C Wick: Division of Clinical Informatics and Digital Transformation, Department of Medicine, University of California San Francisco, San Francisco, CA; Division of General Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
- Allan M Conway: Division of Vascular and Endovascular Surgery, Department of Surgery, University of California San Francisco, San Francisco, CA
7
Vong T, Rizer N, Jain V, Thompson VL, Dredze M, Klein EY, Hinson JS, Purnell T, Kwak S, Woreta T, Strauss AT. Automated identification of incidental hepatic steatosis on Emergency Department imaging using large language models. Hepatol Commun 2025; 9:e0638. PMID: 39969431; PMCID: PMC11841845; DOI: 10.1097/hc9.0000000000000638.
Abstract
BACKGROUND Hepatic steatosis is a precursor to more severe liver disease, increasing morbidity and mortality risks. In the Emergency Department, routine abdominal imaging often reveals incidental hepatic steatosis that goes undiagnosed due to the acute nature of encounters. Imaging reports in the electronic health record contain valuable information not easily accessible as discrete data elements. We hypothesized that large language models could reliably detect hepatic steatosis from reports without extensive natural language processing training. METHODS We identified 200 adults who had CT abdominal imaging in the Emergency Department between August 1, 2016, and December 31, 2023. Using text from imaging reports and structured prompts, 3 Azure OpenAI models (ChatGPT 3.5, 4, 4o) identified patients with hepatic steatosis. We evaluated model performance regarding accuracy, inter-rater reliability, sensitivity, and specificity compared to physician reviews. RESULTS The accuracy for the models was 96.2% for v3.5, 98.3% for v4, and 98.8% for v4o. Inter-rater reliability ranged from 0.99 to 1.00 across 10 iterations. Mean model confidence scores were 2.9 (SD 0.8) for v3.5, 3.9 (SD 0.3) for v4, and 4.0 (SD 0.07) for v4o. Incorrect evaluations were 76 (3.8%) for v3.5, 34 (1.7%) for v4, and 25 (1.3%) for v4o. All models showed sensitivity and specificity above 0.9. CONCLUSIONS Large language models can assist in identifying incidental conditions from imaging reports that otherwise may be missed opportunities for early disease intervention. Large language models democratize natural language processing by allowing user-friendly, expansive analysis of electronic medical records without requiring the development of complex natural language processing models.
Affiliation(s)
- Tyrus Vong: Division of Gastroenterology & Hepatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Nicholas Rizer: Department of Emergency Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Vedant Jain: Division of Gastroenterology & Hepatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA; Department of Gastroenterology & Hepatology, Carle Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
- Valerie L. Thompson: Division of Gastroenterology & Hepatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Mark Dredze: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
- Eili Y. Klein: Department of Emergency Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Jeremiah S. Hinson: Department of Emergency Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Tanjala Purnell: Department of Epidemiology, Johns Hopkins University Bloomberg School of Public Health, Baltimore, Maryland, USA
- Stephen Kwak: Department of Radiology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Tinsay Woreta: Division of Gastroenterology & Hepatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Alexandra T. Strauss: Division of Gastroenterology & Hepatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
8
Puckett MA, Chafjiri FMA, Gettings JV, Landschaft A, Loddenkemper T. Utilizing natural language processing to identify pediatric patients experiencing status epilepticus. Seizure 2025; 125:54-61. PMID: 39799705; DOI: 10.1016/j.seizure.2025.01.008.
Abstract
PURPOSE To compare the identification of patients with established status epilepticus (ESE) and refractory status epilepticus (RSE) in electronic health records (EHR) using human review versus natural language processing (NLP)-assisted review. METHODS We reviewed EHRs of patients aged 1 month to 21 years from Boston Children's Hospital (BCH). We included all patients with convulsive ESE or RSE during admission. We employed and validated a pre-trained NLP tool, Document review Tool (DrT), to identify patients from 2013-2020, excluding training years (2017-2019). DrT assigns each note a machine-learning score based on a support vector machine (SVM) with bag-of-n-grams features; higher scores indicated more likely ESE/RSE cases. To further evaluate the effectiveness of DrT-assisted review, we compared the results to human-reviewed notes from the pediatric Status Epilepticus Research Group (pSERG) consortium at BCH. RESULTS The pre-trained algorithm identified 170 patients with RSE using DrT (sensitivity: 98.8%), compared to 116 patients identified during human review (sensitivity: 67.4%). Additionally, we identified 207 patients with ESE using DrT (sensitivity: 99.5%), compared to 91 patients identified using human review (sensitivity: 43.8%). Overall, DrT missed 3 cases (2 RSE and 1 ESE) that were identified during human review and identified 173 cases (56 RSE and 117 ESE) that were not found during the human review. CONCLUSION DrT-assisted manual review demonstrated higher sensitivity in identifying patients with ESE and RSE than the current standard of human review. This suggests that, in contexts characterized by resource constraints, NLP-related software like DrT can considerably enhance patient identification for research studies, treatment protocols, and preventative care interventions.
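The bag-of-n-grams representation feeding DrT's SVM can be sketched as follows. This is a generic featurizer under assumed choices (lowercased whitespace tokenization, unigrams plus bigrams); the tool's actual preprocessing is not described in the abstract.

```python
def bag_of_ngrams(text, n_values=(1, 2)):
    # Count every contiguous run of n tokens for each requested n.
    # An SVM consumes these counts as a sparse feature vector, with
    # one dimension per distinct n-gram seen in training.
    tokens = text.lower().split()
    counts = {}
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            counts[gram] = counts.get(gram, 0) + 1
    return counts
```

Bigrams let the classifier weight phrases such as "status epilepticus" directly, rather than relying on the two words appearing independently anywhere in the note.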
Affiliation(s)
- Molly Ann Puckett: Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, 300 Longwood Ave, Boston, MA 02115, USA
- Fatemeh Mohammad Alizadeh Chafjiri: Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, 300 Longwood Ave, Boston, MA 02115, USA
- Jennifer V Gettings: Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, 300 Longwood Ave, Boston, MA 02115, USA
- Assaf Landschaft: Boston Children's Hospital, Harvard Medical School, Boston, MA, USA; Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Schloss Birlinghoven, 1, 53757 Sankt Augustin, Germany
- Tobias Loddenkemper: Division of Epilepsy and Clinical Neurophysiology, Boston Children's Hospital, Harvard Medical School, 300 Longwood Ave, Boston, MA 02115, USA
9
Ramgopal S, Benedetti J, Cotter JM. Performing a Multicenter Retrospective Study. Hosp Pediatr 2025; 15:e77-e82. PMID: 39746377; DOI: 10.1542/hpeds.2024-008020.
Abstract
Multicenter retrospective studies can provide a pragmatic approach to evaluating uncommon pediatric conditions and are less expensive than prospective research. A well-executed retrospective multicenter study, with rigorous study design, systematic data collection, and robust statistical analysis, can produce clinically important and generalizable findings. A variety of observational designs can be employed, including cross-sectional, cohort, and case-control studies. Selection bias, ascertainment bias, and confounding are common issues in retrospective research. Key steps include development of a feasible study design, regular contact with site investigators, and detailed data collection and management strategies. Principal investigators must seek to ensure that case ascertainment and data collection are consistent across sites, using manual and/or automated data extraction methods. Operations manuals, training sessions, and regular meetings can be used to ensure data reliability. Ethical considerations include obtaining institutional review board approval and establishing data use agreements. A proactive statistical approach to handling missing data, using techniques like multiple imputation and sensitivity analyses, is necessary. Careful planning, effective collaboration, and embracing technological advancements will enhance the value and accuracy of retrospective multicenter studies. This article discusses important considerations in the performance of a retrospective multicenter study.
Affiliation(s)
- Sriram Ramgopal
  - Division of Emergency Medicine, Ann & Robert H. Lurie Children's Hospital of Chicago, Department of Pediatrics, Northwestern University Feinberg School of Medicine, Chicago, Illinois
- Jillian Benedetti
  - Office of Clinical and Community Trials, Stanley Manne Children's Research Institute, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois
- Jillian M Cotter
  - Department of Pediatrics, Section of Hospital Medicine, Children's Hospital Colorado, University of Colorado School of Medicine, Aurora, Colorado
10
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025; 333:319-328. [PMID: 39405325 PMCID: PMC11480901 DOI: 10.1001/jama.2024.21700] [Citation(s) in RCA: 45] [Impact Index Per Article: 45.0] [Received: 04/18/2024] [Accepted: 09/30/2024] [Indexed: 10/19/2024]
Abstract
Importance Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. Objective To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. Data Sources A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Study Selection Studies evaluating 1 or more LLMs in health care. Data Extraction and Synthesis Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Results Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. 
Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Conclusions and Relevance Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
Affiliation(s)
- Suhana Bedi
  - Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
- Yutong Liu
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Lucy Orr-Ewing
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Dev Dash
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Sanmi Koyejo
  - Department of Computer Science, Stanford University, Stanford, California
- Alison Callahan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Jason A. Fries
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Michael Wornow
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Akshay Swaminathan
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Hyo Jung Hong
  - Department of Anesthesiology, Stanford University, Stanford, California
- Mehr Kashyap
  - Stanford University School of Medicine, Stanford, California
- Akash R. Chaurasia
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
- Nirav R. Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Karandeep Singh
  - Digital Health Innovation, University of California San Diego Health, San Diego
- Troy Tazbaz
  - Digital Health Center of Excellence, US Food and Drug Administration, Washington, DC
- Arnold Milstein
  - Clinical Excellence Research Center, Stanford University, Stanford, California
- Michael A. Pfeffer
  - Department of Medicine, Stanford University School of Medicine, Stanford, California
- Nigam H. Shah
  - Clinical Excellence Research Center, Stanford University, Stanford, California
  - Center for Biomedical Informatics Research, Stanford University, Stanford, California
11
Malik S, Das R, Thongtan T, Thompson K, Dbouk N. AI in Hepatology: Revolutionizing the Diagnosis and Management of Liver Disease. J Clin Med 2024; 13:7833. [PMID: 39768756 PMCID: PMC11678868 DOI: 10.3390/jcm13247833] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 12/13/2024] [Accepted: 12/19/2024] [Indexed: 01/11/2025] Open
Abstract
The integration of artificial intelligence (AI) into hepatology is revolutionizing the diagnosis and management of liver diseases amidst a rising global burden of conditions like metabolic-associated steatotic liver disease (MASLD). AI harnesses vast datasets and complex algorithms to enhance clinical decision making and patient outcomes. AI's applications in hepatology span a variety of conditions, including autoimmune hepatitis, primary biliary cholangitis, primary sclerosing cholangitis, MASLD, hepatitis B, and hepatocellular carcinoma. It enables early detection, predicts disease progression, and supports more precise treatment strategies. Despite its transformative potential, challenges remain, including data integration, algorithm transparency, and computational demands. This review examines the current state of AI in hepatology, exploring its applications, limitations, and the opportunities it presents to enhance liver health and care delivery.
Affiliation(s)
- Sheza Malik
  - Department of Internal Medicine, Rochester General Hospital, Rochester, NY 14621, USA
- Rishi Das
  - Division of Digestive Diseases, Emory University School of Medicine, Atlanta, GA 30322, USA
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA 30322, USA
- Thanita Thongtan
  - Division of Digestive Diseases, Emory University School of Medicine, Atlanta, GA 30322, USA
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA 30322, USA
- Kathryn Thompson
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA 30322, USA
- Nader Dbouk
  - Division of Digestive Diseases, Emory University School of Medicine, Atlanta, GA 30322, USA
  - Department of Medicine, Emory University School of Medicine, Atlanta, GA 30322, USA
  - Emory Transplant Center, Emory University School of Medicine, Atlanta, GA 30322, USA
12
Bürgisser N, Chalot E, Mehouachi S, Buclin CP, Lauper K, Courvoisier DS, Mongin D. Large language models for accurate disease detection in electronic health records: the examples of crystal arthropathies. RMD Open 2024; 10:e005003. [PMID: 39794274 PMCID: PMC11664341 DOI: 10.1136/rmdopen-2024-005003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2024] [Accepted: 11/27/2024] [Indexed: 01/13/2025] Open
Abstract
OBJECTIVES We propose and test a framework to detect disease diagnosis using a recent large language model (LLM), Meta's Llama-3-8B, on French-language electronic health record (EHR) documents. Specifically, it focuses on detecting gout ('goutte' in French), a ubiquitous French term that has multiple meanings beyond the disease. The study compares the performance of the LLM-based framework with traditional natural language processing techniques and tests its dependence on the parameters used. METHODS The framework was developed using a training and testing set of 700 paragraphs assessing 'gout' from a random selection of EHR documents from a tertiary university hospital in Geneva, Switzerland. All paragraphs were manually reviewed and classified by two healthcare professionals into disease (true gout) and non-disease (gold standard). The LLM's accuracy was tested using few-shot and chain-of-thought prompting and compared with a regular expression (regex)-based method, focusing on the effects of model parameters and prompt structure. The framework was further validated on 600 paragraphs assessing 'Calcium Pyrophosphate Deposition Disease (CPPD)'. RESULTS The LLM-based algorithm outperformed the regex method, achieving a 92.7% (88.7%-95.4%) positive predictive value, a 96.6% (94.6%-97.8%) negative predictive value and an accuracy of 95.4% (93.6%-96.7%) for gout. In the validation set on CPPD, accuracy was 94.1% (90.2%-97.6%). The LLM framework performed well over a wide range of parameter values. CONCLUSION LLMs accurately detected disease diagnoses from EHRs, even in non-English languages. They could facilitate creating large disease registers in any language, improving disease care assessment and patient recruitment for clinical trials.
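For context, a regex baseline of the kind the study compares against might look like the sketch below. The exclusion patterns are hypothetical (the study's actual French-language rules and prompts are not given in the abstract); the point is that 'goutte' also means "drop" in French, so a keyword match must be filtered by context:

```python
import re

# Hypothetical exclusion contexts in which the French word "goutte" does not
# denote the disease gout (the study's actual patterns are not published):
NON_DISEASE_PATTERNS = [
    r"goutte[\s-]*à[\s-]*goutte",  # "drop by drop", i.e. an IV drip
    r"pas\s+de\s+goutte",          # negation: "no gout"
]

def classify_goutte_mention(paragraph):
    """Return True if the paragraph likely refers to gout as a disease."""
    text = paragraph.lower()
    # First require a whole-word mention of "goutte" at all.
    if not re.search(r"\bgoutte\b", text):
        return False
    # Then reject mentions that match a known non-disease context.
    return not any(re.search(p, text) for p in NON_DISEASE_PATTERNS)
```

Such hand-written rules are brittle, which is exactly the gap the LLM-based framework is reported to close.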
Affiliation(s)
- Nils Bürgisser
  - Division of Rheumatology, Geneva University Hospitals, Geneva, Switzerland
  - Division of Internal Medicine, Geneva University Hospitals, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Etienne Chalot
  - Information Systems Directorate, Geneva University Hospitals, Geneva, Switzerland
- Samia Mehouachi
  - Division of Rheumatology, Geneva University Hospitals, Geneva, Switzerland
- Clement P. Buclin
  - Division of Internal Medicine, Geneva University Hospitals, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Kim Lauper
  - Division of Rheumatology, Geneva University Hospitals, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Delphine S. Courvoisier
  - Division of Rheumatology, Geneva University Hospitals, Geneva, Switzerland
  - Quality of Care Division, Geneva University Hospitals, Geneva, Switzerland
- Denis Mongin
  - Division of Rheumatology, Geneva University Hospitals, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
13
Lewandowska M, Street D, Yim J, Jones S, Viney R. Artificial intelligence in radiation therapy treatment planning: A discrete choice experiment. J Med Radiat Sci 2024. [PMID: 39705152 DOI: 10.1002/jmrs.843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2024] [Accepted: 11/28/2024] [Indexed: 12/22/2024] Open
Abstract
INTRODUCTION The application of artificial intelligence (AI) in radiation therapy holds promise for addressing challenges such as healthcare staff shortages, efficiency demands, and treatment planning variations. Increased AI adoption has the potential to standardise treatment protocols, enhance quality, improve patient outcomes, and reduce costs. However, drawbacks include impacts on employment and algorithmic biases, making it crucial to navigate trade-offs. A discrete choice experiment (DCE) was undertaken to examine the AI-related characteristics radiation oncology professionals think are most important for adoption in radiation therapy treatment planning. METHODS Radiation oncology professionals completed an online discrete choice experiment to express their preferences about AI systems for radiation therapy planning, which were described by five attributes, each with 2-4 levels: accuracy, automation, exploratory ability, compatibility with other systems and impact on workload. The survey also included questions about attitudes to AI. Choices were modelled using mixed logit regression. RESULTS The survey was completed by 82 respondents. The results showed they preferred AI systems that offer the largest time saving, and that provide explanations of the AI reasoning (both in-depth and basic). They also favoured systems that provide improved contouring precision compared with manual systems. Respondents emphasised the importance of AI systems being cost-effective, while also recognising AI's impact on professional roles, responsibilities, and service delivery. CONCLUSIONS This study provides important information about radiation oncology professionals' priorities for AI in treatment planning. The findings from this study can be used to inform future research on economic evaluations and management perspectives of AI-driven technologies in radiation therapy.
Affiliation(s)
- Milena Lewandowska
  - Centre for Health Economics Research and Evaluation, University of Technology Sydney, Sydney, New South Wales, Australia
- Deborah Street
  - Centre for Health Economics Research and Evaluation, University of Technology Sydney, Sydney, New South Wales, Australia
- Jackie Yim
  - Centre for Health Economics Research and Evaluation, University of Technology Sydney, Sydney, New South Wales, Australia
  - Radiation Oncology, Royal North Shore Hospital, Sydney, New South Wales, Australia
- Scott Jones
  - Radiation Oncology, Princess Alexandra Hospital, Raymond Terrace, Brisbane, Queensland, Australia
- Rosalie Viney
  - Centre for Health Economics Research and Evaluation, University of Technology Sydney, Sydney, New South Wales, Australia
14
Chen Y, Lehmann CU, Malin B. Digital Information Ecosystems in Modern Care Coordination and Patient Care Pathways and the Challenges and Opportunities for AI Solutions. J Med Internet Res 2024; 26:e60258. [PMID: 39622048 PMCID: PMC11650087 DOI: 10.2196/60258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 08/26/2024] [Accepted: 10/28/2024] [Indexed: 02/27/2025] Open
Abstract
The integration of digital technologies into health care has significantly enhanced the efficiency and effectiveness of care coordination. Our perspective paper explores the digital information ecosystems in modern care coordination, focusing on the processes of information generation, updating, transmission, and exchange along a patient's care pathway. We identify several challenges within this ecosystem, including interoperability issues, information silos, hard-to-map patient care journeys, increased workload on health care professionals, coordination and communication gaps, and compliance with privacy regulations. These challenges are often associated with inefficiencies and diminished care quality. We also examine how emerging artificial intelligence (AI) tools have the potential to enhance the management of patient information flow. Specifically, AI can boost interoperability across diverse health systems; optimize and monitor patient care pathways; improve information retrieval and care transitions; humanize health care by integrating patients' desired outcomes and patient-reported outcome measures; and optimize clinical workflows, resource allocation, and digital tool usability and user experiences. By strategically leveraging AI, health care systems can establish a more robust and responsive digital information ecosystem, improving care coordination and patient outcomes. This perspective underscores the importance of continued research and investment in AI technologies in patient care pathways. We advocate for a thoughtful integration of AI into health care practices to fully realize its potential in revolutionizing care coordination.
Affiliation(s)
- You Chen
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
  - Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Christoph U Lehmann
  - Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, United States
  - Institut für Medizinische Informatik, Universitätsklinikum Heidelberg, Heidelberg, Germany
- Bradley Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
  - Department of Computer Science, Vanderbilt University, Nashville, TN, United States
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
15
Chrysafi P, Lam B, Carton S, Patell R. From Code to Clots: Applying Machine Learning to Clinical Aspects of Venous Thromboembolism Prevention, Diagnosis, and Management. Hamostaseologie 2024; 44:429-445. [PMID: 39657652 DOI: 10.1055/a-2415-8408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2024] Open
Abstract
The high incidence of venous thromboembolism (VTE) globally and the morbidity and mortality burden associated with the disease make it a pressing issue. Machine learning (ML) can improve VTE prevention, detection, and treatment. The ability of this novel technology to process large amounts of high-dimensional data can help identify new risk factors and better risk stratify patients for thromboprophylaxis. Applications of ML for VTE include systems that interpret medical imaging, assess the severity of the VTE, tailor treatment according to individual patient needs, and identify VTE cases to facilitate surveillance. Generative artificial intelligence may be leveraged to design new molecules such as new anticoagulants, generate synthetic data to expand datasets, and reduce clinical burden by assisting in generating clinical notes. Potential challenges in the applications of these novel technologies include the availability of multidimensional large datasets, prospective studies and clinical trials to ensure safety and efficacy, continuous quality assessment to maintain algorithm accuracy, mitigation of unwanted bias, and regulatory and legal guardrails to protect patients and providers. We propose a practical approach for clinicians to integrate ML into research, from choosing appropriate problems to integrating ML into clinical workflows. ML offers much promise and opportunity for clinicians and researchers in VTE to translate this technology into the clinic and directly benefit the patients.
Affiliation(s)
- Pavlina Chrysafi
  - Department of Medicine, Mount Auburn Hospital, Harvard Medical School, Cambridge, Massachusetts, United States
- Barbara Lam
  - Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States
  - Division of Clinical Informatics, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States
- Samuel Carton
  - Department of Computer Science, College of Engineering and Physical Sciences, University of New Hampshire, Durham, New Hampshire, United States
- Rushad Patell
  - Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, United States
16
Matute-González M, Darnell A, Comas-Cufí M, Pazó J, Soler A, Saborido B, Mauro E, Turnes J, Forner A, Reig M, Rimola J. Utilizing a domain-specific large language model for LI-RADS v2018 categorization of free-text MRI reports: a feasibility study. Insights Imaging 2024; 15:280. [PMID: 39576290 PMCID: PMC11584817 DOI: 10.1186/s13244-024-01850-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2024] [Accepted: 10/22/2024] [Indexed: 11/25/2024] Open
Abstract
OBJECTIVE To develop a domain-specific large language model (LLM) for LI-RADS v2018 categorization of hepatic observations based on free-text descriptions extracted from MRI reports. MATERIAL AND METHODS This retrospective study included 291 small liver observations, divided into training (n = 141), validation (n = 30), and test (n = 120) datasets. Of these, 120 were fictitious, and 171 were extracted from 175 MRI reports from a single institution. The algorithm's performance was compared with that of two independent radiologists and one hepatologist in a human replacement scenario, and considering two combined strategies (double reading with arbitration and triage). Agreement on LI-RADS category and dichotomic malignancy (LR-4, LR-5, and LR-M) was estimated using linear-weighted κ statistics and Cohen's κ, respectively. Sensitivity and specificity for LR-5 were calculated. The consensus agreement of three other radiologists served as the ground truth. RESULTS The model showed moderate agreement against the ground truth for both LI-RADS categorization (κ = 0.54 [95% CI: 0.42-0.65]) and the dichotomized approach (κ = 0.58 [95% CI: 0.42-0.73]). Sensitivity and specificity for LR-5 were 0.76 (95% CI: 0.69-0.86) and 0.96 (95% CI: 0.91-1.00), respectively. When the chatbot was used as a triage tool, performance improved for LI-RADS categorization (κ = 0.86/0.87 for the two independent radiologists and κ = 0.76 for the hepatologist), dichotomized malignancy (κ = 0.94/0.91 and κ = 0.87) and LR-5 identification (1.00/0.98 and 0.85 sensitivity, 0.96/0.92 and 0.92 specificity), with no statistical significance compared to the human readers' individual performance. Through this strategy, the workload decreased by 45%.
CONCLUSION LI-RADS v2018 categorization from unlabelled MRI reports is feasible using our LLM, and it enhances the efficiency of data curation.
CRITICAL RELEVANCE STATEMENT Our proof-of-concept study provides novel insights into the potential applications of LLMs, offering a real-world example of how these tools could be integrated into a local workflow to optimize data curation for research purposes. KEY POINTS Automatic LI-RADS categorization from free-text reports would be beneficial to workflow and data mining. LiverAI, a GPT-4-based model, supported various strategies improving data curation efficiency by up to 60%. LLMs can integrate into workflows, significantly reducing radiologists' workload.
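The triage strategy described above (accept the model's LI-RADS category only for confident outputs, route the rest to a human reader) can be sketched as follows. The confidence threshold, the `fake_llm` stand-in, and the toy reports are all hypothetical; the study's actual model and routing rules are not published in the abstract:

```python
def triage(observations, llm_label, reviewer):
    """Route each observation: accept the model's LI-RADS category when its
    confidence clears a threshold, otherwise send it to a human reader.
    `llm_label` maps a report snippet to (category, confidence) and
    `reviewer` is the human fallback; both are caller-supplied."""
    auto, manual = [], []
    for obs in observations:
        category, confidence = llm_label(obs)
        if confidence >= 0.9:  # the threshold is a hypothetical choice
            auto.append((obs, category))
        else:
            manual.append((obs, reviewer(obs)))
    workload_reduction = len(auto) / len(observations)
    return auto, manual, workload_reduction

# Toy stand-ins for the fine-tuned LLM and the human reader.
def fake_llm(report):
    return ("LR-5", 0.95) if "washout" in report else ("LR-3", 0.60)

def human_reader(report):
    return "LR-3"

auto, manual, reduction = triage(
    ["arterial enhancement with washout", "indeterminate 12 mm nodule"],
    fake_llm,
    human_reader,
)
```

The returned `workload_reduction` is the fraction of observations the human readers never see, which is the quantity the study reports as a 45% decrease.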
Affiliation(s)
- Mario Matute-González
  - BCLC Group, Radiology Department, Hospital Clínic of Barcelona, IDIBAPS, Barcelona, Spain
- Anna Darnell
  - BCLC Group, Radiology Department, Hospital Clínic of Barcelona, IDIBAPS, Barcelona, Spain
- Marc Comas-Cufí
  - Computer Science, Applied Mathematics and Statistics Department, University of Girona, Girona, Spain
- Javier Pazó
  - Information Technology Department, Spanish Association for the Study of the Liver, Madrid, Spain
- Alexandre Soler
  - BCLC Group, Radiology Department, Hospital Clínic of Barcelona, IDIBAPS, Barcelona, Spain
- Belén Saborido
  - BCLC Group, Fundació Clínic per la Recerca Biomèdica-IDIBAPS, Barcelona, Spain
- Ezequiel Mauro
  - BCLC Group, Liver Unit, Hospital Clínic of Barcelona, Fundació Clínic per a la Recerca Biomédica (FCRB), IDIBAPS, University of Barcelona, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Barcelona, Spain
- Juan Turnes
  - Gastroenterology and Hepatology, Pontevedra University Hospital Complex, Pontevedra, Spain
  - Galicia Sur Health Research Institute, Vigo, Spain
- Alejandro Forner
  - BCLC Group, Liver Unit, Hospital Clínic of Barcelona, Fundació Clínic per a la Recerca Biomédica (FCRB), IDIBAPS, University of Barcelona, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Barcelona, Spain
- María Reig
  - BCLC Group, Liver Unit, Hospital Clínic of Barcelona, Fundació Clínic per a la Recerca Biomédica (FCRB), IDIBAPS, University of Barcelona, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Barcelona, Spain
- Jordi Rimola
  - BCLC Group, Radiology Department, Hospital Clínic of Barcelona, IDIBAPS, Barcelona, Spain
  - Centro de Investigación Biomédica en Red de Enfermedades Hepáticas y Digestivas (CIBERehd), Barcelona, Spain
17
Fu Z, Fu S, Huang Y, He W, Zhong Z, Guo Y, Lin Y. Application of large language model combined with retrieval enhanced generation technology in digestive endoscopic nursing. Front Med (Lausanne) 2024; 11:1500258. [PMID: 39568739 PMCID: PMC11577783 DOI: 10.3389/fmed.2024.1500258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Accepted: 10/22/2024] [Indexed: 11/22/2024] Open
Abstract
Background Although large language models (LLMs) have demonstrated powerful capabilities in general domains, they may output information in the medical field that could be incorrect, incomplete, or fabricated. They are also unable to answer personalized questions related to departments or individual patient health. Retrieval-augmented generation technology (RAG) can introduce external knowledge bases and utilize the retrieved information to generate answers or text, thereby enhancing prediction accuracy. Method We introduced internal departmental data and 17 commonly used gastroenterology guidelines as a knowledge base. Based on RAG, we developed the Endo-chat medical chat application, which can answer patient questions related to gastrointestinal endoscopy. We then included 200 patients undergoing gastrointestinal endoscopy, randomly divided into two groups of 100 each, for a questionnaire survey. A comparative evaluation was conducted between the traditional manual methods and Endo-chat. Results Compared to ChatGPT, Endo-chat can accurately and professionally answer relevant questions after matching the knowledge base. In terms of response efficiency, completeness, and patient satisfaction, Endo-chat outperformed manual methods significantly. There was no statistically significant difference in response accuracy between the two. Patients showed a preference for AI services and expressed support for the introduction of AI. All participating nurses in the survey believed that introducing AI could reduce nursing workload. Conclusion In clinical practice, Endo-chat can be used as a highly effective auxiliary tool for digestive endoscopic care.
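The RAG pattern behind an application like Endo-chat (retrieve the relevant guideline passages, then ground the model's answer in them) can be sketched as below. The knowledge-base snippets and the word-overlap retriever are illustrative stand-ins for the department's real corpus and the embedding-based search a production system would use:

```python
# Toy knowledge base standing in for departmental documents and guidelines.
KNOWLEDGE_BASE = [
    "Fast for at least 6 hours before a gastroscopy; clear fluids up to 2 hours.",
    "After colonoscopy with polypectomy, avoid heavy lifting for 24 hours.",
    "Bowel preparation requires a low-residue diet the day before colonoscopy.",
]

def retrieve(query, docs, k=1):
    """Rank documents by word overlap with the query, a crude stand-in for
    embedding similarity search."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, docs):
    """Ground the model by pasting the retrieved passages into the prompt."""
    context = "\n".join(retrieve(query, docs))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}")

prompt = build_prompt("How long should I fast before a gastroscopy?",
                      KNOWLEDGE_BASE)
```

The assembled `prompt` would then be sent to the LLM; restricting the answer to retrieved context is what curbs the fabricated or non-personalized answers the Background paragraph describes.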
Affiliation(s)
- Zhaoli Fu
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Siyuan Fu
  - The Fifth Affiliated Hospital of Guangzhou Medical University, Guangzhou, China
- Yuan Huang
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Wenfang He
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Zhuodan Zhong
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Yan Guo
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
- Yanfeng Lin
  - Department of Gastroenterology, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
18
Wang B, Lai J, Cao H, Jin F, Li Q, Tang M, Yao C, Zhang P. Enhancing the interoperability and transparency of real-world data extraction in clinical research: evaluating the feasibility and impact of a ChatGLM implementation in Chinese hospital settings. Eur Heart J Digit Health 2024; 5:712-724. [PMID: 39563908 PMCID: PMC11570364 DOI: 10.1093/ehjdh/ztae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Revised: 07/04/2024] [Accepted: 08/19/2024] [Indexed: 11/21/2024]
Abstract
Aims This study aims to assess the feasibility and impact of the implementation of the ChatGLM for real-world data (RWD) extraction in hospital settings. The primary focus of this research is on the effectiveness of ChatGLM-driven data extraction compared with that of manual processes associated with the electronic source data repository (ESDR) system. Methods and results The researchers developed the ESDR system, which integrates ChatGLM, electronic case report forms (eCRFs), and electronic health records. The LLaMA (Large Language Model Meta AI) model was also deployed to compare the extraction accuracy of ChatGLM in free-text forms. A single-centre retrospective cohort study served as a pilot case. Five eCRF forms of 63 subjects, including free-text forms and discharge medication, were evaluated. Data collection involved electronic medical and prescription records collected from 13 departments. The ChatGLM-assisted process was associated with an estimated efficiency improvement of 80.7% in the eCRF data transcription time. The initial manual input accuracy for free-text forms was 99.59%, the ChatGLM data extraction accuracy was 77.13%, and the LLaMA data extraction accuracy was 43.86%. The challenges associated with the use of ChatGLM focus on prompt design, prompt output consistency, prompt output verification, and integration with hospital information systems. Conclusion The main contribution of this study is to validate the use of ESDR tools to address the interoperability and transparency challenges of using ChatGLM for RWD extraction in Chinese hospital settings.
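A minimal sketch of the prompt-and-parse loop such an LLM-based extraction pipeline implies is shown below. The field names, prompt wording, and canned reply are all hypothetical (the study's actual eCRF forms and ChatGLM prompts are not in the abstract); validating the model's output before it enters the eCRF reflects the prompt-output verification challenge the authors highlight:

```python
import json

# Hypothetical eCRF fields for a discharge-medication form.
ECRF_FIELDS = ["discharge_medication", "dose", "frequency"]

def build_extraction_prompt(note):
    """Ask the model to return the eCRF fields as strict JSON."""
    return ("Extract the following fields from the discharge note and reply "
            f"with JSON only, keys {ECRF_FIELDS}, null for missing values.\n"
            f"Note: {note}")

def parse_model_output(raw):
    """Keep only known keys and flag malformed replies for manual review
    rather than letting them enter the eCRF silently."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"needs_manual_review": True}
    return {k: data.get(k) for k in ECRF_FIELDS}

# A canned reply stands in for the actual ChatGLM call.
reply = '{"discharge_medication": "aspirin", "dose": "100 mg", "frequency": "daily"}'
record = parse_model_output(reply)
```

Whitelisting the keys and routing unparsable replies to a human is one simple way to keep the model's 77% extraction accuracy from contaminating the transcribed data.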
Collapse
Affiliation(s)
- Bin Wang
  - School of Clinical Medicine, Tsinghua University, No. 30 Shuangqing Road, Haidian District, Beijing 100084, China
- Junkai Lai
  - Institute of Automation, Chinese Academy of Sciences, No. 95 Zhongguancun Road, Haidian District, Beijing 100080, China
  - Hangzhou LionMed Medical Information Technology Co., Ltd, No.19 Jugong Road, Xixing Sub-District, Hangzhou 310000, China
- Han Cao
  - Medical Data Science Center, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, No. 168 Litang Road, Changping District, Beijing 102218, China
- Feifei Jin
  - Trauma Medicine Center, Peking University People's Hospital, No. 11 Xizhimen South Street, Xicheng District, Beijing 100044, China
  - Key Laboratory of Trauma Treatment and Neural Regeneration, Peking University, Ministry of Education, No. 11 Xizhimen South Street, Xicheng District, Beijing 100044, China
  - National Center for Trauma Medicine of China, No. 11 Xizhimen South Street, Xicheng District, Beijing 100044, China
- Qiang Li
  - Department of Information Administration, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, No. 168 Litang Road, Changping District, Beijing 102218, China
- Mingkun Tang
  - Medical Data Science Center, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, No. 168 Litang Road, Changping District, Beijing 102218, China
- Chen Yao
  - Peking University Clinical Research Institute, Peking University First Hospital, No. 8 Xishiku Street, Xicheng District, Beijing 100034, China
  - Hainan Institute of Real-World Data, No. 32 Kangxiang Road, Qionghai 571437, China
- Ping Zhang
  - Department of Cardiology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, No. 168 Litang Road, Changping District, Beijing 102218, China
19
Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, Pletcher MJ, Lai K. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology 2024; 80:1158-1168. [PMID: 38451962 PMCID: PMC11706764 DOI: 10.1097/hep.0000000000000834] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Accepted: 02/24/2024] [Indexed: 03/09/2024]
Abstract
BACKGROUND AND AIMS Large language models (LLMs) have significant capabilities in clinical information processing tasks. Commercially available LLMs, however, are not optimized for clinical uses and are prone to generating hallucinatory information. Retrieval-augmented generation (RAG) is an enterprise architecture that allows the embedding of customized data into LLMs. This approach "specializes" the LLMs and is thought to reduce hallucinations. APPROACH AND RESULTS We developed "LiVersa," a liver disease-specific LLM, by using our institution's protected health information-compliant text embedding and LLM platform, "Versa." We conducted RAG on 30 publicly available American Association for the Study of Liver Diseases guidance documents to be incorporated into LiVersa. We evaluated LiVersa's performance by conducting 2 rounds of testing. First, we compared LiVersa's outputs versus those of trainees from a previously published knowledge assessment. LiVersa answered all 10 questions correctly. Second, we asked 15 hepatologists to evaluate the outputs of 10 hepatology topic questions generated by LiVersa, OpenAI's ChatGPT 4, and Meta's Large Language Model Meta AI 2. LiVersa's outputs were more accurate but were rated less comprehensive and safe compared to those of ChatGPT 4. CONCLUSIONS In this demonstration, we built disease-specific and protected health information-compliant LLMs using RAG. While LiVersa demonstrated higher accuracy in answering questions related to hepatology, there were some deficiencies due to limitations set by the number of documents used for RAG. LiVersa will likely require further refinement before potential live deployment. The LiVersa prototype, however, is a proof of concept for utilizing RAG to customize LLMs for clinical use cases.
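The RAG pattern described above can be illustrated in a few lines. This is a deliberately minimal sketch: the bag-of-words "embedding" and toy documents are stand-ins for the real text-embedding model and guidance corpus behind a system like LiVersa, and no actual LLM call is made.

```python
# Minimal RAG sketch: retrieve the passages most similar to the question,
# then prepend them to the prompt so the LLM answers from them, not memory.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, documents, k=1):
    """Rank candidate guidance passages by similarity to the question."""
    q = embed(question)
    return sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(question, documents, k=1):
    """Assemble the augmented prompt from the retrieved context."""
    context = "\n".join(retrieve(question, documents, k))
    return f"Answer using only this guidance:\n{context}\n\nQuestion: {question}"
```

Constraining the prompt to retrieved guidance text is the mechanism by which RAG "specializes" a general-purpose LLM and reduces hallucination.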
Affiliation(s)
- Jin Ge
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Steve Sun
  - UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Joseph Owens
  - UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Victor Galvez
  - UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
- Oksana Gologorskaya
  - UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
  - Bakar Computational Health Sciences Institute, University of California – San Francisco, San Francisco, CA
- Jennifer C. Lai
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Mark J. Pletcher
  - Department of Epidemiology and Biostatistics, University of California – San Francisco, San Francisco, CA
- Ki Lai
  - UCSF Health Information Technology, University of California – San Francisco, San Francisco, CA
20
Wong CR, Flores YN, Avila A, Tieu L, Crespi CM, May FP, Bell D, Glenn B, Bastani R. Improving the Accuracy and Precision of Disease Identification When Utilizing Ehr Data for Research: the Case for Hepatocellular Carcinoma. RESEARCH SQUARE 2024:rs.3.rs-4993106. [PMID: 39483882 PMCID: PMC11527239 DOI: 10.21203/rs.3.rs-4993106/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/03/2024]
Abstract
Objective We assessed the performance of ICD codes to identify patients with hepatocellular carcinoma (HCC) in a large academic health system and determined whether employing an algorithm using a combination of ICD codes could deliver higher accuracy and precision than single ICD codes in identifying HCC cases using electronic health record (EHR) data. Results The use of a single ICD code entry for HCC (ICD-9-CM 155.0 or ICD-10-CM C22.0) in our cohort of 1,007 established ambulatory care patients with potential HCC yielded 58% false positives (not true HCC cases) based on chart reviews. We developed an ICD code-based algorithm that prioritized positive predictive value (PPV), F-score, and accuracy to minimize false positives and negatives. The highest performing algorithm required at least 10 ICD code entries for HCC and the sum of ICD code entries for HCC to exceed the sum of ICD code entries for non-HCC malignancies. The algorithm demonstrated high performance (PPV 97.4%, F-score 0.92, accuracy 94%), which was internally validated (PPV 92.3%, F-score 0.90, accuracy 91%) using a separate sample of potential HCC cases. Our findings support the need to assess the accuracy and precision of ICD codes before using EHR data to study HCC more broadly.
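The best-performing rule above is simple enough to state in code. A minimal sketch, assuming each patient's ICD code entries are available as a list of strings; the function and argument names are illustrative, not the authors':

```python
# Sketch of the paper's top algorithm: a patient is flagged as an HCC case
# when they have at least 10 HCC ICD code entries AND more HCC entries than
# entries for non-HCC malignancies.

HCC_CODES = {"155.0", "C22.0"}  # ICD-9-CM / ICD-10-CM codes for HCC (from the abstract)

def is_hcc_case(code_entries, non_hcc_malignancy_codes):
    """code_entries: all ICD code strings recorded for one patient."""
    n_hcc = sum(1 for c in code_entries if c in HCC_CODES)
    n_other = sum(1 for c in code_entries if c in non_hcc_malignancy_codes)
    return n_hcc >= 10 and n_hcc > n_other
```

Requiring repeated HCC coding filters out the one-off rule-out codes that drove the 58% false-positive rate of single-code identification.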
Affiliation(s)
- Carrie R Wong
  - Vatche and Tamar Manoukian Division of Digestive Diseases, Department of Medicine, University of California, Los Angeles
- Yvonne N Flores
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
- Analissa Avila
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
- Lina Tieu
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
- Catherine M Crespi
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
- Folasade P May
  - Vatche and Tamar Manoukian Division of Digestive Diseases, Department of Medicine, University of California, Los Angeles
- Douglas Bell
  - Division of General Internal Medicine, Department of Medicine, University of California, Los Angeles
- Beth Glenn
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
- Roshan Bastani
  - UCLA Center for Cancer Prevention and Control and UCLA-Kaiser Permanente Center for Health Equity
21
Patel PV, Davis C, Ralbovsky A, Tinoco D, Williams CY, Slatter S, Naderalvojoud B, Rosen MJ, Hernandez-Boussard T, Rudrapatna V. Large Language Models Outperform Traditional Natural Language Processing Methods in Extracting Patient-Reported Outcomes in Inflammatory Bowel Disease. GASTRO HEP ADVANCES 2024; 4:100563. [PMID: 39877865 PMCID: PMC11772946 DOI: 10.1016/j.gastha.2024.10.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/04/2024] [Accepted: 10/04/2024] [Indexed: 01/31/2025]
Abstract
Background and Aims Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting 3 IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across 2 institutions. Methods Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California, San Francisco, and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, and positive and negative predictive value. In addition, we conducted fairness and error assessments. Results Interrater reliability between annotators was >90%. On the University of California, San Francisco test set (n = 50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n = 250), tNLP models failed to generalize (61%-62% accuracy) while GPT-4 maintained accuracies >90%. Pathways Language Model-2 and Generative Pre-trained Transformer-4 showed similar performance. No biases were detected based on demographics or diagnosis. Conclusion LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.
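The comparison above rests on standard confusion-matrix metrics. As a generic sketch (not the authors' code), accuracy, sensitivity, specificity, and positive and negative predictive value can all be derived from paired predicted and annotated labels:

```python
# Generic confusion-matrix metrics for a binary extraction task
# (e.g. "abdominal pain present: yes/no" per note).

def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else None,  # recall
        "specificity": tn / (tn + fp) if tn + fp else None,
        "ppv": tp / (tp + fp) if tp + fp else None,  # positive predictive value
        "npv": tn / (tn + fn) if tn + fn else None,  # negative predictive value
    }
```

Computing the same metrics on an external test set, as the authors did at Stanford, is what exposed the tNLP models' failure to generalize.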
Affiliation(s)
- Perseus V. Patel
  - Department of Pediatrics, University of California, San Francisco, San Francisco, California
  - Division of Pediatric Gastroenterology, Stanford University School of Medicine, Palo Alto, California
- Conner Davis
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Amariel Ralbovsky
  - Department of Pediatrics, University of California, San Francisco, San Francisco, California
- Daniel Tinoco
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Christopher Y.K. Williams
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Shadera Slatter
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Behzad Naderalvojoud
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Palo Alto, California
- Michael J. Rosen
  - Division of Pediatric Gastroenterology, Stanford University School of Medicine, Palo Alto, California
- Tina Hernandez-Boussard
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Palo Alto, California
- Vivek Rudrapatna
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
  - Division of Gastroenterology, Department of Medicine, University of California, San Francisco, San Francisco, California
22
Far A, Bastani A, Lee A, Gologorskaya O, Huang CY, Pletcher MJ, Lai JC, Ge J. Evaluating the positive predictive value of code-based identification of cirrhosis and its complications utilizing GPT-4. Hepatology 2024:01515467-990000000-01046. [PMID: 39378414 PMCID: PMC11975717 DOI: 10.1097/hep.0000000000001115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/22/2024] [Accepted: 09/23/2024] [Indexed: 10/10/2024]
Abstract
BACKGROUND AND AIMS Diagnosis code classification is a common method for cohort identification in cirrhosis research, but it is often inaccurate and must be augmented by labor-intensive chart review. Natural language processing using large language models (LLMs) is a potentially more accurate method. To assess LLMs' potential for cirrhosis cohort identification, we compared code-based versus LLM-based classification with chart review as a "gold standard." APPROACH AND RESULTS We extracted and conducted a limited chart review of 3788 discharge summaries of cirrhosis admissions. We engineered zero-shot prompts using Generative Pre-trained Transformer 4 (GPT-4) to determine whether cirrhosis and its complications were active hospitalization problems. We calculated positive predictive values (PPVs) of LLM-based classification versus limited chart review and PPVs of code-based versus LLM-based classification as a "silver standard" in all 3788 summaries. Compared to gold standard chart review, code-based classification achieved PPVs of 82.2% for identifying cirrhosis, 41.7% for HE, 72.8% for ascites, 59.8% for gastrointestinal bleeding, and 48.8% for spontaneous bacterial peritonitis. Compared to chart review, GPT-4 achieved accuracies of 87.8%-98.8% for identifying cirrhosis and its complications. Using the LLM as a silver standard, code-based classification achieved PPVs of 79.8% for identifying cirrhosis, 53.9% for HE, 55.3% for ascites, 67.6% for gastrointestinal bleeding, and 65.5% for spontaneous bacterial peritonitis. CONCLUSIONS LLM-based classification was highly accurate versus manual chart review in identifying cirrhosis and its complications. This allowed us to assess the performance of code-based classification at scale using LLMs as a silver standard. These results suggest LLMs could augment or replace code-based cohort classification and raise questions regarding the necessity of chart review.
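The zero-shot prompting pattern the authors describe can be sketched as follows. The wording of the template and the `parse_answer` helper are hypothetical illustrations, not the authors' actual prompt, and the real system would send the assembled prompt to a GPT-4 API.

```python
# Hypothetical zero-shot prompt for classifying whether a condition was an
# active problem in a discharge summary. Template wording is illustrative.

PROMPT_TEMPLATE = (
    "You are reviewing a hospital discharge summary.\n"
    "Summary:\n{summary}\n\n"
    "Was {condition} an active problem during this hospitalization? "
    "Answer strictly 'yes' or 'no'."
)

def build_zero_shot_prompt(summary, condition):
    """Fill the template for one summary/condition pair."""
    return PROMPT_TEMPLATE.format(summary=summary, condition=condition)

def parse_answer(raw):
    """Normalize the model's free-text reply to a boolean label."""
    return raw.strip().lower().startswith("yes")
```

Constraining the model to a yes/no answer is what makes the LLM output directly comparable to code-based labels and chart-review judgments at scale.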
Affiliation(s)
- Aryana Far
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Asal Bastani
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Albert Lee
  - Academic Research Services, University of California – San Francisco, San Francisco, CA
  - Bakar Computational Health Sciences Institute, University of California – San Francisco, San Francisco, CA
- Oksana Gologorskaya
  - Academic Research Services, University of California – San Francisco, San Francisco, CA
  - Bakar Computational Health Sciences Institute, University of California – San Francisco, San Francisco, CA
- Chiung-Yu Huang
  - Department of Epidemiology and Biostatistics, University of California – San Francisco, San Francisco, CA
- Mark J. Pletcher
  - Department of Epidemiology and Biostatistics, University of California – San Francisco, San Francisco, CA
- Jennifer C. Lai
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
- Jin Ge
  - Division of Gastroenterology and Hepatology, Department of Medicine, University of California – San Francisco, San Francisco, CA
23
Inojosa H, Voigt I, Wenk J, Ferber D, Wiest I, Antweiler D, Weicken E, Gilbert S, Kather JN, Akgün K, Ziemssen T. Integrating large language models in care, research, and education in multiple sclerosis management. Mult Scler 2024; 30:1392-1401. [PMID: 39308156 PMCID: PMC11514324 DOI: 10.1177/13524585241277376] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2024] [Revised: 06/26/2024] [Accepted: 08/06/2024] [Indexed: 10/25/2024]
Abstract
Techniques derived from generative artificial intelligence (AI), specifically large language models (LLMs), offer transformative potential for the management of multiple sclerosis (MS). Recent LLMs have exhibited remarkable skills in producing and understanding human-like texts. The integration of AI in imaging applications and the deployment of foundation models for the classification and prognosis of disease course, including disability progression and even therapy response, have received considerable attention. However, the use of LLMs within the context of MS remains relatively underexplored. LLMs have the potential to support several activities related to MS management: clinical decision support systems could help select proper disease-modifying therapies, AI-based tools could leverage unstructured real-world data for research, and virtual tutors may provide adaptive education materials for neurologists and people with MS in the foreseeable future. In this focused review, we explore practical applications of LLMs across the continuum of MS management as an initial scope for future analyses, reflecting on regulatory hurdles and the indispensable role of human supervision.
Affiliation(s)
- Hernan Inojosa
  - Center of Clinical Neuroscience, Department of Neurology, University Hospital Carl Gustav Carus Dresden, Technical University Dresden, Dresden, Germany
- Isabel Voigt
  - Center of Clinical Neuroscience, Department of Neurology, University Hospital Carl Gustav Carus Dresden, Technical University Dresden, Dresden, Germany
- Judith Wenk
  - Center of Clinical Neuroscience, Department of Neurology, University Hospital Carl Gustav Carus Dresden, Technical University Dresden, Dresden, Germany
- Dyke Ferber
  - Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Isabella Wiest
  - Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Dario Antweiler
  - Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany
- Eva Weicken
  - Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
  - Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI, Berlin, Germany
- Stephen Gilbert
  - Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Jakob Nikolas Kather
  - Else Kröner Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Katja Akgün
  - Center of Clinical Neuroscience, Department of Neurology, University Hospital Carl Gustav Carus Dresden, Technical University Dresden, Dresden, Germany
- Tjalf Ziemssen
  - Center of Clinical Neuroscience, Department of Neurology, University Hospital Carl Gustav Carus Dresden, Technical University Dresden, Dresden, Germany
24
Patel PV, Davis C, Ralbovsky A, Tinoco D, Williams CYK, Slatter S, Naderalvojoud B, Rosen MJ, Hernandez-Boussard T, Rudrapatna V. Large language models outperform traditional natural language processing methods in extracting patient-reported outcomes in IBD. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.09.05.24313139. [PMID: 39281744 PMCID: PMC11398594 DOI: 10.1101/2024.09.05.24313139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 09/18/2024]
Abstract
Background and Aims Patient-reported outcomes (PROs) are vital in assessing disease activity and treatment outcomes in inflammatory bowel disease (IBD). However, manual extraction of these PROs from the free-text of clinical notes is burdensome. We aimed to improve data curation from free-text information in the electronic health record, making it more available for research and quality improvement. This study aimed to compare traditional natural language processing (tNLP) and large language models (LLMs) in extracting three IBD PROs (abdominal pain, diarrhea, fecal blood) from clinical notes across two institutions. Methods Clinic notes were annotated for each PRO using preset protocols. Models were developed and internally tested at the University of California San Francisco (UCSF), and then externally validated at Stanford University. We compared tNLP and LLM-based models on accuracy, sensitivity, specificity, positive and negative predictive value. Additionally, we conducted fairness and error assessments. Results Inter-rater reliability between annotators was >90%. On the UCSF test set (n=50), the top-performing tNLP models showcased accuracies of 92% (abdominal pain), 82% (diarrhea) and 80% (fecal blood), comparable to GPT-4, which was 96%, 88%, and 90% accurate, respectively. On external validation at Stanford (n=250), tNLP models failed to generalize (61-62% accuracy) while GPT-4 maintained accuracies >90%. PaLM-2 and GPT-4 showed similar performance. No biases were detected based on demographics or diagnosis. Conclusions LLMs are accurate and generalizable methods for extracting PROs. They maintain excellent accuracy across institutions, despite heterogeneity in note templates and authors. Widespread adoption of such tools has the potential to enhance IBD research and patient care.
Affiliation(s)
- Perseus V Patel
  - Department of Pediatrics, University of California San Francisco, San Francisco, CA
  - Division of Pediatric Gastroenterology, Stanford University School of Medicine, Palo Alto, CA
- Conner Davis
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
- Amariel Ralbovsky
  - Department of Pediatrics, University of California San Francisco, San Francisco, CA
- Daniel Tinoco
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
- Christopher Y K Williams
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
- Shadera Slatter
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
- Behzad Naderalvojoud
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Palo Alto, CA
- Michael J Rosen
  - Division of Pediatric Gastroenterology, Stanford University School of Medicine, Palo Alto, CA
- Tina Hernandez-Boussard
  - Stanford Center for Biomedical Informatics Research, Department of Medicine, Stanford University, Palo Alto, CA
- Vivek Rudrapatna
  - Bakar Computational Health Sciences Institute, University of California San Francisco, San Francisco, CA
  - Division of Gastroenterology, Department of Medicine, University of California San Francisco, San Francisco, CA
25
Shah K, Xu AY, Sharma Y, Daher M, McDonald C, Diebo BG, Daniels AH. Large Language Model Prompting Techniques for Advancement in Clinical Medicine. J Clin Med 2024; 13:5101. [PMID: 39274316 PMCID: PMC11396764 DOI: 10.3390/jcm13175101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Revised: 08/23/2024] [Accepted: 08/26/2024] [Indexed: 09/16/2024] Open
Abstract
Large Language Models (LLMs) have the potential to revolutionize clinical medicine by enhancing healthcare access, diagnosis, surgical planning, and education. However, their utilization requires careful prompt engineering to mitigate challenges like hallucinations and biases. Proper utilization of LLMs involves understanding foundational concepts such as tokenization, embeddings, and attention mechanisms, alongside strategic prompting techniques to ensure accurate outputs. For innovative healthcare solutions, it is essential to maintain ongoing collaboration between AI technology and medical professionals. Ethical considerations, including data security and bias mitigation, are critical to their application. By leveraging LLMs as supplementary resources in research and education, we can enhance learning and support knowledge-based inquiries, ultimately advancing the quality and accessibility of medical care. Continued research and development are necessary to fully realize the potential of LLMs in transforming healthcare.
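One of the prompting techniques such reviews cover is few-shot prompting: prepending worked examples so the model imitates the desired format. A generic, hypothetical sketch of assembling such a prompt for a clinical labeling task (the instruction, example format, and labels are invented for illustration):

```python
# Generic few-shot prompt assembly: instruction, worked examples, then the
# query in the same format so the model completes the final "Label:" slot.

def few_shot_prompt(instruction, examples, query):
    """examples: list of (input_text, label) pairs shown before the query."""
    shots = "\n\n".join(f"Note: {x}\nLabel: {y}" for x, y in examples)
    return f"{instruction}\n\n{shots}\n\nNote: {query}\nLabel:"
```

Ending the prompt at "Label:" nudges the model to emit only the label, which is one of the output-constraining tricks that prompt-engineering guides recommend for reducing inconsistent or rambling answers.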
Affiliation(s)
- Krish Shah
  - Warren Alpert Medical School, Brown University, East Providence, RI 02914, USA
- Andrew Y Xu
  - Warren Alpert Medical School, Brown University, East Providence, RI 02914, USA
- Yatharth Sharma
  - Warren Alpert Medical School, Brown University, East Providence, RI 02914, USA
- Mohammed Daher
  - Department of Orthopedics, Warren Alpert Medical School, Brown University, Providence, RI 02912, USA
- Christopher McDonald
  - Department of Orthopedics, Warren Alpert Medical School, Brown University, Providence, RI 02912, USA
- Bassel G Diebo
  - Department of Orthopedics, Warren Alpert Medical School, Brown University, Providence, RI 02912, USA
- Alan H Daniels
  - Department of Orthopedics, Warren Alpert Medical School, Brown University, Providence, RI 02912, USA