1. Ito S, Furukawa E, Okuhara T, Okada H, Kiuchi T. Leveraging artificial intelligence chatbots for anemia prevention: A comparative study of ChatGPT-3.5, Copilot, and Gemini outputs against Google Search results. PEC Innovation 2025;6:100390. PMID: 40276577. PMCID: PMC12020902. DOI: 10.1016/j.pecinn.2025.100390.
Abstract
Aim This study evaluated the understandability, actionability, and readability of text on anemia generated by artificial intelligence (AI) chatbots. Methods This cross-sectional study compared texts generated by ChatGPT-3.5, Microsoft Copilot, and Google Gemini at three levels: "normal," "6th grade," and "PEMAT-P version." Additionally, texts retrieved from the top eight Google Search results for relevant keywords were included for comparison. All texts were written in Japanese. The Japanese version of the PEMAT-P was used to assess understandability and actionability, while jReadability was used for readability. A systematic comparison was conducted to identify the strengths and weaknesses of each source. Results Texts generated by Gemini at the 6th-grade level (n = 26, 86.7 %) and PEMAT-P version (n = 27, 90.0 %), as well as ChatGPT-3.5 at the normal level (n = 21, 80.8 %), achieved significantly higher scores (≥70 %) for understandability and actionability compared to Google Search results (n = 17, 25.4 %, p < 0.001). For readability, Copilot and Gemini texts demonstrated significantly higher percentages of "very readable" to "somewhat difficult" levels than texts retrieved from Google Search (p = 0.000-0.007). Innovation This study is the first to objectively and quantitatively evaluate the understandability and actionability of educational materials on anemia prevention. By utilizing PEMAT-P and jReadability, the study demonstrated the superiority of Gemini in terms of understandability and readability through measurable data. This innovative approach highlights the potential of AI chatbots as a novel method for providing public health information and addressing health disparities. Conclusion AI-generated texts on anemia were found to be more readable and easier to understand than traditional web-based texts, with Gemini demonstrating the highest level of understandability. 
Moving forward, improvements in prompts will be necessary to enhance the integration of visual elements that encourage actionable responses in AI chatbots.
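The PEMAT-P metric used in the study above reduces to a simple proportion: the share of applicable items rated "agree", with ≥70% the conventional cutoff for adequate understandability or actionability. A minimal sketch of that scoring rule (the list-of-ratings encoding is illustrative, not the instrument itself):

```python
def pemat_score(ratings):
    """PEMAT-style score: percentage of applicable items rated 'agree'.

    ratings: list of 1 (agree), 0 (disagree), or None (not applicable).
    """
    applicable = [r for r in ratings if r is not None]
    return 100.0 * sum(applicable) / len(applicable)

def is_adequate(score, cutoff=70.0):
    """Apply the conventional >=70% adequacy threshold."""
    return score >= cutoff
```

Not-applicable items are excluded from the denominator, which is why materials with different numbers of rated items remain comparable.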
Affiliation(s)
- Shinya Ito
- School of Nursing, Kitasato University, 1-15-1, Kitasato, Minami-ku, Sagamihara-city, Kanagawa 252-0373, Japan
- Emi Furukawa
- University Hospital Medical Information Network (UMIN) Center, The University of Tokyo Hospital, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Tsuyoshi Okuhara
- University Hospital Medical Information Network (UMIN) Center, The University of Tokyo Hospital, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Hiroko Okada
- University Hospital Medical Information Network (UMIN) Center, The University of Tokyo Hospital, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Takahiro Kiuchi
- University Hospital Medical Information Network (UMIN) Center, The University of Tokyo Hospital, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
- Department of Health Communication, School of Public Health, Graduate School of Medicine, The University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-8655, Japan
2. Ebby CG, Tse G, Bethel J, Zhao Q, Gerber DM, Kelly MM. Large Language Models to Summarize Pediatric Admission Notes Into Plain Language. Pediatrics 2025;155:e2024069515. PMID: 40374185. DOI: 10.1542/peds.2024-069515.
Affiliation(s)
- Cris G Ebby
- Division of Hospital Medicine and Complex Care, Department of Pediatrics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Gabriel Tse
- Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, California
- Jessica Bethel
- Division of Hospital Medicine and Complex Care, Department of Pediatrics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Qianqian Zhao
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Danielle M Gerber
- Division of Hospital Medicine and Complex Care, Department of Pediatrics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Michelle M Kelly
- Division of Hospital Medicine and Complex Care, Department of Pediatrics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
3. Rust P, Frings J, Meister S, Fehring L. Evaluation of a large language model to simplify discharge summaries and provide cardiological lifestyle recommendations. Communications Medicine 2025;5:208. PMID: 40442348. PMCID: PMC12122782. DOI: 10.1038/s43856-025-00927-2.
Abstract
BACKGROUND Hospital discharge summaries are essential for the continuity of care. However, medical jargon, abbreviations, and technical language often make them too complex for patients to understand, and they frequently omit lifestyle recommendations important for self-management. This study explored using a large language model (LLM) to enhance discharge summary readability and augment it with lifestyle recommendations. METHODS We collected 20 anonymized cardiology discharge summaries. GPT-4o was prompted using full-text and segment-wise approaches to simplify each summary and generate lifestyle recommendations. Readability was measured via three standardized metrics (modified Flesch-Reading-Ease, Vienna Non-fiction Text Formula, Lesbarkeitsindex), and multiple quality dimensions were evaluated by 12 medical experts. RESULTS LLM-generated summaries from both prompting approaches are significantly more readable compared to the original summaries across all metrics (p < 0.0001). Based on 60 expert ratings for the full-text approach and 60 for the segment-wise approach, experts '(strongly) agree' that LLM-summaries are correct (full-text: 85%; segment-wise: 80%), complete (78%; 92%), harmless (83%; 88%), and comprehensible for patients (88%; 97%). Experts '(strongly) agree' that LLM-generated recommendations are relevant in 92%, evidence-based in 88%, personalized in 70%, complete in 88%, consistent in 93%, and harmless in 88% of 60 ratings. CONCLUSIONS LLM-generated summaries achieve a 10th-grade readability level and high-quality ratings. While LLM-generated lifestyle recommendations are generally of high quality, personalization is limited. These findings suggest that LLMs could help create more patient-centric discharge summaries. Further research is needed to confirm clinical utility and address quality assurance, regulatory compliance, and clinical integration challenges.
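Of the three readability metrics named above, the Lesbarkeitsindex (LIX) has the simplest standard definition: average sentence length plus the percentage of words longer than six letters. A rough sketch under that textbook definition (the paper's exact tokenization, and its modified Flesch and Vienna formulas, are not reproduced here):

```python
import re

def lix(text):
    """Lesbarkeitsindex: words per sentence + percentage of long words.

    A 'long' word has more than six letters; sentence boundaries are
    approximated by ., !, and ? (a simplification).
    """
    words = re.findall(r"[A-Za-zÄÖÜäöüß]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)
```

Lower LIX values indicate easier text, which is the direction of improvement the study reports for the LLM-generated summaries.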
Affiliation(s)
- Paul Rust
- Faculty of Health, School of Medicine, Witten/Herdecke University, Alfred-Herrhausen-Strasse 50, 58455, Witten, Germany
- Julian Frings
- Faculty of Health, School of Medicine, Witten/Herdecke University, Alfred-Herrhausen-Strasse 50, 58455, Witten, Germany
- Sven Meister
- Health Care Informatics, Faculty of Health, School of Medicine, Witten/Herdecke University, Pferdebachstrasse 11, 58455, Witten, Germany
- Department Healthcare, Fraunhofer Institute for Software and Systems Engineering ISST, Speicherstrasse 6, 44147, Dortmund, Germany
- Leonard Fehring
- Faculty of Health, School of Medicine, Witten/Herdecke University, Alfred-Herrhausen-Strasse 50, 58455, Witten, Germany
- Health Care Informatics, Faculty of Health, School of Medicine, Witten/Herdecke University, Pferdebachstrasse 11, 58455, Witten, Germany
- Helios University Hospital Wuppertal, Department of Gastroenterology, Witten/Herdecke University, Heusnerstrasse 40, 42283, Wuppertal, Germany
4. Hains L, Kleinig O, Murugappa A, Gluck S, Marks J, Gilbert T, Bacchi S. Large language model discharge summary preparation using real-world electronic medical record data shows promise. Intern Med J 2025. PMID: 40434141. DOI: 10.1111/imj.70073.
Abstract
The efficacy of large language models (LLMs) in discharge summary preparation using real clinical documentation has received little formal evaluation. Our study aimed to test the ability of two LLMs to generate discharge summaries, which were scored using a validated discharge summary scoring metric. The models performed nearly identically, with the llama3:instruct model having a mean score of 19.1/31 (SD: 2.42) compared to 19.2/31 (SD: 3.48) for llama3:70b. Using LLMs to aid in the generation of discharge summaries may help to reduce the overall clinical administrative workload.
Affiliation(s)
- Lewis Hains
- Adelaide Medical School, University of Adelaide, Adelaide, South Australia, Australia
- Oliver Kleinig
- Adelaide Medical School, University of Adelaide, Adelaide, South Australia, Australia
- Ashwin Murugappa
- Adelaide Medical School, University of Adelaide, Adelaide, South Australia, Australia
- Samuel Gluck
- Division of Medicine, Lyell McEwin Hospital, Adelaide, South Australia, Australia
- Jarrod Marks
- Division of Medicine, Lyell McEwin Hospital, Adelaide, South Australia, Australia
- Toby Gilbert
- Adelaide Medical School, University of Adelaide, Adelaide, South Australia, Australia
- Division of Medicine, Lyell McEwin Hospital, Adelaide, South Australia, Australia
- Stephen Bacchi
- Adelaide Medical School, University of Adelaide, Adelaide, South Australia, Australia
- Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, USA
- Department of Neurology, Harvard University, Boston, Massachusetts, USA
5. Ruan EL, Alkattan A, Elhadad N, Rossetti SC. Clinician Perceptions of Generative Artificial Intelligence Tools and Clinical Workflows: Potential Uses, Motivations for Adoption, and Sentiments on Impact. AMIA Annu Symp Proc 2025;2024:960-969. PMID: 40417507. PMCID: PMC12099363.
Abstract
Successful integration of Generative Artificial Intelligence (AI) into healthcare requires understanding of health professionals' perspectives, ideally through data-driven approaches. In this study, we use a semi-structured survey and mixed methods analyses to explore clinicians' perceptions on the utility of generative AI for all types of clinical tasks, familiarity and competency with generative AI tools, and sentiments regarding the potential impact of generative AI on healthcare. Analysis of 116 clinician responses found differing perceptions regarding the usefulness of generative AI across clinical workflows, with information gathering from external sources rated highest and communication rated lowest. Clinician-generated prompt suggestions focused most often on clinician decision making and were of mixed quality, with participants more familiar with generative AI suggesting more high-quality prompts. Sentiments regarding the impact of generative AI varied, particularly regarding trustworthiness and impact on bias. Thematic analysis of open-ended comments highlighted concerns about patient care and the role of clinicians.
Affiliation(s)
- Elise L Ruan
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Department of Medicine, NewYork-Presbyterian/Columbia University Irving Medical Center, New York, NY, USA
- Aziz Alkattan
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Department of Surgery, NewYork-Presbyterian/Columbia University Irving Medical Center, New York, NY, USA
- Noemie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Sarah C Rossetti
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
- School of Nursing, Columbia University, New York, NY, USA
6. Rozenshtein A, Findeiss LK, Wood MJ, Shih G, Parikh JR. The U.S. Radiologist Workforce: AJR Expert Panel Narrative Review. AJR Am J Roentgenol 2025:1-8. PMID: 39692304. DOI: 10.2214/ajr.24.32085.
Abstract
The U.S. radiologist workforce has experienced periods of growth as well as stagnation and downturns, with concerns of radiologist oversupply during tight job markets followed by perceived workforce shortages. Major issues facing the radiologist workforce today include the following: the impacts of accumulated policy changes; a mismatch between the demand for radiologist services and the current size of the radiologist workforce; dissatisfaction, turnover, and burnout among radiologists; challenges in radiology resident education due to employment trends; and the promise and challenges of artificial intelligence. To address current and future workforce shortages, radiology as a profession must adapt to ongoing stresses and the changing care ecosystem by promoting appropriate utilization, leveraging all existing workforce reserves, and embracing innovation. In this AJR Expert Panel Narrative Review, we explore the recent history of the U.S. radiologist workforce; examine the political, social, and educational milieus faced by current and future radiologists; and consider the effects of disruptive technology.
Affiliation(s)
- Anna Rozenshtein
- Department of Radiology, Westchester Medical Center/New York Medical College, 100 Woods Rd, Valhalla, New York, NY 10591
- Monica J Wood
- Department of Radiology, Mount Auburn Hospital/Harvard Medical School, Cambridge, MA
- George Shih
- Department of Radiology, Weill Cornell Medicine, New York, NY
- Jay R Parikh
- Division of Diagnostic Radiology, The University of Texas MD Anderson Cancer Center, Houston, TX
7. Bouguettaya A, Team V, Stuart EM, Aboujaoude E. AI-driven report-generation tools in mental healthcare: A review of commercial tools. Gen Hosp Psychiatry 2025;94:150-158. PMID: 40088857. DOI: 10.1016/j.genhosppsych.2025.02.018.
Abstract
Artificial intelligence (AI) systems are increasingly being integrated in clinical care, including for AI-powered note-writing. We aimed to develop and apply a scale for assessing mental health electronic health records (EHRs) that use large language models (LLMs) for note-writing, focusing on their features, security, and ethics. The assessment involved analyzing product information and directly querying vendors about their systems. On their websites, the majority of vendors provided comprehensive information on data protection, privacy measures, multi-platform availability, patient access features, software update history, and Meaningful Use compliance. Most products clearly indicated the LLM's capabilities in creating customized reports or functioning as a co-pilot. However, critical information was often absent, including details on LLM training methodologies, the specific LLM used, bias correction techniques, and methods for evaluating the evidence base. The lack of transparency regarding LLM specifics and bias mitigation strategies raises concerns about the ethical implementation and reliability of these systems in clinical practice. While LLM-enhanced EHRs show promise in alleviating the documentation burden for mental health professionals, there is a pressing need for greater transparency and standardization in reporting LLM-related information. We propose recommendations for the future development and implementation of these systems to ensure they meet the highest standards of security, ethics, and clinical care.
Affiliation(s)
- Ayoub Bouguettaya
- Department of Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, United States; School of Nursing and Midwifery, Monash University, Melbourne, Victoria, Australia
- Victoria Team
- School of Nursing and Midwifery, Monash University, Melbourne, Victoria, Australia
- Elizabeth M Stuart
- Jonathan Jaques Children's Cancer Institute, Miller Children's & Women's Hospital Long Beach, Long Beach, CA, United States
- Elias Aboujaoude
- Department of Biomedical Sciences, Cedars-Sinai Medical Center, Los Angeles, CA, United States; Department of Psychiatry and Behavioral Sciences, Stanford University, Stanford, CA, United States
8. MacKay EJ, Goldfinger S, Chan TJ, Grasfield RH, Eswar VJ, Li K, Cao Q, Pouch AM. Automated structured data extraction from intraoperative echocardiography reports using large language models. Br J Anaesth 2025;134:1308-1317. PMID: 40037947. PMCID: PMC12106877. DOI: 10.1016/j.bja.2025.01.028.
Abstract
BACKGROUND Consensus-based large language model (LLM) ensembles might provide an automated solution for extracting structured data from unstructured text in echocardiography reports. METHODS This cross-sectional study utilised 600 intraoperative transoesophageal reports (100 for prompt engineering; 500 for testing) randomly sampled from 7106 adult patients undergoing cardiac surgery at two hospitals within the University of Pennsylvania Healthcare System. Three echocardiographic parameters (left ventricular ejection fraction, right ventricular systolic function, and tricuspid regurgitation) were extracted from both the presurgical and postsurgical sections of the reports. LLM ensembles were generated using five open-source LLMs and four voting strategies: (1) unanimous (five out of five in agreement); (2) supermajority (four or more of five in agreement); (3) majority (three or more of five in agreement); and (4) plurality (two or more of five in agreement). Returned LLM ensemble responses were compared with the reference standard dataset to calculate raw accuracy, consensus accuracy, error rate, and yield. RESULTS Of the four LLM ensembles, the unanimous LLM ensemble achieved the highest consensus accuracies (99.4% presurgical; 97.9% postsurgical) and the lowest error rates (0.6% presurgical; 2.1% postsurgical) but had the lowest data extraction yields (81.7% presurgical; 80.5% postsurgical) and the lowest raw accuracies (81.2% presurgical; 78.9% postsurgical). In contrast, the plurality LLM ensemble achieved the highest raw accuracies (96.1% presurgical; 93.7% postsurgical) and the highest data extraction yields (99.4% presurgical; 98.9% postsurgical) but had the lowest consensus accuracies (96.7% presurgical; 94.7% postsurgical) and highest error rates (3.3% presurgical; 5.3% postsurgical). CONCLUSIONS A consensus-based LLM ensemble successfully generated structured data from unstructured text contained in intraoperative transoesophageal reports.
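The four voting strategies above reduce to a single threshold on the modal answer's vote count among the five models, and the three reported metrics follow directly from which fields reach consensus. A minimal sketch of that logic (function and variable names are ours, not the authors'):

```python
from collections import Counter

# Minimum agreeing votes (out of 5 models) per strategy.
VOTES_NEEDED = {"unanimous": 5, "supermajority": 4, "majority": 3, "plurality": 2}

def ensemble_vote(responses, strategy):
    """Return the most common answer if it meets the strategy's vote
    threshold, else None (no consensus, so the field is left unextracted)."""
    answer, count = Counter(responses).most_common(1)[0]
    return answer if count >= VOTES_NEEDED[strategy] else None

def metrics(predictions, truths):
    """Raw accuracy (all fields), consensus accuracy (answered fields
    only), and yield (fraction of fields answered)."""
    answered = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    raw = sum(p == t for p, t in zip(predictions, truths)) / len(predictions)
    consensus = sum(p == t for p, t in answered) / len(answered) if answered else 0.0
    return raw, consensus, len(answered) / len(predictions)
```

The structure makes the reported trade-off mechanical: a stricter threshold answers fewer fields (lower yield, lower raw accuracy) but is more often right on the fields it does answer (higher consensus accuracy, lower error rate).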
Affiliation(s)
- Emily J MacKay
- Department of Anaesthesiology and Critical Care, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Penn Center for Perioperative Outcomes Research and Transformation (CPORT), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Penn's Cardiovascular Outcomes, Quality and Evaluative Research Center (CAVOQER), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA.
- Shir Goldfinger
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Trevor J Chan
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
- Rachel H Grasfield
- School of Medicine and Health Sciences, Des Moines University, Des Moines, IA, USA
- Vikram J Eswar
- Department of Anaesthesiology and Critical Care, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kelly Li
- Harvard Medical School, Harvard University, Boston, MA, USA
- Quy Cao
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Alison M Pouch
- Department of Radiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA; Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA
9. Groza T, Rayabsri W, Gration D, Hariram H, Jamuar SS, Baynam G. First steps toward building natural history of diseases computationally: Lessons learned from the Noonan syndrome use case. Am J Hum Genet 2025;112:1158-1172. PMID: 40245863. PMCID: PMC12120186. DOI: 10.1016/j.ajhg.2025.03.014.
Abstract
Rare diseases (RDs) are conditions affecting fewer than 1 in 2,000 people, with over 7,000 identified, primarily genetic in nature, and more than half impacting children. Although each RD affects a small population, collectively, between 3.5% and 5.9% of the global population, or 262.9-446.2 million people, live with an RD. Most RDs lack established treatment protocols, highlighting the need for proper care pathways addressing prognosis, diagnosis, and management. Advances in generative AI and large language models (LLMs) offer new opportunities to document the temporal progression of phenotypic features, addressing gaps in current knowledge bases. This study proposes an LLM-based framework to capture the natural history of diseases, specifically focusing on Noonan syndrome. The framework aims to document phenotypic trajectories, validate against RD knowledge bases, and integrate insights into care coordination using electronic health record (EHR) data from the Undiagnosed Diseases Program Singapore.
Affiliation(s)
- Tudor Groza
- Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia; Bioinformatics Institute, Agency for Science, Technology and Research (A(∗)STAR), 30 Biopolis Street #07-01 Matrix, Singapore 138671, Singapore; SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore 169609, Singapore; School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University, Kent Street, Bentley, WA 6102, Australia.
- Warittha Rayabsri
- Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia
- Dylan Gration
- Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia
- Harshini Hariram
- Division of Medical Education, School of Medical Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester M13 9PL, UK
- Saumya Shekhar Jamuar
- SingHealth Duke-NUS Institute of Precision Medicine, 5 Hospital Drive Level 9, Singapore 169609, Singapore; Genetics Service, Department of Paediatrics, KK Women's and Children's Hospital, 100 Bukit Timah Road, Singapore 229899, Singapore; SingHealth Duke-NUS Genomic Medicine Centre, 100 Bukit Timah Road, Singapore 229899, Singapore
- Gareth Baynam
- Rare Care Centre, Perth Children's Hospital, Nedlands, WA 6009, Australia; Western Australian Register of Developmental Anomalies, King Edward Memorial Hospital, 374 Bagot Road, Subiaco, WA 6008, Australia; Faculty of Health and Medical Sciences, University of Western Australia, 35 Stirling Highway, Crawley, WA 6009, Australia
10. Shahnam A, Nindra U, Hitchen N, Tang J, Hong M, Hong JH, Au-Yeung G, Chua W, Ng W, Hopkins AM, Sorich MJ. Application of Generative Artificial Intelligence for Physician and Patient Oncology Letters-AI-OncLetters. JCO Clin Cancer Inform 2025;9:e2400323. PMID: 40315407. DOI: 10.1200/cci-24-00323.
Abstract
PURPOSE Although large language models (LLMs) are increasingly used in clinical practice, formal assessments of their quality, accuracy, and effectiveness in medical oncology remain limited. We aimed to evaluate the ability of ChatGPT, an LLM, to generate physician and patient letters from clinical case notes. METHODS Six oncologists created 29 (four training, 25 final) synthetic oncology case notes. Structured prompts for ChatGPT were iteratively developed using the four training cases; once finalized, 25 physician-directed and patient-directed letters were generated. These underwent evaluation by expert consumers and oncologists for accuracy, relevance, and readability using Likert scales. The patient letters were also assessed with the Patient Education Materials Assessment Tool for Print (PEMAT-P), Flesch Reading Ease, and Simple Measure of Gobbledygook index. RESULTS Among physician-to-physician letters, 95% (119/125) of oncologists agreed they were accurate, comprehensive, and relevant, with no safety concerns noted. These letters demonstrated precise documentation of history, investigations, and treatment plans and were logically and concisely structured. Patient-directed letters achieved a mean Flesch Reading Ease score of 73.3 (seventh-grade reading level) and a PEMAT-P score above 80%, indicating high understandability. Consumer reviewers found them clear and appropriate for patient communication. Some omissions of details (eg, side effects), stylistic inconsistencies, and repetitive phrasing were identified, although no clinical safety issues emerged. Seventy-two percent (90/125) of consumers expressed willingness to receive artificial intelligence (AI)-generated patient letters. CONCLUSION ChatGPT, when guided by structured prompts, can generate high-quality letters that align with clinical and patient communication standards. 
No clinical safety concerns were identified, although addressing occasional omissions and improving natural language flow could enhance their utility in practice. Further studies comparing AI-generated and human-written letters are recommended.
Affiliation(s)
- Adel Shahnam
- Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
- Udit Nindra
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Nadia Hitchen
- Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
- Joanne Tang
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Martin Hong
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Jun Hee Hong
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- George Au-Yeung
- Department of Medical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia
- Sir Peter MacCallum Department of Oncology, The University of Melbourne, Melbourne, VIC, Australia
- Wei Chua
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Weng Ng
- Department of Medical Oncology, Liverpool Hospital, Sydney, NSW, Australia
- Ashley M Hopkins
- Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
- Michael J Sorich
- Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
11. Tailor PD, D'Souza HS, Castillejo Becerra CM, Dahl HM, Patel NR, Kaplan TM, Kohli D, Bothun ED, Mohney BG, Tooley AA, Baratz KH, Iezzi R, Barkmeier AJ, Bakri SJ, Roddy GW, Hodge D, Sit AJ, Starr MR, Chen JJ. Evaluation of AI Summaries on Interdisciplinary Understanding of Ophthalmology Notes. JAMA Ophthalmol 2025;143:410-419. PMID: 40178837. PMCID: PMC11969348. DOI: 10.1001/jamaophthalmol.2025.0351.
Abstract
Importance Specialized ophthalmology terminology limits comprehension for nonophthalmology clinicians and professionals, hindering interdisciplinary communication and patient care. The clinical implementation of large language models (LLMs) into practice has to date been relatively unexplored. Objective To evaluate LLM-generated plain language summaries (PLSs) integrated into standard ophthalmology notes (SONs) in improving diagnostic understanding, satisfaction, and clarity. Design, Setting, and Participants Randomized quality improvement study conducted from February 1, 2024, to May 31, 2024, including data from inpatient and outpatient encounters in a single tertiary academic center. Participants were nonophthalmology clinicians and professionals and ophthalmologists. The single inclusion criterion was any encounter note generated by an ophthalmologist during the study dates. Exclusion criteria were (1) lack of established nonophthalmology clinicians and professionals for outpatient encounters and (2) procedure-only patient encounters. Intervention Addition of LLM-generated plain language summaries to ophthalmology notes. Main Outcomes and Measures The primary outcome was survey responses from nonophthalmology clinicians and professionals assessing understanding, satisfaction, and clarity of ophthalmology notes. Secondary outcomes were survey responses from ophthalmologists evaluating PLS in terms of clinical workflow and accuracy, objective measures of semantic quality, and safety analysis. Results A total of 362 (85%) nonophthalmology clinicians and professionals (33.0% response rate) preferred the PLS to SON. Demographic data on age, race and ethnicity, and sex were not collected. 
Nonophthalmology clinicians and professionals reported enhanced diagnostic understanding (percentage point increase, 9.0; 95% CI, 0.3-18.2; P = .01), increased note detail satisfaction (percentage point increase, 21.5; 95% CI, 11.4-31.5; P < .001), and improved explanation clarity (percentage point increase, 23.0; 95% CI, 12.0-33.1; P < .001) for notes containing a PLS. The addition of a PLS was associated with reduced comprehension gaps between clinicians who were comfortable and uncomfortable with ophthalmology terminology (from 26.1% [95% CI, 13.7%-38.6%; P < .001] to 14.4% [95% CI, 4.3%-24.6%; P > .06]). PLS semantic analysis found high meaning preservation (bidirectional encoder representations from transformers score mean F1 score: 0.85) with greater readability than SONs (Flesch Reading Ease: 51.8 vs 43.6; Flesch-Kincaid Grade Level: 10.7 vs 11.9). Ophthalmologists (n = 489; 84% response rate) reported high PLS accuracy (90% [320 of 355] a great deal) with minimal review time burden (94.9% [464 of 489] ≤1 minute). PLS error rate on ophthalmologist review was 26% (126 of 489). A total of 83.9% (104 of 126) of errors were deemed low risk for harm and none had a risk of severe harm or death. Conclusions and Relevance In this study, use of LLM-generated PLSs was associated with enhanced comprehension and satisfaction among nonophthalmology clinicians and professionals, which might aid interdisciplinary communication. Careful implementation and safety monitoring are recommended for clinical integration given the persistence of errors despite physician review.
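The readability gap reported above (Flesch Reading Ease 51.8 vs 43.6; Flesch-Kincaid Grade Level 10.7 vs 11.9) comes from two closed-form formulas over sentence, word, and syllable counts. A minimal Python sketch, using a crude vowel-group syllable heuristic rather than the dictionary-based counters production readability tools use (this is an illustration, not the study's tooling):

```python
import re

def count_syllables(word: str) -> int:
    # Vowel-group heuristic; real readability tools use syllable dictionaries.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade = 0.39 * wps + 11.8 * spw - 15.59
    return ease, grade
```

Higher Reading Ease and lower Grade Level indicate easier text, which is the direction of the PLS-versus-SON difference the study reports.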
Affiliation(s)
- Prashant D. Tailor
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Jules Stein Eye Institute and Department of Ophthalmology, David Geffen School of Medicine at UCLA, University of California, Los Angeles, Los Angeles, California
- Heidi M. Dahl
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Neil R. Patel
- Department of Allergy & Immunology, Icahn School of Medicine at Mount Sinai, New York, New York
- Tyler M. Kaplan
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Darrell Kohli
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Erick D. Bothun
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Brian G. Mohney
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Keith H. Baratz
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Raymond Iezzi
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Sophie J. Bakri
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Gavin W. Roddy
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- David Hodge
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, Florida
- Arthur J. Sit
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- John J. Chen
- Department of Ophthalmology, Mayo Clinic, Rochester, Minnesota
- Department of Neurology, Mayo Clinic, Rochester, Minnesota
12
Burwell JM. The AI Efficiency Paradox: Reclaiming Quality Patient Care in an Era of Optimization. J Med Syst 2025; 49:49. [PMID: 40240567 DOI: 10.1007/s10916-025-02183-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2025] [Accepted: 04/08/2025] [Indexed: 04/18/2025]
Abstract
We examine how artificial intelligence (AI) integration in healthcare may create an "efficiency paradox," in which technologies designed to reduce workload instead generate new layers of inefficiency. We argue that AI implementation strategies that prioritize efficiency metrics over meaningful patient interactions risk undermining care quality. A framework is proposed for evaluating AI adoption that balances technological optimization with preservation of the physician-patient relationship.
Affiliation(s)
- Julian Michael Burwell
- Department of Medical Education, Geisinger Commonwealth School of Medicine, Scranton, PA, USA.
13
Soroush A, Giuffrè M, Chung S, Shung DL. Generative Artificial Intelligence in Clinical Medicine and Impact on Gastroenterology. Gastroenterology 2025:S0016-5085(25)00634-1. [PMID: 40245953 DOI: 10.1053/j.gastro.2025.03.038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2025] [Revised: 03/08/2025] [Accepted: 03/14/2025] [Indexed: 04/19/2025]
Abstract
The pace of artificial intelligence (AI) integration into health care has accelerated with rapid advances in generative AI (genAI). Gastroenterology and hepatology in particular stand to be transformed, given multimodal workflows that integrate endoscopic video, radiologic imaging, tabular data, and unstructured note text. GenAI will affect the entire spectrum of clinical experience, from administrative tasks to diagnostic guidance and treatment recommendations. Unlike traditional machine learning approaches, genAI is more flexible: one platform can be used across multiple tasks. Initial evidence suggests benefits in lower-level administrative tasks, such as clinical documentation, medical billing, and scheduling, and in information tasks, such as patient education and summarization of the medical literature. No evidence yet exists for genAI solutions to more complex tasks relevant to clinical care, such as clinical reasoning for diagnostic and treatment decisions that may affect patient outcomes. Challenges of output reliability, data privacy, and useful integration remain; potential solutions include robust validation, regulatory oversight, and "human-AI teaming" strategies to ensure safe, effective deployment. We remain optimistic about the potential of genAI to augment clinical expertise, given its adaptability across data modalities to obtain and focus relevant information flows and its human-friendly interfaces that ease use. We believe the potential of genAI for dynamic human-algorithmic interaction may allow a degree of clinician-directed customization that enhances human presence.
Affiliation(s)
- Ali Soroush
- Division of Data-Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, New York; Henry D. Janowitz Division of Gastroenterology, Icahn School of Medicine at Mount Sinai, New York, New York; Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
- Mauro Giuffrè
- Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut
- Sunny Chung
- Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut
- Dennis L Shung
- Section of Digestive Diseases, Department of Medicine, Yale School of Medicine, New Haven, Connecticut; Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut.
14
Lim B, Seth I, Maxwell M, Cuomo R, Ross RJ, Rozen WM. Evaluating the Efficacy of Large Language Models in Generating Medical Documentation: A Comparative Study of ChatGPT-4, ChatGPT-4o, and Claude. Aesthetic Plast Surg 2025:10.1007/s00266-025-04842-8. [PMID: 40229614 DOI: 10.1007/s00266-025-04842-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2024] [Accepted: 03/14/2025] [Indexed: 04/16/2025]
Abstract
BACKGROUND Large language models (LLMs) have demonstrated transformative potential in health care. They can enhance clinical and academic medicine by facilitating accurate diagnoses, interpreting laboratory results, and automating documentation processes. This study evaluates the efficacy of LLMs in generating surgical operation reports and discharge summaries, focusing on accuracy, efficiency, and quality. METHODS This study assessed the effectiveness of three leading LLMs, ChatGPT-4, ChatGPT-4o, and Claude, using six prompts and analyzing their responses for readability and output quality, validated by plastic surgeons. Readability was measured with the Flesch-Kincaid Grade Level, Flesch Reading Ease score, and Coleman-Liau Index, while reliability was evaluated using the DISCERN score. A paired two-tailed t-test (p<0.05) assessed the statistical significance of differences in these metrics and in the time taken to generate operation reports and discharge summaries, compared against the authors' results. RESULTS Table 3 shows statistically significant differences in readability between ChatGPT-4o and Claude across all metrics, while ChatGPT-4 and Claude differ significantly on the Flesch Reading Ease and Coleman-Liau indices. Table 6 reveals extremely low p-values across BL, IS, and MM for all models, with Claude consistently outperforming both ChatGPT-4 and ChatGPT-4o. Additionally, Claude generated documents the fastest, completing tasks in approximately 10 to 14 s. These results suggest that Claude not only excels in readability but also demonstrates superior reliability and speed, making it an efficient choice for practical applications. CONCLUSION The study highlights the importance of selecting appropriate LLMs for clinical use. Integrating these LLMs can streamline healthcare documentation, improve efficiency, and enhance patient outcomes through clearer communication and more accurate medical reports.
LEVEL OF EVIDENCE V This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .
Affiliation(s)
- Bryan Lim
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia.
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia.
- Ishith Seth
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia
- Molly Maxwell
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Roberto Cuomo
- Department of Plastic and Reconstructive Surgery, University of Siena, Siena, Italy
- Richard J Ross
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Warren M Rozen
- Department of Plastic and Reconstructive Surgery, Frankston Hospital, Peninsula Health, Frankston, VIC, Australia
- Peninsula Clinical School, Central Clinical School, Faculty of Medicine, Monash University, Frankston, VIC, Australia
15
Maddox TM, Embí P, Gerhart J, Goldsack J, Parikh RB, Sarich TC. Generative AI in Medicine - Evaluating Progress and Challenges. N Engl J Med 2025. [PMID: 40208922 DOI: 10.1056/nejmsb2503956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/12/2025]
Affiliation(s)
- Peter Embí
- Vanderbilt University Medical Center, Nashville
16
Liu S, McCoy AB, Wright A. Improving large language model applications in biomedicine with retrieval-augmented generation: a systematic review, meta-analysis, and clinical development guidelines. J Am Med Inform Assoc 2025; 32:605-615. [PMID: 39812777 PMCID: PMC12005634 DOI: 10.1093/jamia/ocaf008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2024] [Revised: 12/17/2024] [Accepted: 01/03/2025] [Indexed: 01/16/2025]
Abstract
OBJECTIVE The objectives of this study are to synthesize findings from recent research on retrieval-augmented generation (RAG) and large language models (LLMs) in biomedicine and to provide clinical development guidelines to improve effectiveness. MATERIALS AND METHODS We conducted a systematic literature review and a meta-analysis. The report adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 statement. Searches were performed in 3 databases (PubMed, Embase, PsycINFO) using terms related to "retrieval augmented generation" and "large language model," for articles published in 2023 and 2024. We selected studies that compared baseline LLM performance with RAG performance. We developed a random-effects meta-analysis model, using the odds ratio as the effect size. RESULTS Among 335 studies, 20 were included in this literature review. The pooled effect size was 1.35, with a 95% confidence interval of 1.19-1.53, indicating a statistically significant effect (P = .001). We reported clinical tasks, baseline LLMs, retrieval sources and strategies, as well as evaluation methods. DISCUSSION Building on our literature review, we developed Guidelines for Unified Implementation and Development of Enhanced LLM Applications with RAG in Clinical Settings to inform clinical applications using RAG. CONCLUSION Overall, RAG implementation improved performance over baseline LLMs, with a pooled odds ratio of 1.35. Future research should focus on (1) system-level enhancement: combining RAG with agents, (2) knowledge-level enhancement: deep integration of knowledge into LLMs, and (3) integration-level enhancement: integrating RAG systems within electronic health records.
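A pooled odds ratio with a 95% CI, as reported above, typically comes from an inverse-variance random-effects model. A minimal sketch of the DerSimonian-Laird estimator on hypothetical (odds ratio, log-OR variance) pairs — an assumption for illustration, since the abstract does not specify the estimator or publish code:

```python
from math import exp, log, sqrt

def pooled_or(studies: list[tuple[float, float]]) -> tuple[float, tuple[float, float]]:
    """DerSimonian-Laird random-effects pooled odds ratio with 95% CI.
    Each study is (odds_ratio, variance_of_log_odds_ratio)."""
    y = [log(o) for o, _ in studies]            # per-study log odds ratios
    v = [var for _, var in studies]
    w = [1.0 / vi for vi in v]                  # fixed-effect (inverse-variance) weights
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))   # Cochran's Q
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)     # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]      # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = sqrt(1.0 / sum(w_re))
    return exp(mu), (exp(mu - 1.96 * se), exp(mu + 1.96 * se))
```

Pooling on the log scale and exponentiating at the end keeps the CI asymmetric around the odds ratio, matching how results such as 1.35 (1.19-1.53) are reported.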
Affiliation(s)
- Siru Liu
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
- Allison B McCoy
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Adam Wright
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37212, United States
- Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37212, United States
17
McMinn D, Grant T, DeFord-Watts L, Porkess V, Lens M, Rapier C, Joe WQ, Becker TA, Bender W. Using artificial intelligence to expedite and enhance plain language summary abstract writing of scientific content. JAMIA Open 2025; 8:ooaf023. [PMID: 40183004 PMCID: PMC11967854 DOI: 10.1093/jamiaopen/ooaf023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 02/12/2025] [Accepted: 03/19/2025] [Indexed: 04/05/2025]
Abstract
Objective To assess the capacity of a bespoke artificial intelligence (AI) process to help medical writers efficiently generate quality plain language summary abstracts (PLSAs). Materials and Methods Three independent studies were conducted. In Studies 1 and 3, original scientific abstracts (OSAs; n = 48, n = 2) and corresponding PLSAs written by medical writers versus bespoke AI were assessed using standard readability metrics. Study 2 compared time and effort of medical writers (n = 10) drafting PLSAs starting with an OSA (n = 6) versus the output of 1 bespoke AI (n = 6) and 1 non-bespoke AI (n = 6) process. These PLSAs (n = 72) were assessed by subject matter experts (SMEs; n = 3) for accuracy and physicians (n = 7) for patient suitability. Lastly, in Study 3, medical writers (n = 22) and patients/patient advocates (n = 5) compared quality of medical writer and bespoke AI-generated PLSAs. Results In Study 1, bespoke AI PLSAs were easier to read than medical writer PLSAs across all readability metrics (P <.01). In Study 2, bespoke AI output saved medical writers >40% in time for PLSA creation and required less effort than unassisted writing. SME-assessed quality was higher for AI-assisted PLSAs, and physicians preferred bespoke AI-generated outputs for patient use. In Study 3, bespoke AI PLSAs were more readable and rated of higher quality than medical writer PLSAs. Discussion The bespoke AI process may enhance access to health information by helping medical writers produce PLSAs of scientific content that are fit for purpose. Conclusion The bespoke AI process can more efficiently create better quality, more readable first draft PLSAs versus medical writer-generated PLSAs.
Affiliation(s)
- Wilson Q Joe
- Lumanity Communications Inc., Yardley, PA 19067, United States
18
Jung KH. Large Language Models in Medicine: Clinical Applications, Technical Challenges, and Ethical Considerations. Healthc Inform Res 2025; 31:114-124. [PMID: 40384063 PMCID: PMC12086438 DOI: 10.4258/hir.2025.31.2.114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2025] [Accepted: 04/23/2025] [Indexed: 05/20/2025]
Abstract
OBJECTIVES This study presents a comprehensive review of the clinical applications, technical challenges, and ethical considerations associated with using large language models (LLMs) in medicine. METHODS A literature survey of peer-reviewed articles, technical reports, and expert commentary from relevant medical and artificial intelligence journals was conducted. Key clinical application areas, technical limitations (e.g., accuracy, validation, transparency), and ethical issues (e.g., bias, safety, accountability, privacy) were identified and analyzed. RESULTS LLMs have potential in clinical documentation assistance, decision support, patient communication, and workflow optimization. The level of supporting evidence varies; documentation support applications are relatively mature, whereas autonomous diagnostics continue to face notable limitations regarding accuracy and validation. Key technical challenges include model hallucination, lack of robust clinical validation, integration issues, and limited transparency. Ethical concerns involve algorithmic bias risking health inequities, threats to patient safety from inaccuracies, unclear accountability, data privacy, and impacts on clinician-patient interactions. CONCLUSIONS LLMs possess transformative potential for clinical medicine, particularly by augmenting clinician capabilities. However, substantial technical and ethical hurdles necessitate rigorous research, validation, clearly defined guidelines, and human oversight. Existing evidence supports an assistive rather than autonomous role, mandating careful, evidence-based integration that prioritizes patient safety and equity.
Affiliation(s)
- Kyu-Hwan Jung
- Department of Medical Device Management and Research, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Seoul, Korea
- Smart Healthcare Research Institute, Research Institute for Future Medicine, Samsung Medical Center, Seoul, Korea
19
Biesheuvel LA, Workum JD, Reuland M, van Genderen ME, Thoral P, Dongelmans D, Elbers P. Large language models in critical care. J Intensive Med 2025; 5:113-118. [PMID: 40241839 PMCID: PMC11997603 DOI: 10.1016/j.jointm.2024.12.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2024] [Revised: 11/29/2024] [Accepted: 12/01/2024] [Indexed: 04/18/2025]
Abstract
The advent of the chat generative pre-trained transformer (ChatGPT) and large language models (LLMs) has revolutionized natural language processing (NLP). These models possess unprecedented capabilities in understanding and generating human-like language. This breakthrough holds significant promise for critical care medicine, where unstructured data and complex clinical information are abundant. Key applications of LLMs in this field include administrative support through automated documentation and patient chart summarization; clinical decision support by assisting in diagnostics and treatment planning; personalized communication to enhance patient and family understanding; and improving data quality by extracting insights from unstructured clinical notes. Despite these opportunities, challenges such as the risk of generating inaccurate or biased information ("hallucinations"), ethical considerations, and the need for clinician artificial intelligence (AI) literacy must be addressed. Integrating LLMs with traditional machine learning models, an approach known as hybrid AI, combines the strengths of both technologies while mitigating their limitations. Careful implementation, regulatory compliance, and ongoing validation are essential to ensure that LLMs enhance patient care rather than hinder it. LLMs have the potential to transform critical care practices, but integrating them requires caution. Responsible use and thorough clinician training are crucial to fully realize their benefits.
Affiliation(s)
- Laurens A. Biesheuvel
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Jessica D. Workum
- Department of Intensive Care, Elisabeth-TweeSteden Hospital, Tilburg, The Netherlands
- Department of Adult Intensive Care, Erasmus Medical Center, Rotterdam, The Netherlands
- Merijn Reuland
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Patrick Thoral
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
- Dave Dongelmans
- Department of Intensive Care Medicine, Amsterdam UMC, National Intensive Care Evaluation (NICE) Foundation, Amsterdam UMC location University of Amsterdam, Department of Medical Informatics, Amsterdam Public Health, Amsterdam, The Netherlands
- Paul Elbers
- Department of Intensive Care Medicine, Center for Critical Care Computational Intelligence, Amsterdam Medical Data Science, Amsterdam Public Health, Amsterdam Institute for Immunity and Infectious Diseases, Amsterdam Cardiovascular Science, Amsterdam UMC, Vrije Universiteit, University of Amsterdam, Amsterdam, The Netherlands
20
Zhou J, Zhang J, Wan R, Cui X, Liu Q, Guo H, Shi X, Fu B, Meng J, Yue B, Zhang Y, Zhang Z. Integrating AI into clinical education: evaluating general practice trainees' proficiency in distinguishing AI-generated hallucinations and impacting factors. BMC Med Educ 2025; 25:406. [PMID: 40108629 PMCID: PMC11924592 DOI: 10.1186/s12909-025-06916-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2024] [Accepted: 02/24/2025] [Indexed: 03/22/2025]
Abstract
OBJECTIVE To assess the ability of general practice (GP) trainees to detect AI-generated hallucinations in simulated clinical practice; ChatGPT-4o was used to generate the material. Hallucinations were categorized into three types based on the accuracy of the answers and explanations: (1) correct answers with incorrect or flawed explanations, (2) incorrect answers with explanations that contradict factual evidence, and (3) incorrect answers with correct explanations. METHODS This multi-center, cross-sectional survey study involved 142 GP trainees, all of whom were undergoing general practice specialist training and volunteered to participate. The study evaluated the accuracy and consistency of ChatGPT-4o, as well as the trainees' response time, accuracy, sensitivity (d'), and response tendency (β). Binary regression analysis was used to explore factors affecting the trainees' ability to identify errors generated by ChatGPT-4o. RESULTS A total of 137 participants were included, with a mean age of 25.93 years. Half of the participants were unfamiliar with AI, and 35.0% had never used it. ChatGPT-4o's overall accuracy was 80.8%, which decreased slightly to 80.1% after human verification. However, its accuracy on professional practice (Subject 4) was only 57.0%, dropping further to 44.2% after human verification. A total of 87 AI-generated hallucinations were identified, primarily at the application and evaluation levels. The mean accuracy of detecting these hallucinations was 55.0%, and the mean sensitivity (d') was 0.39. Regression analysis revealed that shorter response times (OR = 0.92, P = 0.02), higher self-assessed AI understanding (OR = 0.16, P = 0.04), and more frequent AI use (OR = 10.43, P = 0.01) were associated with stricter error detection criteria. CONCLUSIONS GP trainees had difficulty identifying ChatGPT-4o's errors, particularly in clinical scenarios, highlighting the importance of improving AI literacy and critical thinking skills to ensure effective integration of AI into medical education.
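The sensitivity (d') and response-tendency (β) measures above come from classical signal detection theory: z-transform the hit and false-alarm rates and compare them. A hedged sketch from raw counts (the 0.5 log-linear correction is one common convention, not necessarily the study's):

```python
from math import exp
from statistics import NormalDist

def sdt_measures(hits: int, misses: int, fas: int, crs: int) -> tuple[float, float]:
    """Sensitivity d' and criterion beta from raw counts, with a 0.5
    log-linear correction so extreme rates stay off 0 and 1."""
    h = (hits + 0.5) / (hits + misses + 1.0)    # corrected hit rate
    f = (fas + 0.5) / (fas + crs + 1.0)         # corrected false-alarm rate
    z = NormalDist().inv_cdf                     # probability -> z-score
    d_prime = z(h) - z(f)
    beta = exp((z(f) ** 2 - z(h) ** 2) / 2.0)   # likelihood-ratio criterion
    return d_prime, beta
```

With unbiased responding (hit rate and false-alarm rate symmetric about 0.5) β equals 1; β above 1 indicates a conservative criterion, the "stricter error detection criteria" direction described in the regression results.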
Affiliation(s)
- Jiacheng Zhou
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Jintao Zhang
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Rongrong Wan
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Xiaochuan Cui
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Qiyu Liu
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Hua Guo
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Xiaofen Shi
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Bingbing Fu
- Department of Postgraduate Education, The First Affiliated Hospital of Jiamusi University, Heilongjiang, China
- Jia Meng
- Department of General Practice, The Second Affiliated Hospital of Harbin Medical University, Heilongjiang, China
- Bo Yue
- Residency Training Center, The Second Affiliated Hospital of Qiqihar Medical University, Heilongjiang, China
- Yunyun Zhang
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Department of Postgraduate Education, The First Affiliated Hospital of Jiamusi University, Heilongjiang, China
- Education Department, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi Medical Center, Wuxi People's Hospital, Qingyang road 299, Wuxi, China
- Zhiyong Zhang
- Department of General Practice, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi, Jiangsu, China
- Wuxi Medical Center, Nanjing Medical University, Wuxi People's Hospital, Wuxi, Jiangsu, China
- Department of Postgraduate Education, The First Affiliated Hospital of Jiamusi University, Heilongjiang, China
- Education Department, The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi Medical Center, Wuxi People's Hospital, Qingyang road 299, Wuxi, China
21
Habs M, Knecht S, Schmidt-Wilcke T. Using artificial intelligence (AI) for form and content checks of medical reports: Proofreading by ChatGPT4.0 in a neurology department. Z Evid Fortbild Qual Gesundhwes 2025:S1865-9217(25)00079-0. [PMID: 40107951 DOI: 10.1016/j.zefq.2025.02.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 02/10/2025] [Accepted: 02/14/2025] [Indexed: 03/22/2025]
Abstract
INTRODUCTION Medical reports contain critical information and require concise language, yet often display errors despite advances in digital tools. This study compared the effectiveness of ChatGPT 4.0 in reporting orthographic, grammatical, and content errors in German neurology reports to a human expert. MATERIALS AND METHODS Ten neurology reports were embedded with ten linguistic errors each, including typographical and grammatical mistakes, and one significant content error. The reports were reviewed by ChatGPT 4.0 using three prompts: (1) check the text for spelling and grammatical errors and report them in a list format without altering the original text, (2) identify spelling and grammatical errors and generate a revised version of the text, ensuring content integrity, (3) evaluate the text for factual inaccuracies, including incorrect information and treatment errors, and report them without modifying the original text. Human control was provided by an experienced medical secretary. Outcome parameters were processing time, percentage of identified errors, and overall error detection rate. RESULTS Artificial intelligence (AI) accuracy in error detection was 35% (median) for Prompt 1 and 75% for Prompt 2. The mean word count of erroneous medical reports was 980 (SD = 180). AI-driven report generation was significantly faster than human review (AI Prompt 1: 102.4 s; AI Prompt 2: 209.4 s; Human: 374.0 s; p < 0.0001). Prompt 1, a tabular error report, was faster but less accurate than Prompt 2, a revised version of the report (p = 0.0013). Content analysis by Prompt 3 identified 70% of errors in 34.6 seconds. CONCLUSIONS AI-driven text processing for medical reports is feasible and effective. ChatGPT 4.0 demonstrated strong performance in detecting and reporting errors. The effectiveness of AI depends on prompt design, significantly impacting quality and duration. Integration into medical workflows could enhance accuracy and efficiency. 
AI holds promise in improving medical report writing. However, proper prompt design seems to be crucial. Appropriately integrated AI can significantly enhance supervision and quality control in health care documentation.
Affiliation(s)
- Maximilian Habs
- Department of Neurology, Bezirksklinikum Mainkofen (BKM), Deggendorf, Germany
- Stefan Knecht
- University Hospital of Düsseldorf (UKD), Düsseldorf, Germany
- Tobias Schmidt-Wilcke
- Department of Neurology, Bezirksklinikum Mainkofen (BKM), Deggendorf, Germany; University Hospital of Düsseldorf (UKD), Düsseldorf, Germany
22
Vrdoljak J, Boban Z, Vilović M, Kumrić M, Božić J. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare (Basel) 2025; 13:603. [PMID: 40150453 PMCID: PMC11942098 DOI: 10.3390/healthcare13060603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2024] [Revised: 02/07/2025] [Accepted: 03/06/2025] [Indexed: 03/29/2025] Open
Abstract
Background/Objectives: Large language models (LLMs) have shown significant potential to transform various aspects of healthcare. This review aims to explore the current applications, challenges, and future prospects of LLMs in medical education, clinical decision support, and healthcare administration. Methods: A comprehensive literature review was conducted, examining the applications of LLMs across the three key domains. The analysis included their performance, challenges, and advancements, with a focus on techniques like retrieval-augmented generation (RAG). Results: In medical education, LLMs show promise as virtual patients, personalized tutors, and tools for generating study materials. Some models have outperformed junior trainees in specific medical knowledge assessments. Concerning clinical decision support, LLMs exhibit potential in diagnostic assistance, treatment recommendations, and medical knowledge retrieval, though performance varies across specialties and tasks. In healthcare administration, LLMs effectively automate tasks like clinical note summarization, data extraction, and report generation, potentially reducing administrative burdens on healthcare professionals. Despite their promise, challenges persist, including hallucination mitigation, addressing biases, and ensuring patient privacy and data security. Conclusions: LLMs have transformative potential in medicine but require careful integration into healthcare settings. Ethical considerations, regulatory challenges, and interdisciplinary collaboration between AI developers and healthcare professionals are essential. Future advancements in LLM performance and reliability through techniques such as RAG, fine-tuning, and reinforcement learning will be critical to ensuring patient safety and improving healthcare delivery.
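Retrieval-augmented generation (RAG), which the review highlights as a key technique for improving LLM reliability, can be illustrated with a toy sketch. The keyword-overlap retriever and the corpus snippets below are hypothetical stand-ins: real RAG systems use embedding-based similarity search, not word overlap.

```python
# Toy RAG sketch: retrieve the most relevant snippet for a query, then build
# a grounded prompt that would be sent to an LLM. Retrieval here is naive
# bag-of-words overlap, standing in for embedding-based similarity search.
def retrieve(query: str, corpus: list[str]) -> str:
    q = set(query.lower().split())
    return max(corpus, key=lambda doc: len(q & set(doc.lower().split())))

def build_prompt(query: str, corpus: list[str]) -> str:
    context = retrieve(query, corpus)
    return (f"Context: {context}\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")

corpus = [
    "Metformin is a first-line therapy for type 2 diabetes.",
    "Warfarin dosing is guided by INR monitoring.",
]
prompt = build_prompt("What is first-line therapy for type 2 diabetes?", corpus)
```

Grounding the model's answer in retrieved text is what makes RAG useful for hallucination mitigation: the prompt constrains the model to a verifiable source rather than its parametric memory.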
Affiliation(s)
- Josip Vrdoljak
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Zvonimir Boban
- Department for Medical Physics, School of Medicine, University of Split, 21000 Split, Croatia
- Marino Vilović
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Marko Kumrić
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Joško Božić
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
23
Alarifi M. Appropriateness of Thyroid Nodule Cancer Risk Assessment and Management Recommendations Provided by Large Language Models. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025:10.1007/s10278-025-01454-1. [PMID: 40032759 DOI: 10.1007/s10278-025-01454-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/22/2024] [Revised: 02/12/2025] [Accepted: 02/14/2025] [Indexed: 03/05/2025]
Abstract
The study evaluates the appropriateness and reliability of thyroid nodule cancer risk assessment recommendations provided by large language models (LLMs), ChatGPT, Gemini, and Claude, in alignment with clinical guidelines from the American Thyroid Association (ATA) and the National Comprehensive Cancer Network (NCCN). A team comprising a medical imaging informatics specialist and two radiologists developed 24 clinically relevant questions based on ATA and NCCN guidelines. The readability of AI-generated responses was evaluated using the Readability Scoring System. A total of 322 radiologists in training or practice from the United States, recruited via Amazon Mechanical Turk, assessed the AI responses. Quantitative analysis using SPSS measured the appropriateness of recommendations, while qualitative feedback was analyzed through Dedoose. The study compared the performance of the three AI models, ChatGPT, Gemini, and Claude, in providing appropriate recommendations. Paired samples t-tests showed no statistically significant differences in overall performance among the models. Claude achieved the highest mean score (21.84), followed closely by ChatGPT (21.83) and Gemini (21.47). Inappropriate response rates did not differ significantly, though Gemini showed a trend toward higher rates. However, ChatGPT achieved the highest accuracy (92.5%) in providing appropriate responses, followed by Claude (92.1%) and Gemini (90.4%). Qualitative feedback highlighted ChatGPT's clarity and structure, Gemini's accessibility but shallowness, and Claude's organization with occasional divergence from focus. LLMs like ChatGPT, Gemini, and Claude show potential in supporting thyroid nodule cancer risk assessment but require clinical oversight to ensure alignment with guidelines. Claude and ChatGPT performed nearly identically overall, with Claude having the highest mean score, though the difference was marginal. Further development is necessary to enhance their reliability for clinical use.
Affiliation(s)
- Mohammad Alarifi
- Radiological Sciences Department, College of Applied Medical Sciences, King Saud University, 11451, Riyadh, Saudi Arabia.
- School of Health Studies, Northern Illinois University, 209 Wirtz Hall, DeKalb, IL, 60115, USA.
24
Deb B, Fradley M, Cook S, Barnes GD. Evaluation of Information About Cardiovascular Implications of Gender-Affirming Care From Online Chat-based Artificial Intelligence Systems. CJC Open 2025; 7:338-343. [PMID: 40182418 PMCID: PMC11963172 DOI: 10.1016/j.cjco.2024.11.020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2024] [Accepted: 11/26/2024] [Indexed: 04/05/2025] Open
Abstract
Because of restrictive laws on gender-affirming care in multiple states, patients may turn to contemporary online chatbots for recommendations. This study explored the appropriateness of such recommendations, with a team of LGBTQ-affirming cardiologists applying validated tools for assessing patient education materials. The study showed that although all systems emphasize the need for multidisciplinary care, there were notable differences in the comprehensiveness, cultural appropriateness, and presentation of their responses. GPT-4 (https://chatbotapp.ai) and Gemini (https://gemini.google.com/app) outperformed Bing (https://copilot.microsoft.com), particularly in the balanced and culturally sensitive delivery of information.
Affiliation(s)
- Brototo Deb
- Department of Medicine, MedStar Georgetown University - Washington Hospital Center, Washington DC, USA
- Michael Fradley
- Department of Cardiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Stephen Cook
- Indiana Heart Physicians, Indianapolis, Indiana, USA
- Geoffrey D. Barnes
- Frankel Cardiovascular Center and Institute for Healthcare Policy and Innovation, University of Michigan, Ann Arbor, Michigan, USA
25
Li Y, Liu F, Cai Q, Deng L, Ouyang Q, Zhang XHF, Zheng J. Invasion and metastasis in cancer: molecular insights and therapeutic targets. Signal Transduct Target Ther 2025; 10:57. [PMID: 39979279 PMCID: PMC11842613 DOI: 10.1038/s41392-025-02148-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Revised: 12/24/2024] [Accepted: 01/16/2025] [Indexed: 02/22/2025] Open
Abstract
The progression of malignant tumors leads to the development of secondary tumors in various organs, including bones, the brain, liver, and lungs. This metastatic process severely impacts the prognosis of patients, significantly affecting their quality of life and survival rates. Research efforts have consistently focused on the intricate mechanisms underlying this process and the corresponding clinical management strategies. Consequently, a comprehensive understanding of the biological foundations of tumor metastasis, identification of pivotal signaling pathways, and systematic evaluation of existing and emerging therapeutic strategies are paramount to enhancing the overall diagnostic and treatment capabilities for metastatic tumors. However, current research is primarily focused on metastasis within specific cancer types, leaving significant gaps in our understanding of the complex metastatic cascade, organ-specific tropism mechanisms, and the development of targeted treatments. In this study, we examine the sequential processes of tumor metastasis, elucidate the underlying mechanisms driving organ-tropic metastasis, and systematically analyze therapeutic strategies for metastatic tumors, including those tailored to specific organ involvement. Subsequently, we synthesize the most recent advances in emerging therapeutic technologies for tumor metastasis and analyze the challenges and opportunities encountered in clinical research pertaining to bone metastasis. Our objective is to offer insights that can inform future research and clinical practice in this crucial field.
Affiliation(s)
- Yongxing Li
- Department of Urology, Urologic Surgery Center, Xinqiao Hospital, Third Military Medical University (Army Medical University), Chongqing, China
- State Key Laboratory of Trauma and Chemical Poisoning, Third Military Medical University (Army Medical University), Chongqing, China
- Fengshuo Liu
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, USA
- Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, USA
- McNair Medical Institute, Baylor College of Medicine, Houston, TX, USA
- Graduate School of Biomedical Science, Cancer and Cell Biology Program, Baylor College of Medicine, Houston, TX, USA
- Qingjin Cai
- Department of Urology, Urologic Surgery Center, Xinqiao Hospital, Third Military Medical University (Army Medical University), Chongqing, China
- State Key Laboratory of Trauma and Chemical Poisoning, Third Military Medical University (Army Medical University), Chongqing, China
- Lijun Deng
- Department of Medicinal Chemistry, Third Military Medical University (Army Medical University), Chongqing, China
- Qin Ouyang
- Department of Medicinal Chemistry, Third Military Medical University (Army Medical University), Chongqing, China
- Xiang H-F Zhang
- Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX, USA
- Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, TX, USA
- Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, TX, USA
- McNair Medical Institute, Baylor College of Medicine, Houston, TX, USA
- Ji Zheng
- Department of Urology, Urologic Surgery Center, Xinqiao Hospital, Third Military Medical University (Army Medical University), Chongqing, China
- State Key Laboratory of Trauma and Chemical Poisoning, Third Military Medical University (Army Medical University), Chongqing, China
26
Can E, Uller W, Vogt K, Doppler MC, Busch F, Bayerl N, Ellmann S, Kader A, Elkilany A, Makowski MR, Bressem KK, Adams LC. Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis. Acad Radiol 2025; 32:888-898. [PMID: 39353826 DOI: 10.1016/j.acra.2024.09.041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 09/15/2024] [Accepted: 09/17/2024] [Indexed: 10/04/2024]
Abstract
PURPOSE To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5 Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology reports. METHODS Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis. RESULTS Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus for any metrics evaluated (all Bonferroni-corrected p-values: p = 1), while they outperformed other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B showing the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus in all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84). CONCLUSIONS GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified interventional radiology (IR) reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
CLINICAL RELEVANCE/APPLICATIONS With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
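The readability metrics used in the study above follow standard published formulas. As an illustration, a minimal sketch of the Flesch Reading Ease score; the vowel-group syllable counter is a crude heuristic of my own, not the dictionary-based counting that production readability tools use:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count groups of consecutive vowels;
    # every word is assumed to have at least one syllable.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> float:
    # Standard Flesch formula:
    # 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_sent = max(1, len(sentences))
    n_words = max(1, len(words))
    return 206.835 - 1.015 * (n_words / n_sent) - 84.6 * (syllables / n_words)
```

Higher scores indicate easier text; the reported median of 69.01 for GPT-4 corresponds to roughly "plain English" on the conventional Flesch scale, versus the harder 59.74 for Claude-3-Opus.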
Affiliation(s)
- Elif Can
- Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.)
- Wibke Uller
- Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.)
- Katharina Vogt
- Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.)
- Michael C Doppler
- Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.)
- Felix Busch
- Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.)
- Nadine Bayerl
- Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.)
- Stephan Ellmann
- Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.)
- Avan Kader
- Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.)
- Aboelyazid Elkilany
- Department of Diagnostic and Interventional Radiology, University Hospital Leipzig, Leipzig, Saxony, Germany (A.E.)
- Marcus R Makowski
- Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.)
- Keno K Bressem
- Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.)
- Lisa C Adams
- Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.)
27
Crowe B, Shah S, Teng D, Ma SP, DeCamp M, Rosenberg EI, Rodriguez JA, Collins BX, Huber K, Karches K, Zucker S, Kim EJ, Rotenstein L, Rodman A, Jones D, Richman IB, Henry TL, Somlo D, Pitts SI, Chen JH, Mishuris RG. Recommendations for Clinicians, Technologists, and Healthcare Organizations on the Use of Generative Artificial Intelligence in Medicine: A Position Statement from the Society of General Internal Medicine. J Gen Intern Med 2025; 40:694-702. [PMID: 39531100 PMCID: PMC11861482 DOI: 10.1007/s11606-024-09102-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Accepted: 09/27/2024] [Indexed: 11/16/2024]
Abstract
Generative artificial intelligence (generative AI) is a new technology with potentially broad applications across important domains of healthcare, but serious questions remain about how to balance the promise of generative AI against unintended consequences from adoption of these tools. In this position statement, we provide recommendations on behalf of the Society of General Internal Medicine on how clinicians, technologists, and healthcare organizations can approach the use of these tools. We focus on three major domains of medical practice where clinicians and technology experts believe generative AI will have substantial immediate and long-term impacts: clinical decision-making, health systems optimization, and the patient-physician relationship. Additionally, we highlight our most important generative AI ethics and equity considerations for these stakeholders. For clinicians, we recommend approaching generative AI similarly to other important biomedical advancements, critically appraising its evidence and utility and incorporating it thoughtfully into practice. For technologists developing generative AI for healthcare applications, we recommend a major frameshift in thinking away from the expectation that clinicians will "supervise" generative AI. Rather, these organizations and individuals should hold themselves and their technologies to the same set of high standards expected of the clinical workforce and strive to design high-performing, well-studied tools that improve care and foster the therapeutic relationship, not simply those that improve efficiency or market share. We further recommend deep and ongoing partnerships with clinicians and patients as necessary collaborators in this work. And for healthcare organizations, we recommend pursuing a combination of both incremental and transformative change with generative AI, directing resources toward both endeavors, and avoiding the urge to rapidly displace the human clinical workforce with generative AI. 
We affirm that the practice of medicine remains a fundamentally human endeavor which should be enhanced by technology, not displaced by it.
Affiliation(s)
- Byron Crowe
- Division of General Internal Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Shreya Shah
- Department of Medicine, Stanford University, Palo Alto, CA, USA
- Division of Primary Care and Population Health, Stanford Healthcare AI Applied Research Team, Stanford University School of Medicine, Palo Alto, CA, USA
- Derek Teng
- Division of General Internal Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Stephen P Ma
- Division of Hospital Medicine, Stanford, CA, USA
- Matthew DeCamp
- Department of Medicine, University of Colorado, Aurora, CO, USA
- Eric I Rosenberg
- Division of General Internal Medicine, Department of Medicine, University of Florida College of Medicine, Gainesville, FL, USA
- Jorge A Rodriguez
- Harvard Medical School, Boston, MA, USA
- Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Benjamin X Collins
- Division of General Internal Medicine and Public Health, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Kathryn Huber
- Department of Internal Medicine, Kaiser Permanente, Denver, CO, School of Medicine, University of Colorado, Aurora, CO, USA
- Kyle Karches
- Department of Internal Medicine, Saint Louis University, Saint Louis, MO, USA
- Shana Zucker
- Department of Internal Medicine, University of Miami Miller School of Medicine, Jackson Memorial Hospital, Miami, FL, USA
- Eun Ji Kim
- Northwell Health, New Hyde Park, NY, USA
- Lisa Rotenstein
- Divisions of General Internal Medicine and Clinical Informatics, Department of Medicine, University of California at San Francisco, San Francisco, CA, USA
- Adam Rodman
- Division of General Internal Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Danielle Jones
- Division of General Internal Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Ilana B Richman
- Section of General Internal Medicine, Yale School of Medicine, New Haven, CT, USA
- Tracey L Henry
- Division of General Internal Medicine, Emory University School of Medicine, Atlanta, GA, USA
- Diane Somlo
- Harvard Medical School, Boston, MA, USA
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Samantha I Pitts
- Division of General Internal Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Jonathan H Chen
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Division of Hospital Medicine, Stanford, CA, USA
- Clinical Excellence Research Center, Stanford, CA, USA
- Rebecca G Mishuris
- Harvard Medical School, Boston, MA, USA
- Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Digital, Mass General Brigham, Somerville, MA, USA
28
Eisinger F, Holderried F, Mahling M, Stegemann-Philipps C, Herrmann-Werner A, Nazarenus E, Sonanini A, Guthoff M, Eickhoff C, Holderried M. What's Going On With Me and How Can I Better Manage My Health? The Potential of GPT-4 to Transform Discharge Letters Into Patient-Centered Letters to Enhance Patient Safety: Prospective, Exploratory Study. J Med Internet Res 2025; 27:e67143. [PMID: 39836954 PMCID: PMC11795158 DOI: 10.2196/67143] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2024] [Revised: 11/22/2024] [Accepted: 11/28/2024] [Indexed: 01/23/2025] Open
Abstract
BACKGROUND For hospitalized patients, the discharge letter serves as a crucial source of medical information, outlining important discharge instructions and health management tasks. However, these letters are often written in professional jargon, making them difficult for patients with limited medical knowledge to understand. Large language models, such as GPT, have the potential to transform these discharge summaries into patient-friendly letters, improving accessibility and understanding. OBJECTIVE This study aims to use GPT-4 to convert discharge letters into more readable patient-centered letters. We evaluated how effectively and comprehensively GPT-4 identified and transferred patient safety-relevant information from the discharge letters to the transformed patient letters. METHODS Three discharge letters were created based on common medical conditions, containing 72 patient safety-relevant pieces of information, referred to as "learning objectives." GPT-4 was prompted to transform these discharge letters into patient-centered letters. The resulting patient letters were analyzed for medical accuracy, patient centricity, and the ability to identify and translate the learning objectives. Bloom's taxonomy was applied to analyze and categorize the learning objectives. RESULTS GPT-4 addressed the majority (56/72, 78%) of the learning objectives from the discharge letters. However, 11 of the 72 (15%) learning objectives were not included in the majority of the patient-centered letters. A qualitative analysis based on Bloom's taxonomy revealed that learning objectives in the "Understand" category (9/11) were more frequently omitted than those in the "Remember" category (2/11). Most of the missing learning objectives were related to the content field of "prevention of complications." By contrast, learning objectives regarding "lifestyle" and "organizational" aspects were addressed more frequently. 
Medical errors were found in a small proportion of sentences (31/787, 3.9%). In terms of patient centricity, the patient-centered letters demonstrated better readability than the discharge letters. Compared with discharge letters, they included fewer medical terms (132/860, 15.3%, vs 165/273, 60.4%), fewer abbreviations (43/860, 5%, vs 49/273, 17.9%), and more explanations of medical terms (121/131, 92.4%, vs 0/165, 0%). CONCLUSIONS Our study demonstrates that GPT-4 has the potential to transform discharge letters into more patient-centered communication. While the readability and patient centricity of the transformed letters are well-established, they do not fully address all patient safety-relevant information, resulting in the omission of key aspects. Further optimization of prompt engineering may help address this issue and improve the completeness of the transformation.
Affiliation(s)
- Felix Eisinger
- Department of Diabetology, Endocrinology, Nephrology, University of Tübingen, Tübingen, Germany
- Moritz Mahling
- Department of Medical Strategy, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany
- Anne Herrmann-Werner
- Tübingen Institute for Medical Education, University of Tübingen, Tübingen, Germany
- Eric Nazarenus
- Tübingen Institute for Medical Education, University of Tübingen, Tübingen, Germany
- Alessandra Sonanini
- Tübingen Institute for Medical Education, University of Tübingen, Tübingen, Germany
- Martina Guthoff
- Department of Diabetology, Endocrinology, Nephrology, University of Tübingen, Tübingen, Germany
- Carsten Eickhoff
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
- Martin Holderried
- Department of Medical Strategy, Process and Quality Management, University Hospital Tübingen, Tübingen, Germany
29
Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, Demner-Fushman D, Dligach D, Daneshjou R, Fernandes C, Hansen LH, Landman A, Lehmann L, McCoy LG, Miller T, Moreno A, Munch N, Restrepo D, Savova G, Umeton R, Gichoya JW, Collins GS, Moons KGM, Celi LA, Bitterman DS. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med 2025; 31:60-69. [PMID: 39779929 PMCID: PMC12104976 DOI: 10.1038/s41591-024-03425-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2024] [Accepted: 11/21/2024] [Indexed: 01/11/2025]
Abstract
Large language models (LLMs) are rapidly being adopted in healthcare, necessitating standardized reporting guidelines. We present transparent reporting of a multivariable model for individual prognosis or diagnosis (TRIPOD)-LLM, an extension of the TRIPOD + artificial intelligence statement, addressing the unique challenges of LLMs in biomedical applications. TRIPOD-LLM provides a comprehensive checklist of 19 main items and 50 subitems, covering key aspects from title to discussion. The guidelines introduce a modular format accommodating various LLM research designs and tasks, with 14 main items and 32 subitems applicable across all categories. Developed through an expedited Delphi process and expert consensus, TRIPOD-LLM emphasizes transparency, human oversight and task-specific performance reporting. We also introduce an interactive website ( https://tripod-llm.vercel.app/ ) facilitating easy guideline completion and PDF generation for submission. As a living document, TRIPOD-LLM will evolve with the field, aiming to enhance the quality, reproducibility and clinical applicability of LLM research in healthcare through comprehensive reporting.
Affiliation(s)
- Jack Gallifant
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Critical Care, Guy's and St Thomas' NHS Foundation Trust, London, UK
- Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA
- Majid Afshar
- Department of Medicine, University of Wisconsin-Madison, Madison, WI, USA
- Saleem Ameen
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tasmanian School of Medicine, College of Health and Medicine, University of Tasmania, Hobart, Tasmania, Australia
- Yindalon Aphinyanaphongs
- Department of Population Health, NYU Grossman School of Medicine and Langone Health, New York, NY, USA
- Shan Chen
- Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA, USA
- Giovanni Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
- Artificial Intelligence Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Dmitriy Dligach
- Department of Computer Science, Loyola University, Chicago, IL, USA
- Roxana Daneshjou
- Department of Dermatology, Stanford School of Medicine, Redwood City, CA, USA
- Department of Biomedical Data Science, Stanford School of Medicine, Redwood City, CA, USA
- Chrystinne Fernandes
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Lasse Hyldig Hansen
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Cognitive Science, Aarhus University, Jens Chr. Skou 2, Aarhus, Denmark
- Liam G McCoy
- Faculty of Medicine and Dentistry, University of Alberta, Edmonton, Alberta, Canada
- Timothy Miller
- Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Amy Moreno
- Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
- Nikolaj Munch
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Cognitive Science, Aarhus University, Jens Chr. Skou 2, Aarhus, Denmark
- David Restrepo
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Departamento de Telematica, Universidad del Cauca, Popayan, Colombia
- Guergana Savova
- Computational Health Informatics Program, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
- Judy Wawira Gichoya
- Department of Radiology, Emory University School of Medicine, Atlanta, GA, USA
- Gary S Collins
- Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology, and Musculoskeletal Sciences, University of Oxford, Oxford, UK
- UK EQUATOR Centre, Nuffield Department of Orthopaedics, Rheumatology, and Musculoskeletal Sciences, University of Oxford, Oxford, UK
- Karel G M Moons
- Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht University, Utrecht, the Netherlands
- Health Innovation Netherlands (HINL), Utrecht, the Netherlands
- Leo A Celi
- Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA
- Division of Pulmonary, Critical Care and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Danielle S Bitterman
- Artificial Intelligence in Medicine (AIM) Program, Mass General Brigham, Harvard Medical School, Boston, MA, USA
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Boston, MA, USA
| |
Collapse
|
30
|
Hartman V, Zhang X, Poddar R, McCarty M, Fortenko A, Sholle E, Sharma R, Campion T, Steel PAD. Developing and Evaluating Large Language Model-Generated Emergency Medicine Handoff Notes. JAMA Netw Open 2024; 7:e2448723. [PMID: 39625719 PMCID: PMC11615705 DOI: 10.1001/jamanetworkopen.2024.48723] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/21/2024] [Accepted: 10/07/2024] [Indexed: 12/06/2024] Open
Abstract
Importance An emergency medicine (EM) handoff note generated by a large language model (LLM) has the potential to reduce physician documentation burden without compromising the safety of EM-to-inpatient (IP) handoffs. Objective To develop LLM-generated EM-to-IP handoff notes and evaluate their accuracy and safety compared with physician-written notes. Design, Setting, and Participants This cohort study used EM patient medical records with acute hospital admissions that occurred in 2023 at NewYork-Presbyterian/Weill Cornell Medical Center. A customized clinical LLM pipeline was trained, tested, and evaluated to generate templated EM-to-IP handoff notes. Using both conventional automated methods (ie, recall-oriented understudy for gisting evaluation [ROUGE], bidirectional encoder representations from transformers score [BERTScore], and source chunking approach for large-scale inconsistency evaluation [SCALE]) and a novel patient safety-focused framework, LLM-generated handoff notes vs physician-written notes were compared. Data were analyzed from October 2023 to March 2024. Exposure LLM-generated EM handoff notes. Main Outcomes and Measures LLM-generated handoff notes were evaluated for (1) lexical similarity with respect to physician-written notes using ROUGE and BERTScore; (2) fidelity with respect to source notes using SCALE; and (3) readability, completeness, curation, correctness, usefulness, and implications for patient safety using a novel framework. Results In this study of 1600 EM patient records (832 [52%] female and mean [SD] age of 59.9 [18.9] years), LLM-generated handoff notes, compared with physician-written ones, had higher ROUGE (0.322 vs 0.088), BERTScore (0.859 vs 0.796), and SCALE scores (0.691 vs 0.456), indicating the LLM-generated summaries exhibited greater similarity and more detail. 
As reviewed by 3 board-certified EM physicians, a subsample of 50 LLM-generated summaries had a mean (SD) usefulness score of 4.04 (0.86) out of 5 (compared with 4.36 [0.71] for physician-written) and a mean (SD) patient safety score of 4.06 (0.86) out of 5 (compared with 4.50 [0.56] for physician-written). None of the LLM-generated summaries were classified as a critical patient safety risk. Conclusions and Relevance In this cohort study of 1600 EM patient medical records, LLM-generated EM-to-IP handoff notes were found to be superior to physician-written summaries by conventional automated evaluation methods, but marginally inferior in usefulness and safety under a novel evaluation framework. This study suggests the importance of a physician-in-the-loop implementation design for this model and demonstrates an effective strategy for measuring the preimplementation patient safety of LLMs.
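The ROUGE scores reported above are lexical-overlap measures between a generated summary and a reference. As a minimal sketch of the unigram variant (ROUGE-1 F1), with an illustrative whitespace tokenizer and invented example strings rather than the study's data:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1, the core of ROUGE-1 (naive whitespace tokenizer)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Production implementations add stemming and n-gram/longest-common-subsequence variants, but the precision-recall structure is the same.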
Affiliation(s)
- Matthew McCarty
- Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York
- Alexander Fortenko
- Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York
- Evan Sholle
- Department of Population Health, NewYork-Presbyterian/Weill Cornell Medicine, New York
- Rahul Sharma
- Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York
- Thomas Campion
- Department of Population Health, NewYork-Presbyterian/Weill Cornell Medicine, New York
- Clinical and Translational Science Center, Weill Cornell Medicine, New York, New York
- Peter A. D. Steel
- Department of Emergency Medicine, NewYork-Presbyterian/Weill Cornell Medicine, New York
31
Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model. Otolaryngol Head Neck Surg 2024; 171:1751-1757. [PMID: 39105460 DOI: 10.1002/ohn.927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Revised: 07/03/2024] [Accepted: 07/20/2024] [Indexed: 08/07/2024]
Abstract
OBJECTIVE To use an artificial intelligence (AI)-powered large language model (LLM) to improve readability of patient handouts. STUDY DESIGN Review of online material modified by AI. SETTING Academic center. METHODS Five handout materials obtained from the American Rhinologic Society (ARS) and the American Academy of Facial Plastic and Reconstructive Surgery websites were assessed using validated readability metrics. The handouts were inputted into OpenAI's ChatGPT-4 after prompting: "Rewrite the following at a 6th-grade reading level." The understandability and actionability of both native and LLM-revised versions were evaluated using the Patient Education Materials Assessment Tool (PEMAT). Results were compared using Wilcoxon rank-sum tests. RESULTS The mean readability scores of the standard (ARS, American Academy of Facial Plastic and Reconstructive Surgery) materials corresponded to "difficult," with reading categories ranging between high school and university grade levels. Conversely, the LLM-revised handouts had an average seventh-grade reading level. LLM-revised handouts had better readability in nearly all metrics tested: Flesch-Kincaid Reading Ease (70.8 vs 43.9; P < .05), Gunning Fog Score (10.2 vs 14.42; P < .05), Simple Measure of Gobbledygook (9.9 vs 13.1; P < .05), Coleman-Liau (8.8 vs 12.6; P < .05), and Automated Readability Index (8.2 vs 10.7; P = .06). PEMAT scores were significantly higher in the LLM-revised handouts for understandability (91 vs 74%; P < .05) with similar actionability (42 vs 34%; P = .15) when compared to the standard materials. CONCLUSION Patient-facing handouts can be augmented by ChatGPT with simple prompting to tailor information with improved readability. This study demonstrates the utility of LLMs to aid in rewriting patient handouts and may serve as a tool to help optimize education materials. LEVEL OF EVIDENCE Level VI.
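The readability formulas cited above are simple functions of sentence, word, and syllable counts. As a minimal sketch of the standard Flesch Reading Ease formula, using a crude vowel-group heuristic for syllables (real tools use pronunciation dictionaries or better heuristics, so scores will differ somewhat):

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count contiguous vowel groups, minimum one per word.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words); higher = easier."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Short sentences of short words score high; dense clinical jargon scores low or even negative, which is the gap the LLM revision closes.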
Affiliation(s)
- Austin R Swisher
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Phoenix, Arizona, USA
- Arthur W Wu
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Gene C Liu
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Matthew K Lee
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Taylor R Carle
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
- Dennis M Tang
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
32
Lumbiganon S, Abou Chawareb E, Moukhtar Hammad MA, Azad B, Shah D, Yafi FA. Artificial Intelligence as a Tool for Creating Patient Visit Summary: A Scoping Review and Guide to Implementation in an Erectile Dysfunction Clinic. Curr Urol Rep 2024; 26:20. [PMID: 39556140 DOI: 10.1007/s11934-024-01237-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 08/14/2024] [Indexed: 11/19/2024]
Abstract
PURPOSE OF REVIEW In modern healthcare, the integration of artificial intelligence (AI) has revolutionized clinical practices, particularly in data management and patient visit summary creation. Manual creation of patient summaries is repetitive, time-consuming, prone to errors, and increases clinicians' workload. AI, through voice recognition and natural language processing (NLP), can automate this task more accurately and efficiently. Erectile dysfunction (ED) clinics, which deal with a specific pattern of conditions alongside broader systemic issues, can benefit greatly from AI-driven patient summaries. This scoping review examined the evidence on AI-generated patient summaries and evaluated their implementation in ED clinics. RECENT FINDINGS Of 381 articles initially identified, 11 studies were included in the analysis. These studies showcased various methodologies, such as AI-assisted clinical notes and NLP algorithms. Most studies demonstrated the ability of AI to be used in real-life clinical scenarios, and major electronic health record platforms are integrating AI into their systems. However, to date, no studies have specifically addressed AI for patient summary creation in ED clinics.
Affiliation(s)
- Supanut Lumbiganon
- Department of Urology, University of California, Irvine, CA, USA
- Department of Surgery, Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand
- Babak Azad
- Department of Urology, University of California, Irvine, CA, USA
- Dillan Shah
- Department of Urology, University of California, Irvine, CA, USA
- Faysal A Yafi
- Department of Urology, University of California, Irvine, CA, USA
33
Klang E, Apakama D, Abbott EE, Vaid A, Lampert J, Sakhuja A, Freeman R, Charney AW, Reich D, Kraft M, Nadkarni GN, Glicksberg BS. A strategy for cost-effective large language model use at health system-scale. NPJ Digit Med 2024; 7:320. [PMID: 39558090 PMCID: PMC11574261 DOI: 10.1038/s41746-024-01315-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2024] [Accepted: 10/24/2024] [Indexed: 11/20/2024] Open
Abstract
Large language models (LLMs) can optimize clinical workflows; however, the economic and computational challenges of their utilization at the health system scale are underexplored. We evaluated how concatenating queries with multiple clinical notes and tasks simultaneously affects model performance under increasing computational loads. We assessed ten LLMs of different capacities and sizes utilizing real-world patient data. We conducted >300,000 experiments of various task sizes and configurations, measuring accuracy in question-answering and the ability to properly format outputs. Performance deteriorated as the number of questions and notes increased. High-capacity models, like Llama-3-70b, had low failure rates and high accuracies. GPT-4-turbo-128k was similarly resilient across task burdens, but performance deteriorated after 50 tasks at large prompt sizes. After addressing mitigable failures, these two models can concatenate up to 50 simultaneous tasks effectively, with validation on a public medical question-answering dataset. An economic analysis demonstrated up to a 17-fold cost reduction at 50 tasks using concatenation. These results identify the limits of LLMs for effective utilization and highlight avenues for cost-efficiency at the enterprise scale.
Affiliation(s)
- Eyal Klang
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Donald Apakama
- Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Health Equity Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ethan E Abbott
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Institute for Health Equity Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Akhil Vaid
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Joshua Lampert
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Mount Sinai Fuster Heart Hospital, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ankit Sakhuja
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Robert Freeman
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Alexander W Charney
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- David Reich
- Department of Anesthesiology, Perioperative and Pain Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Monica Kraft
- The Samuel Bronfman Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Girish N Nadkarni
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
34
Lee C, Britto S, Diwan K. Evaluating the Impact of Artificial Intelligence (AI) on Clinical Documentation Efficiency and Accuracy Across Clinical Settings: A Scoping Review. Cureus 2024; 16:e73994. [PMID: 39703286 PMCID: PMC11658896 DOI: 10.7759/cureus.73994] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 11/18/2024] [Indexed: 12/21/2024] Open
Abstract
Artificial intelligence (AI) technologies (natural language processing (NLP), speech recognition (SR), and machine learning (ML)) can transform clinical documentation in healthcare. This scoping review evaluates the impact of AI on the accuracy and efficiency of clinical documentation across various clinical settings (hospital wards, emergency departments, and outpatient clinics). We found 176 articles by applying a specific search string on Ovid. To ensure a more comprehensive search process, we also performed manual searches on PubMed and BMJ, examining any relevant references we encountered. In this way, we were able to add 46 more articles, resulting in 222 articles in total. After removing duplicates, 208 articles were screened. This led to the inclusion of 36 studies. We were mostly interested in articles discussing the impact of AI technologies, such as NLP, ML, and SR, and their accuracy and efficiency in clinical documentation. To ensure that our research reflected recent work, we focused our efforts on studies published in 2019 and beyond. This criterion was pilot-tested beforehand and necessary adjustments were made. After comparing screened articles independently, we ensured inter-rater reliability (Cohen's kappa=1.0), and data extraction was completed on these 36 articles. We conducted this study according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. This scoping review shows improvements in clinical documentation using AI technologies, with an emphasis on accuracy and efficiency. There was a reduction in clinician workload, with the streamlining of the documentation processes. Subsequently, doctors also had more time for patient care. However, these articles also raised various challenges surrounding the use of AI in clinical settings. These challenges included the management of errors, legal liability, and integration of AI with electronic health records (EHRs). 
There were also some ethical concerns regarding the use of AI with patient data. AI shows massive potential for improving the day-to-day work life of doctors across various clinical settings. However, more research is needed to address the many challenges associated with its use. Studies demonstrate improved accuracy and efficiency in clinical documentation with the use of AI. With better regulatory frameworks, implementation, and research, AI can significantly reduce the burden placed on doctors by documentation.
Affiliation(s)
- Craig Lee
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
- Shawn Britto
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
- Khaled Diwan
- General Internal Medicine, University Hospitals Plymouth NHS Trust, Plymouth, GBR
35
Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne) 2024; 11:1477898. [PMID: 39534227 PMCID: PMC11554522 DOI: 10.3389/fmed.2024.1477898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Accepted: 10/03/2024] [Indexed: 11/16/2024] Open
Abstract
Introduction Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement. Materials and methods Following the PRISMA-ScR checklist and methodologies by Arksey, O'Malley, and Levac, we conducted a scoping review. We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question. Results The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted. Discussion LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.
Affiliation(s)
- Serhat Aydin
- School of Medicine, Koç University, Istanbul, Türkiye
- Mert Karabacak
- Department of Neurosurgery, Mount Sinai Health System, New York, NY, United States
- Victoria Vlachos
- College of Human Ecology, Cornell University, Ithaca, NY, United States
36
Sushil M, Zack T, Mandair D, Zheng Z, Wali A, Yu YN, Quan Y, Lituiev D, Butte AJ. A comparative study of large language model-based zero-shot inference and task-specific supervised classification of breast cancer pathology reports. J Am Med Inform Assoc 2024; 31:2315-2327. [PMID: 38900207 DOI: 10.1093/jamia/ocae146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2024] [Revised: 05/27/2024] [Accepted: 06/03/2024] [Indexed: 06/21/2024] Open
Abstract
OBJECTIVE Although supervised machine learning is popular for information extraction from clinical notes, creating large annotated datasets requires extensive domain expertise and is time-consuming. Meanwhile, large language models (LLMs) have demonstrated promising transfer learning capability. In this study, we explored whether recent LLMs could reduce the need for large-scale data annotations. MATERIALS AND METHODS We curated a dataset of 769 breast cancer pathology reports, manually labeled with 12 categories, to compare zero-shot classification capability of the following LLMs: GPT-4, GPT-3.5, Starling, and ClinicalCamel, with task-specific supervised classification performance of 3 models: random forests, long short-term memory networks with attention (LSTM-Att), and the UCSF-BERT model. RESULTS Across all 12 tasks, the GPT-4 model performed either significantly better than or as well as the best supervised model, LSTM-Att (average macro F1-score of 0.86 vs 0.75), with advantage on tasks with high label imbalance. Other LLMs demonstrated poor performance. Frequent GPT-4 error categories included incorrect inferences from multiple samples and from history, and complex task design, and several LSTM-Att errors were related to poor generalization to the test set. DISCUSSION On tasks where large annotated datasets cannot be easily collected, LLMs can reduce the burden of data labeling. However, if the use of LLMs is prohibitive, the use of simpler models with large annotated datasets can provide comparable results. CONCLUSIONS GPT-4 demonstrated the potential to speed up the execution of clinical NLP studies by reducing the need for large annotated datasets. This may increase the utilization of NLP-based variables and outcomes in clinical studies.
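The headline comparison above uses macro-averaged F1, the unweighted mean of per-class F1 scores, which is why it is sensitive to performance on rare labels in the imbalanced tasks the study highlights. A minimal sketch of the computation:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaged F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```

A classifier that predicts only the majority class scores well on accuracy but poorly on macro F1, since every missed minority class contributes a zero to the average.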
Affiliation(s)
- Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Travis Zack
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, United States
- Divneet Mandair
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, United States
- Zhiwei Zheng
- University of California, Berkeley, Berkeley, CA 94720, United States
- Ahmed Wali
- University of California, Berkeley, Berkeley, CA 94720, United States
- Yan-Ning Yu
- University of California, Berkeley, Berkeley, CA 94720, United States
- Yuwei Quan
- University of California, Berkeley, Berkeley, CA 94720, United States
- Dmytro Lituiev
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Atul J Butte
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA 94158, United States
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA 94158, United States
- Center for Data-driven Insights and Innovation, University of California, Office of the President, Oakland, CA 94607, United States
- Department of Pediatrics, University of California, San Francisco, San Francisco, CA 94158, United States
37
Zhou H, Wang HL, Duan YY, Yan ZN, Luo R, Lv XX, Xie Y, Zhang JY, Yang JM, Xue MD, Fang Y, Lu L, Liu PR, Ye ZW. Enhancing Orthopedic Knowledge Assessments: The Performance of Specialized Generative Language Model Optimization. Curr Med Sci 2024; 44:1001-1005. [PMID: 39368054 DOI: 10.1007/s11596-024-2929-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2024] [Accepted: 08/18/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE This study aimed to evaluate and compare the effectiveness of knowledge base-optimized and unoptimized large language models (LLMs) in the field of orthopedics to explore optimization strategies for the application of LLMs in specific fields. METHODS This research constructed a specialized knowledge base using clinical guidelines from the American Academy of Orthopaedic Surgeons (AAOS) and authoritative orthopedic publications. A total of 30 orthopedic-related questions covering aspects such as anatomical knowledge, disease diagnosis, fracture classification, treatment options, and surgical techniques were input into both the knowledge base-optimized and unoptimized versions of GPT-4, ChatGLM, and Spark LLM, with their generated responses recorded. The overall quality, accuracy, and comprehensiveness of these responses were evaluated by 3 experienced orthopedic surgeons. RESULTS Compared with its unoptimized counterpart, the optimized version of GPT-4 showed improvements of 15.3% in overall quality, 12.5% in accuracy, and 12.8% in comprehensiveness; ChatGLM showed improvements of 24.8%, 16.1%, and 19.6%, respectively; and Spark LLM showed improvements of 6.5%, 14.5%, and 24.7%, respectively. CONCLUSION Knowledge base optimization significantly enhances the quality, accuracy, and comprehensiveness of the responses provided by the 3 models in the orthopedic field and is therefore an effective method for improving the performance of LLMs in specific fields.
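The abstract does not detail the optimization pipeline, but knowledge-base optimization of this kind typically follows a retrieve-then-prompt pattern: find the guideline passage most relevant to the question and prepend it to the prompt. A hedged, deliberately minimal bag-of-words sketch of that pattern; the example passages and function names are invented for illustration, not taken from the study:

```python
import math
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercased word counts (naive tokenizer for illustration)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, kb: list[str]) -> str:
    """Pick the knowledge-base passage most similar to the question."""
    q = tokens(question)
    return max(kb, key=lambda passage: cosine(q, tokens(passage)))

def build_prompt(question: str, kb: list[str]) -> str:
    """Prepend the retrieved passage so the LLM answers from the guideline text."""
    return f"Context: {retrieve(question, kb)}\n\nQuestion: {question}"
```

Real systems replace the bag-of-words scorer with embedding similarity over chunked guideline text, but the prompt-construction step is the same idea.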
Collapse
Affiliation(s)
- Hong Zhou
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Hong-Lin Wang
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Yu-Yu Duan
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- College of Chinese Medicine, Hubei University of Chinese Medicine, Wuhan, 433065, China
| | - Zi-Neng Yan
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Rui Luo
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Xiang-Xin Lv
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Yi Xie
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Jia-Yao Zhang
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Jia-Ming Yang
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Ming-di Xue
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Ying Fang
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China
| | - Lin Lu
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
- Department of Orthopedics, Renmin Hospital of Wuhan University, Wuhan, 433060, China.
| | - Peng-Ran Liu
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
| | - Zhe-Wei Ye
- Department of Orthopedics Surgery, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
- Laboratory of Intelligent Medicine, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, 430022, China.
38
Markowitz DM. From complexity to clarity: How AI enhances perceptions of scientists and the public's understanding of science. PNAS NEXUS 2024; 3:pgae387. [PMID: 39290437 PMCID: PMC11406778 DOI: 10.1093/pnasnexus/pgae387] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Accepted: 08/27/2024] [Indexed: 09/19/2024]
Abstract
This article evaluated the effectiveness of using generative AI to simplify science communication and enhance the public's understanding of science. By comparing lay summaries of journal articles from PNAS, yoked to AI-generated versions, this work first assessed linguistic simplicity differences across such summaries and then public perceptions in follow-up experiments. Specifically, study 1a analyzed simplicity features of PNAS abstracts (scientific summaries) and significance statements (lay summaries), observing that lay summaries were indeed linguistically simpler, but effect size differences were small. Study 1b used a large language model, GPT-4, to create significance statements based on paper abstracts, and this more than doubled the average effect size without fine-tuning. Study 2 experimentally demonstrated that simply-written generative pre-trained transformer (GPT) summaries facilitated more favorable perceptions of scientists (they were perceived as more credible and trustworthy, but less intelligent) than more complexly written human PNAS summaries. Crucially, study 3 experimentally demonstrated that participants comprehended scientific writing better after reading simple GPT summaries compared to complex PNAS summaries. In their own words, participants also summarized scientific papers in a more detailed and concrete manner after reading GPT summaries compared to PNAS summaries of the same article. AI has the potential to engage scientific communities and the public via a simple language heuristic, advocating for its integration into scientific dissemination for a more informed society.
Affiliation(s)
- David M Markowitz
- Department of Communication, Michigan State University, East Lansing, MI 48824, USA
39
Dihan Q, Chauhan MZ, Eleiwa TK, Hassan AK, Sallam AB, Khouri AS, Chang TC, Elhusseiny AM. Using Large Language Models to Generate Educational Materials on Childhood Glaucoma. Am J Ophthalmol 2024; 265:28-38. [PMID: 38614196 DOI: 10.1016/j.ajo.2024.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 03/29/2024] [Accepted: 04/03/2024] [Indexed: 04/15/2024]
Abstract
PURPOSE To evaluate the quality, readability, and accuracy of large language model (LLM)-generated patient education materials (PEMs) on childhood glaucoma, and their ability to improve the readability of existing online information. DESIGN Cross-sectional comparative study. METHODS We evaluated responses of ChatGPT-3.5, ChatGPT-4, and Bard to 3 separate prompts requesting that they write PEMs on "childhood glaucoma." Prompt A required PEMs be "easily understandable by the average American." Prompt B required that PEMs be written "at a 6th-grade level using Simple Measure of Gobbledygook (SMOG) readability formula." We then compared responses' quality (DISCERN questionnaire, Patient Education Materials Assessment Tool [PEMAT]), readability (SMOG, Flesch-Kincaid Grade Level [FKGL]), and accuracy (Likert Misinformation scale). To assess the improvement of readability for existing online information, Prompt C requested that each LLM rewrite 20 resources from a Google search of the keyword "childhood glaucoma" to the American Medical Association-recommended "6th-grade level." Rewrites were compared on key metrics such as readability, complex words (≥3 syllables), and sentence count. RESULTS All 3 LLMs generated PEMs that were of high quality, understandability, and accuracy (DISCERN ≥4, ≥70% PEMAT understandability, Misinformation score = 1). Prompt B responses were more readable than Prompt A responses for all 3 LLMs (P ≤ .001). ChatGPT-4 generated the most readable PEMs compared to ChatGPT-3.5 and Bard (P ≤ .001). Although Prompt C responses showed a consistent reduction of mean SMOG and FKGL scores, only ChatGPT-4 achieved the specified 6th-grade reading level (4.8 ± 0.8 and 3.7 ± 1.9, respectively). CONCLUSIONS LLMs can serve as strong supplemental tools in generating high-quality, accurate, and novel PEMs, and in improving the readability of existing PEMs on childhood glaucoma.
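Several entries in this list grade text with the SMOG and Flesch-Kincaid Grade Level (FKGL) formulas referenced above. A minimal sketch of both, using a crude vowel-group syllable heuristic (real tools such as `textstat` use more careful syllable counting; this is illustrative only):

```python
import math
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count vowel groups, dropping a typical silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FKGL, SMOG) grade estimates for an English text.

    SMOG is formally defined for samples of 30+ sentences; the formula
    still computes for shorter texts but is less reliable there.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    fkgl = 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
    smog = 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
    return fkgl, smog
```

Both formulas reward shorter sentences and fewer polysyllabic words, which is why prompting an LLM for a "6th-grade level" tends to lower the scores.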
Affiliation(s)
- Qais Dihan
- Chicago Medical School (Q.D.), Rosalind Franklin University of Medicine and Science, North Chicago, Illinois, USA; Department of Ophthalmology (Q.D., M.Z.C., A.B.S., A.M.E.), Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
| | - Muhammad Z Chauhan
- Department of Ophthalmology (Q.D., M.Z.C., A.B.S., A.M.E.), Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
| | - Taher K Eleiwa
- Department of Ophthalmology (T.K.E.), Benha Faculty of Medicine, Benha University, Benha, Egypt
| | - Amr K Hassan
- Department of Ophthalmology (A.K.H.), Faculty of Medicine, South Valley University, Qena, Egypt
| | - Ahmed B Sallam
- Department of Ophthalmology (Q.D., M.Z.C., A.B.S., A.M.E.), Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA; Department of Ophthalmology (A.B.S.), Faculty of Medicine, Ain Shams University, Cairo, Egypt
| | - Albert S Khouri
- Institute of Ophthalmology & Visual Science (A.S.K.), Rutgers New Jersey Medical School, Newark, New Jersey, USA
| | - Ta C Chang
- Department of Ophthalmology (T.C.C.), Bascom Palmer Eye Institute, University of Miami Miller School of Medicine, Miami, Florida, USA
| | - Abdelrahman M Elhusseiny
- Department of Ophthalmology (Q.D., M.Z.C., A.B.S., A.M.E.), Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA; Department of Ophthalmology (A.M.E.), Boston Children's Hospital, Harvard Medical School, Boston, Massachusetts, USA.
40
Huerta N, Rao SJ, Isath A, Wang Z, Glicksberg BS, Krittanawong C. The premise, promise, and perils of artificial intelligence in critical care cardiology. Prog Cardiovasc Dis 2024; 86:2-12. [PMID: 38936757 DOI: 10.1016/j.pcad.2024.06.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/23/2024] [Accepted: 06/23/2024] [Indexed: 06/29/2024]
Abstract
Artificial intelligence (AI) is an emerging technology with numerous healthcare applications. AI could prove particularly useful in the cardiac intensive care unit (CICU) where its capacity to analyze large datasets in real-time would assist clinicians in making more informed decisions. This systematic review aimed to explore current research on AI as it pertains to the CICU. A PRISMA search strategy was carried out to identify the pertinent literature on topics including vascular access, heart failure care, circulatory support, cardiogenic shock, ultrasound, and mechanical ventilation. Thirty-eight studies were included. Although AI is still in its early stages of development, this review illustrates its potential to yield numerous benefits in the CICU.
Affiliation(s)
- Nicholas Huerta
- Department of Medicine, MedStar Union Memorial Hospital, Baltimore, MD, USA
| | - Shiavax J Rao
- Department of Medicine, MedStar Union Memorial Hospital, Baltimore, MD, USA
| | - Ameesh Isath
- Department of Cardiology, Westchester Medical Center and New York Medical College, Valhalla, NY, USA
| | - Zhen Wang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA; Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
| | - Benjamin S Glicksberg
- Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
41
Du X, Zhou Z, Wang Y, Chuang YW, Yang R, Zhang W, Wang X, Zhang R, Hong P, Bates DW, Zhou L. Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.08.11.24311828. [PMID: 39228726 PMCID: PMC11370524 DOI: 10.1101/2024.08.11.24311828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
Background Generative large language models (LLMs) represent a significant advancement in natural language processing, achieving state-of-the-art performance across various tasks. However, their application in clinical settings using real electronic health records (EHRs) is still rare and presents numerous challenges. Objective This study aims to systematically review the use of generative LLMs and the effectiveness of relevant techniques in patient care-related topics involving EHRs, summarize the challenges faced, and suggest future directions. Methods A Boolean search for peer-reviewed articles was conducted on May 19th, 2024 using PubMed and Web of Science to include research articles published since 2023, roughly one month after the release of ChatGPT. The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and conducted data extraction. Only studies utilizing generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings as reported by the included studies and proposed future directions. Results The initial search identified 6,328 unique studies, with 76 studies included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement.
Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias, with one detecting no bias and the other finding that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricating patient names in structured thyroid ultrasound reports. Additional challenges included but were not limited to the impersonal tone of LLM consultations, which made patients uncomfortable, and the difficulty patients had in understanding LLM responses. Conclusion Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diverse evaluation metrics used highlight the need for standardization. LLMs currently cannot replace physicians due to challenges such as bias, hallucinations, and impersonal responses.
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Zhengyang Zhou
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Yifei Wang
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Ya-Wen Chuang
- Division of Nephrology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan, 407219
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan, 402202
- School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan, 404328
| | - Richard Yang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
| | - Wenyu Zhang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Xinyi Wang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Rui Zhang
- Division of Computational Health Sciences, University of Minnesota, Minneapolis, MN 55455
| | - Pengyu Hong
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - David W. Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
42
Levin C, Suliman M, Naimi E, Saban M. Augmenting intensive care unit nursing practice with generative AI: A formative study of diagnostic synergies using simulation-based clinical cases. J Clin Nurs 2024. [PMID: 39101368 DOI: 10.1111/jocn.17384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2024] [Revised: 05/14/2024] [Accepted: 07/15/2024] [Indexed: 08/06/2024]
Abstract
BACKGROUND As generative artificial intelligence (GenAI) tools continue advancing, rigorous evaluations are needed to understand their capabilities relative to experienced clinicians and nurses. The aim of this study was to objectively compare the diagnostic accuracy and response formats of ICU nurses versus various GenAI models, with a qualitative interpretation of the quantitative results. METHODS This formative study utilized four written clinical scenarios representative of real ICU patient cases to simulate diagnostic challenges. The scenarios were developed by expert nurses and validated against current literature. Seventy-four ICU nurses participated in a simulation-based assessment involving the four written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified ICU nurses for accuracy, completeness and response. RESULTS Nurses consistently achieved higher diagnostic accuracy than AI across open-ended scenarios, though certain models matched or exceeded human performance on standardized cases. Reaction times also diverged substantially. Qualitative differences in response format emerged, such as concision versus verbosity. Variations in GenAI model performance across cases highlighted generalizability challenges. CONCLUSIONS While GenAI demonstrated valuable skills, experienced nurses outperformed it in open-ended domains requiring holistic judgement. Continued development to strengthen generalized decision-making abilities is warranted before autonomous clinical integration. Response format interfaces should consider leveraging distinct strengths. Rigorous mixed-methods research involving diverse stakeholders can help iteratively inform safe, beneficial human-GenAI partnerships centred on experience-guided care augmentation.
RELEVANCE TO CLINICAL PRACTICE This mixed-methods simulation study provides formative insights into optimizing collaborative models of GenAI and nursing knowledge to support patient assessment and decision-making in intensive care. The findings can help guide development of explainable GenAI decision support tailored for critical care environments. PATIENT OR PUBLIC CONTRIBUTION Patients or public were not involved in the design and implementation of the study or the analysis and interpretation of the data.
Affiliation(s)
- Chedva Levin
- Nursing Department, Faculty of School of Life and Health Sciences, The Jerusalem College of Technology-lev Academic Center, Jerusalem, Israel
- Department of Vascular Surgery, The Chaim Sheba Medical Center, Ramat Gan, Tel Aviv, Israel
| | - Moriya Suliman
- Intensive Care Unit, The Chaim Sheba Medical Center, Ramat Gan, Tel Aviv, Israel
| | - Etti Naimi
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Mor Saban
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
43
Huang T, Safranek C, Socrates V, Chartash D, Wright D, Dilip M, Sangal RB, Taylor RA. Patient-Representing Population's Perceptions of GPT-Generated Versus Standard Emergency Department Discharge Instructions: Randomized Blind Survey Assessment. J Med Internet Res 2024; 26:e60336. [PMID: 39094112 PMCID: PMC11329854 DOI: 10.2196/60336] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2024] [Revised: 06/07/2024] [Accepted: 06/22/2024] [Indexed: 08/04/2024] Open
Abstract
BACKGROUND Discharge instructions are a key form of documentation and patient communication in the time of transition from the emergency department (ED) to home. Discharge instructions are time-consuming and often underprioritized, especially in the ED, leading to discharge delays and possibly impersonal patient instructions. Generative artificial intelligence and large language models (LLMs) offer promising methods of creating high-quality and personalized discharge instructions; however, there exists a gap in understanding patient perspectives of LLM-generated discharge instructions. OBJECTIVE We aimed to assess the use of LLMs such as ChatGPT in synthesizing accurate and patient-accessible discharge instructions in the ED. METHODS We synthesized 5 unique, fictional ED encounters to emulate real ED encounters that included a diverse set of clinician history, physical notes, and nursing notes. These were passed to GPT-4 in Azure OpenAI Service (Microsoft) to generate LLM-generated discharge instructions. Standard discharge instructions were also generated for each of the 5 unique ED encounters. All GPT-generated and standard discharge instructions were then formatted into standardized after-visit summary documents. These after-visit summaries containing either GPT-generated or standard discharge instructions were randomly and blindly administered to Amazon MTurk respondents representing patient populations through Amazon MTurk Survey Distribution. Discharge instructions were assessed based on metrics of interpretability of significance, understandability, and satisfaction. RESULTS Our findings revealed that survey respondents' perspectives regarding GPT-generated and standard discharge instructions were significantly (P=.01) more favorable toward GPT-generated return precautions, and all other sections were considered noninferior to standard discharge instructions. 
Among the 156 survey respondents, GPT-generated discharge instructions were assigned favorable ratings, "agree" and "strongly agree," more frequently along the metric of interpretability of significance in discharge instruction subsections regarding diagnosis, procedures, treatment, post-ED medications or any changes to medications, and return precautions. Survey respondents found GPT-generated instructions to be more understandable when rating procedures, treatment, post-ED medications or medication changes, post-ED follow-up, and return precautions. Satisfaction with GPT-generated discharge instruction subsections was the most favorable in procedures, treatment, post-ED medications or medication changes, and return precautions. A Wilcoxon rank-sum test of Likert responses revealed significant differences (P=.01) in the interpretability of significant return precautions in GPT-generated discharge instructions compared to standard discharge instructions, but not for other evaluation metrics and discharge instruction subsections. CONCLUSIONS This study demonstrates the potential for LLMs such as ChatGPT to act as a method of augmenting current documentation workflows in the ED to reduce the documentation burden of physicians. The ability of LLMs to provide tailored instructions for patients by improving readability and making instructions more applicable to patients could improve upon the methods of communication that currently exist.
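The Wilcoxon rank-sum test this study applies to Likert responses can be sketched with the normal approximation (mid-ranks for ties, omitting the tie-correction to the variance, so p-values on heavily tied Likert data are approximate; the ratings below are invented for illustration):

```python
import math

def rank_sum_test(a: list[float], b: list[float]) -> tuple[float, float]:
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    pooled = sorted(a + b)
    # Assign each distinct value the average of the ranks it occupies (mid-ranks).
    ranks: dict[float, float] = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w = sum(ranks[x] for x in a)  # rank sum of sample a
    n1, n2 = len(a), len(b)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return z, p

# Hypothetical 5-point Likert ratings for two discharge-instruction variants
gpt_ratings = [5, 5, 4, 5, 4, 5, 4, 4]
standard_ratings = [3, 2, 3, 4, 2, 3, 3, 2]
z, p = rank_sum_test(gpt_ratings, standard_ratings)
```

In practice one would use `scipy.stats.ranksums` or `mannwhitneyu`, which also handle the tie correction.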
Affiliation(s)
- Thomas Huang
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
| | - Conrad Safranek
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
| | - Vimig Socrates
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
| | - David Chartash
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Ireland
| | - Donald Wright
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
| | - Monisha Dilip
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
| | - Rohit B Sangal
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
| | - Richard Andrew Taylor
- Department of Emergency Medicine, Yale School of Medicine, New Haven, CT, United States
- Department for Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT, United States
44
Sridharan K, Sivaramakrishnan G. Enhancing readability of USFDA patient communications through large language models: a proof-of-concept study. Expert Rev Clin Pharmacol 2024; 17:731-741. [PMID: 38823007 DOI: 10.1080/17512433.2024.2363840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Accepted: 05/31/2024] [Indexed: 06/03/2024]
Abstract
BACKGROUND The US Food and Drug Administration (USFDA) communicates new drug safety concerns through drug safety communications (DSCs) and medication guides (MGs), which often challenge patients with average reading abilities due to their complexity. This study assesses whether large language models (LLMs) can enhance the readability of these materials. METHODS We analyzed the latest DSCs and MGs, using ChatGPT 4.0© and Gemini© to simplify them to a sixth-grade reading level. Outputs were evaluated for readability, technical accuracy, and content inclusiveness. RESULTS Original materials were difficult to read (DSCs grade level 13, MGs 22). LLMs significantly improved readability, reducing them to more accessible reading levels (Single prompt - DSCs: ChatGPT 4.0© 10.1, Gemini© 8; MGs: ChatGPT 4.0© 7.1, Gemini© 6.5. Multiple prompts - DSCs: ChatGPT 4.0© 10.3, Gemini© 7.5; MGs: ChatGPT 4.0© 8, Gemini© 6.8). LLM outputs retained technical accuracy and key messages. CONCLUSION LLMs can significantly simplify complex health-related information, making it more accessible to patients. Future research should extend these findings to other languages and patient groups in real-world settings.
Affiliation(s)
- Kannan Sridharan
- Department of Pharmacology & Therapeutics, College of Medicine & Medical Sciences, Arabian Gulf University, Manama, Kingdom of Bahrain
| | - Gowri Sivaramakrishnan
- Speciality Dental Residency Program, Primary Health Care Centers, Manama, Kingdom of Bahrain
45
Zeng S, Kong Q, Wu X, Ma T, Wang L, Xu L, Kou G, Zhang M, Yang X, Zuo X, Li Y, Li Y. Artificial Intelligence-Generated Patient Education Materials for Helicobacter pylori Infection: A Comparative Analysis. Helicobacter 2024; 29:e13115. [PMID: 39097925 DOI: 10.1111/hel.13115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/21/2024] [Revised: 07/02/2024] [Accepted: 07/03/2024] [Indexed: 08/06/2024]
Abstract
BACKGROUND Patient education contributes to improving public awareness of Helicobacter pylori. Large language models (LLMs) offer opportunities to transform patient education. This study aimed to assess the quality of patient educational materials (PEMs) generated by LLMs and compare them with physician-sourced PEMs. MATERIALS AND METHODS Unified instructions for composing a PEM about H. pylori at a sixth-grade reading level, in both English and Chinese, were given to a physician and five LLMs (Bing Copilot, Claude 3 Opus, Gemini Pro, ChatGPT-4, and ERNIE Bot 4.0). Five gastroenterologists and 50 patients assessed the completeness and comprehensibility of the Chinese PEMs on a three-point Likert scale. Gastroenterologists were asked to evaluate both English and Chinese PEMs and determine their accuracy and safety; accuracy was assessed on a six-point Likert scale. The minimum acceptable scores were 4, 2, and 2 for accuracy, completeness, and comprehensibility, respectively. The Flesch-Kincaid and Simple Measure of Gobbledygook scoring systems were employed as readability assessment tools. RESULTS Accuracy and comprehensibility were acceptable for English PEMs from all sources, while completeness was not satisfactory. The physician-sourced PEM had the highest mean accuracy score (5.60); LLM-generated English PEMs ranged from 4.00 to 5.40. Completeness scores were comparable between physician-sourced and LLM-generated PEMs in English. Chinese PEMs from LLMs tended to score lower on accuracy and completeness than English PEMs. The mean completeness score of the five LLM-generated Chinese PEMs was 1.82-2.70 from the patients' perspective, higher than in the gastroenterologists' assessment. Comprehensibility was satisfactory for all PEMs. No PEM met the recommended sixth-grade reading level. CONCLUSION LLMs have potential in assisting patient education. The accuracy and comprehensibility of LLM-generated PEMs were acceptable, but further optimization to improve completeness and to account for a variety of linguistic contexts is essential for feasibility.
Affiliation(s)
- Shuyan Zeng
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
| | - Qingzhou Kong
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
| | - Xiaoqi Wu
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Tian Ma
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Limei Wang
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Leiqi Xu
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Guanjun Kou
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Mingming Zhang
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Xiaoyun Yang
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Xiuli Zuo
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Yueyue Li
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Yanqing Li
- Department of Gastroenterology, Qilu Hospital of Shandong University, Jinan, Shandong, China
- Shandong Provincial Clinical Research Center for Digestive Disease, Qilu Hospital of Shandong University, Jinan, Shandong, China
46
Santos J, Santos HDP, Ulbrich AHDPS, Faccio D, Tabalipa FDO, Nogueira RF, Costa MM. Evaluating LLMs for Diagnosis Summarization. Annu Int Conf IEEE Eng Med Biol Soc 2024; 2024:1-7. [PMID: 40039708 DOI: 10.1109/embc53108.2024.10782231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/06/2025]
Abstract
During a patient's hospitalization, extensive information is documented in clinical notes. The efficient summarization of this information is vital for keeping healthcare professionals abreast of the patient's status. This paper proposes a methodology to assess the efficacy of six large language models (LLMs) in automating the task of diagnosis summarization, particularly in discharge summaries. Our approach involves defining an automatic metric based on LLMs, highly correlated with human assessments. We evaluate the performance of the six models using the F1-Score and compare the results with those of healthcare specialists. The experiments reveal that there is room for improvement in the medical knowledge and diagnostic capabilities of LLMs. The source code and data for these experiments are available on the project's GitHub page.
47
Heinke A, Radgoudarzi N, Huang BB, Baxter SL. A review of ophthalmology education in the era of generative artificial intelligence. Asia Pac J Ophthalmol (Phila) 2024; 13:100089. [PMID: 39134176 PMCID: PMC11934932 DOI: 10.1016/j.apjo.2024.100089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2024] [Revised: 07/31/2024] [Accepted: 08/02/2024] [Indexed: 08/18/2024] Open
Abstract
PURPOSE To explore the integration of generative AI, specifically large language models (LLMs), in ophthalmology education and practice, addressing their applications, benefits, challenges, and future directions. DESIGN A literature review and analysis of current AI applications and educational programs in ophthalmology. METHODS Analysis of published studies, reviews, articles, websites, and institutional reports on AI use in ophthalmology. Examination of educational programs incorporating AI, including curriculum frameworks, training methodologies, and evaluations of AI performance on medical examinations and clinical case studies. RESULTS Generative AI, particularly LLMs, shows potential to improve diagnostic accuracy and patient care in ophthalmology. Applications include supporting the education of patients, physicians, and medical students. However, challenges such as AI hallucinations, biases, lack of interpretability, and outdated training data limit clinical deployment. Studies revealed varying levels of accuracy of LLMs on ophthalmology board exam questions, underscoring the need for more reliable AI integration. Several educational programs nationwide provide AI and data science training relevant to clinical medicine and ophthalmology. CONCLUSIONS Generative AI and LLMs offer promising advancements in ophthalmology education and practice. Addressing challenges through comprehensive curricula that include fundamental AI principles, ethical guidelines, and updated, unbiased training data is crucial. Future directions include developing clinically relevant evaluation metrics, implementing hybrid models with human oversight, leveraging image-rich data, and benchmarking AI performance against ophthalmologists. Robust policies on data privacy, security, and transparency are essential for fostering a safe and ethical environment for AI applications in ophthalmology.
Collapse
Affiliation(s)
- Anna Heinke
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Jacobs Retina Center, 9415 Campus Point Drive, La Jolla, CA 92037, USA
- Niloofar Radgoudarzi
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA
- Bonnie B Huang
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA; Northwestern University Feinberg School of Medicine, Chicago, IL, USA
- Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA