1
Rokhshad R, Khoury ZH, Mohammad-Rahimi H, Motie P, Price JB, Tavares T, Jessri M, Bavarian R, Sciubba JJ, Sultan AS. Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology. Oral Surg Oral Med Oral Pathol Oral Radiol 2025; 139:719-728. [PMID: 39843286 DOI: 10.1016/j.oooo.2024.12.028]
Abstract
OBJECTIVES Artificial intelligence chatbots have demonstrated feasibility and efficacy in improving health outcomes. In this study, responses from 5 different publicly available AI chatbots-Bing, GPT-3.5, GPT-4, Google Bard, and Claude-to frequently asked questions related to oral cancer were evaluated. STUDY DESIGN Relevant patient-related frequently asked questions about oral cancer were obtained from two main sources: public health websites and social media platforms. From these sources, 20 oral cancer-related questions were selected. Four board-certified specialists in oral medicine/oral and maxillofacial pathology assessed the answers using a modified version of the global quality score on a 5-point Likert scale. Additionally, readability was measured using the Flesch-Kincaid Grade Level and Flesch Reading Ease scores. Responses were also assessed for empathy using a validated 5-point scale. RESULTS Specialists ranked GPT-4 highest, with a total score of 17.3 ± 1.5, while Bing received the lowest at 14.9 ± 2.2. Bard had the highest Flesch Reading Ease score of 62 ± 7, while GPT-3.5 and Claude received the lowest scores (more challenging readability). GPT-4 and Bard emerged as the superior chatbots in terms of empathy and accurate citations on patient-related frequently asked questions pertaining to oral cancer. GPT-4 had the highest overall quality, whereas Bing showed the lowest levels of quality, empathy, and citation accuracy. CONCLUSION GPT-4 demonstrated the highest-quality responses to frequently asked questions pertaining to oral cancer. Although impressive in their ability to guide patients on common oral cancer topics, most chatbots did not perform well when assessed for empathy or citation accuracy.
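The two readability metrics cited above are closed-form formulas over sentence, word, and syllable counts, so they are easy to reproduce. The sketch below is a minimal Python version; the regex-based syllable counter is a crude heuristic introduced here (not part of the study), so its scores will only approximate dedicated readability tools.

```python
# Flesch-Kincaid Grade Level and Flesch Reading Ease from raw text.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences           # words per sentence
    spw = syllables / len(words)           # syllables per word
    fk_grade = 0.39 * wps + 11.8 * spw - 15.59
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    return fk_grade, reading_ease

grade, ease = readability(
    "Oral cancer screening is recommended. Ask your dentist about risk factors."
)
print(f"Flesch-Kincaid grade: {grade:.1f}, Flesch Reading Ease: {ease:.1f}")
```

Higher Reading Ease (as reported for Bard) indicates easier text, while a higher grade level indicates harder text.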
Affiliation(s)
- Rata Rokhshad
- Department of Pediatric Dentistry, Loma Linda School of Dentistry, CA, USA
- Zaid H Khoury
- Department of Oral Diagnostic Sciences & Research, School of Dentistry, Meharry Medical College, TN, USA
- Parisa Motie
- Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
- Jeffery B Price
- Division of Artificial Intelligence Research, Department of Oncology and Diagnostic Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA
- Tiffany Tavares
- Department of Comprehensive Dentistry, UT Health San Antonio, School of Dentistry, San Antonio, TX, USA
- Maryam Jessri
- Oral Medicine and Pathology Department, School of Dentistry, University of Queensland, Herston, QLD, Australia; Oral Medicine Department, MetroNorth Hospital and Health Services, Queensland Health, QLD, Australia
- Roxanne Bavarian
- Department of Oral and Maxillofacial Surgery, Massachusetts General Hospital, Boston, MA, USA; Department of Oral and Maxillofacial Surgery, Harvard School of Dental Medicine, Boston, MA, USA
- James J Sciubba
- Department of Otolaryngology, Head & Neck Surgery, The Johns Hopkins University, Baltimore, MD, USA
- Ahmed S Sultan
- Division of Artificial Intelligence Research, Department of Oncology and Diagnostic Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA; University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, Baltimore, MD, USA.
2
Rao A, Mu A, Enichen E, Gupta D, Hall N, Koranteng E, Marks W, Senter-Zapata MJ, Whitehead DC, White BA, Saini S, Landman AB, Succi MD. A Future of Self-Directed Patient Internet Research: Large Language Model-Based Tools Versus Standard Search Engines. Ann Biomed Eng 2025; 53:1199-1208. [PMID: 40025252 DOI: 10.1007/s10439-025-03701-6]
Abstract
PURPOSE As generalist large language models (LLMs) become more commonplace, patients will inevitably increasingly turn to these tools instead of traditional search engines. Here, we evaluate publicly available LLM-based chatbots as tools for patient education through physician review of responses provided by Google, Bard, GPT-3.5 and GPT-4 to commonly searched queries about prevalent chronic health conditions in the United States. METHODS Five distinct commonly Google-searched queries were selected for (i) hypertension, (ii) hyperlipidemia, (iii) diabetes, (iv) anxiety, and (v) mood disorders and prompted into each model of interest. Responses were assessed by board-certified physicians for accuracy, comprehensiveness, and overall quality on a five-point Likert scale. The Flesch-Kincaid Grade Levels were calculated to assess readability. RESULTS GPT-3.5 (4.40 ± 0.48, 4.29 ± 0.43) and GPT-4 (4.35 ± 0.30, 4.24 ± 0.28) received higher ratings in comprehensiveness and quality than Bard (3.79 ± 0.36, 3.87 ± 0.32) and Google (1.87 ± 0.42, 2.11 ± 0.47), all p < 0.05. However, Bard (9.45 ± 1.35) and Google responses (9.92 ± 5.31) had a lower average Flesch-Kincaid Grade Level compared to GPT-3.5 (14.69 ± 1.57) and GPT-4 (12.88 ± 2.02), indicating greater readability. CONCLUSION This study suggests that publicly available LLM-based tools may provide patients with more accurate responses to queries on chronic health conditions than answers provided by Google search. These results provide support for the use of these tools in place of traditional search engines for health-related queries.
Affiliation(s)
- Arya Rao
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Andrew Mu
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Elizabeth Enichen
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Dhruva Gupta
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Nathan Hall
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Erica Koranteng
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- William Marks
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Harvard Business School, Boston, MA, USA
- Michael J Senter-Zapata
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Mass General Brigham, Boston, MA, USA
- David C Whitehead
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Benjamin A White
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Sanjay Saini
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Adam B Landman
- Harvard Medical School, Boston, MA, USA
- Mass General Brigham, Boston, MA, USA
- Marc D Succi
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Mass General Brigham, Boston, MA, USA
3
Zhu J, Jiang Y, Chen D, Lu Y, Huang Y, Lin Y, Fan P. High identification and positive-negative discrimination but limited detailed grading accuracy of ChatGPT-4o in knee osteoarthritis radiographs. Knee Surg Sports Traumatol Arthrosc 2025; 33:1911-1919. [PMID: 40053915 DOI: 10.1002/ksa.12639]
Abstract
PURPOSE To explore the potential of ChatGPT-4o in analysing radiographic images of knee osteoarthritis (OA) and to assess its grading accuracy, feature identification and reliability, thereby helping surgeons to improve diagnostic accuracy and efficiency. METHODS A total of 117 anterior-posterior knee radiographs from patients (23.1% men, 76.9% women, mean age 69.7 ± 7.99 years) were analysed. Two senior orthopaedic surgeons and ChatGPT-4o independently graded images with the Kellgren-Lawrence (K-L), Ahlbäck and International Knee Documentation Committee (IKDC) systems. A consensus reference standard was established by a third radiologist. ChatGPT-4o's performance metrics (accuracy, precision, recall and F1 score) were calculated, and its reliability was assessed via two evaluations separated by a 2-week interval, with intraclass correlation coefficients (ICCs) determined. RESULTS ChatGPT-4o achieved a 100% identification rate for knee radiographs and demonstrated strong binary classification performance (precision: 0.95, recall: 0.83, F1 score: 0.88). However, its detailed grading accuracy (35%) was substantially lower than that of the surgeons (89.6%). Severe underestimation of OA severity occurred in 49.3% of the cases. Interrater reliability for the surgeons was excellent (ICC: 0.78-0.91), whereas ChatGPT-4o showed poor initial consistency (ICC: 0.16-0.28), improving marginally in the second evaluation (ICC: 0.22-0.39). CONCLUSION ChatGPT-4o has the potential to rapidly identify knee OA on radiographs and perform binary (positive-negative) classification. However, its detailed grading accuracy remains suboptimal, with a notable tendency to underestimate severe cases. This limits its current clinical utility for precise staging. Future research should focus on optimising its grading performance and improving accuracy to enhance diagnostic reliability. LEVEL OF EVIDENCE Level III, retrospective comparative study.
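The binary-classification figures reported for ChatGPT-4o follow directly from a confusion table. A minimal sketch, with hypothetical true-positive/false-positive/false-negative counts chosen only to land near the reported precision and recall, shows how the F1 score is derived:

```python
# Precision, recall, and F1 from confusion counts (binary OA-positive vs OA-negative).
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

tp, fp, fn = 83, 4, 17   # hypothetical counts giving ~0.95 precision and ~0.83 recall
p, r, f1 = prf(tp, fp, fn)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")  # close to the reported 0.95/0.83/0.88
```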
Affiliation(s)
- Jiesheng Zhu
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
- Yilun Jiang
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
- Daosen Chen
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
- Yi Lu
- Department of Radiology, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, Zhejiang, China
- Yijiang Huang
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
- Yimu Lin
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
- Pei Fan
- Department of Orthopedics, The Second Affiliated Hospital of Wenzhou Medical University, Yuying Children's Hospital, Wenzhou, China
4
Harkos C, Hadjigeorgiou AG, Voutouri C, Kumar AS, Stylianopoulos T, Jain RK. Using mathematical modelling and AI to improve delivery and efficacy of therapies in cancer. Nat Rev Cancer 2025; 25:324-340. [PMID: 39972158 DOI: 10.1038/s41568-025-00796-w]
Abstract
Mathematical modelling has proven to be a valuable tool in predicting the delivery and efficacy of molecular, antibody-based, nano and cellular therapy in solid tumours. Mathematical models based on our understanding of the biological processes at subcellular, cellular and tissue level are known as mechanistic models that, in turn, are divided into continuous and discrete models. Continuous models are further divided into lumped parameter models - for describing the temporal distribution of medicine in tumours and normal organs - and distributed parameter models - for studying the spatiotemporal distribution of therapy in tumours. Discrete models capture interactions at the cellular and subcellular levels. Collectively, these models are useful for optimizing the delivery and efficacy of molecular, nanoscale and cellular therapy in tumours by incorporating the biological characteristics of tumours, the physicochemical properties of drugs, the interactions among drugs, cancer cells and various components of the tumour microenvironment, and for enabling patient-specific predictions when combined with medical imaging. Artificial intelligence-based methods, such as machine learning, have ushered in a new era in oncology. These data-driven approaches complement mechanistic models and have immense potential for improving cancer detection, treatment and drug discovery. Here we review these diverse approaches and suggest ways to combine mechanistic and artificial intelligence-based models to further improve patient treatment outcomes.
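As a concrete illustration of the lumped-parameter class described above, a one-compartment model treats the tumour as a single well-mixed pool fed by the plasma drug concentration. The sketch below uses illustrative rate constants and an assumed exponentially decaying plasma profile; it is not a model taken from the review.

```python
# One-compartment (lumped-parameter) model of drug concentration in a tumour.
import numpy as np
from scipy.integrate import odeint

k_in, k_out = 0.8, 0.3                      # hypothetical uptake and clearance rates (1/h)

def plasma(t):
    # Assumed plasma concentration after a bolus dose, decaying exponentially (ug/mL).
    return 5.0 * np.exp(-0.5 * t)

def dCdt(C, t):
    # Mass balance: delivery from plasma minus clearance from the tumour compartment.
    return k_in * plasma(t) - k_out * C

t = np.linspace(0, 24, 200)                 # hours
C = odeint(dCdt, 0.0, t).ravel()
print(f"Peak tumour concentration ~ {C.max():.2f} ug/mL at t ~ {t[C.argmax()]:.1f} h")
```

Distributed-parameter models extend this idea with spatial transport terms (e.g., diffusion and convection), and discrete models resolve individual cells.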
Affiliation(s)
- Constantinos Harkos
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Andreas G Hadjigeorgiou
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Chrysovalantis Voutouri
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Ashwin S Kumar
- Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Triantafyllos Stylianopoulos
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus.
- Rakesh K Jain
- Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
5
Ozmen BB, Mathur P. Evidence-based artificial intelligence: Implementing retrieval-augmented generation models to enhance clinical decision support in plastic surgery. J Plast Reconstr Aesthet Surg 2025; 104:414-416. [PMID: 40174259 DOI: 10.1016/j.bjps.2025.03.053]
Abstract
The rapid advancement of large language models (LLMs) has generated significant enthusiasm within healthcare, especially in supporting clinical decision-making and patient management. However, inherent limitations including hallucinations, outdated clinical context, and unreliable references pose serious concerns for their clinical utility. Retrieval-Augmented Generation (RAG) models address these limitations by integrating validated, curated medical literature directly into AI workflows, significantly enhancing the accuracy, relevance, and transparency of generated outputs. This viewpoint discusses how RAG frameworks can specifically benefit plastic and reconstructive surgery by providing contextually accurate, evidence-based, and clinically grounded support for decision-making. Potential clinical applications include clinical decision support, efficient evidence synthesis, customizable patient education, informed consent materials, multilingual capabilities, and structured surgical documentation. By querying specialized databases that incorporate contemporary guidelines and literature, RAG models can markedly reduce inaccuracies and increase the reliability of AI-generated responses. However, the implementation of RAG technology demands rigorous database curation, regular updating with guidelines from surgical societies, and ongoing validation to maintain clinical relevance. Addressing challenges related to data privacy, governance, ethical considerations, and user training remains critical for successful clinical adoption. In conclusion, RAG models represent a significant advancement in overcoming traditional LLM limitations, promoting transparency and clinical accuracy with great potential for plastic surgery. Plastic surgeons and researchers are encouraged to explore and integrate these innovative generative AI frameworks to enhance patient care, surgical outcomes, communication, documentation quality, and education.
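The core RAG loop the authors describe (retrieve curated evidence, then generate an answer grounded in it) can be sketched compactly. Everything below is an assumption of this sketch rather than the system proposed in the article: the three-snippet corpus is invented, retrieval uses simple TF-IDF cosine similarity, and the final LLM call is omitted, with only the grounded prompt being built.

```python
# Minimal retrieval step for a RAG pipeline: find relevant passages, then
# build a prompt that instructs the model to answer only from that evidence.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [  # hypothetical curated snippets standing in for guideline text
    "Free flap monitoring: assess colour, capillary refill, and temperature hourly.",
    "Prophylactic antibiotics are recommended within 60 minutes of incision.",
    "Informed consent should document risks of hematoma, infection, and scarring.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the query (TF-IDF cosine)."""
    vectorizer = TfidfVectorizer().fit(corpus + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(corpus)).ravel()
    return [corpus[i] for i in scores.argsort()[::-1][:k]]

def build_prompt(query: str) -> str:
    """Ground the generation step in retrieved evidence instead of free recall."""
    evidence = "\n".join(f"- {p}" for p in retrieve(query))
    return (f"Answer using ONLY the evidence below and cite it.\n"
            f"Evidence:\n{evidence}\nQuestion: {query}")

print(build_prompt("What should informed consent cover before surgery?"))
```

Because generation is constrained to retrieved, versioned sources, stale or fabricated references become easier to detect and correct, which is the transparency benefit the viewpoint emphasizes.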
Affiliation(s)
- Berk B Ozmen
- Department of Plastic Surgery, Cleveland Clinic, Cleveland, OH, USA.
- Piyush Mathur
- Department of General Anesthesiology, Cleveland Clinic, Cleveland, OH, USA; BrainXAI ReSearch, BrainX LLC, Cleveland, OH, USA
6
Luo D, Liu M, Yu R, Liu Y, Jiang W, Fan Q, Kuang N, Gao Q, Yin T, Zheng Z. Evaluating the performance of GPT-3.5, GPT-4, and GPT-4o in the Chinese National Medical Licensing Examination. Sci Rep 2025; 15:14119. [PMID: 40269046 PMCID: PMC12018924 DOI: 10.1038/s41598-025-98949-2]
Abstract
This study aims to compare and evaluate the performance of GPT-3.5, GPT-4, and GPT-4o in the 2020 and 2021 Chinese National Medical Licensing Examination (NMLE), exploring their potential value in medical education and clinical applications. Six hundred original test questions from the 2020 and 2021 NMLE (covering five question types) were selected and input into GPT-3.5, GPT-4, and GPT-4o for response. The accuracy of the models across different question types and units was recorded and analyzed. Statistical methods were employed to compare the performance differences among the three models. GPT-4o demonstrated significantly higher overall accuracy than GPT-4 and GPT-3.5 (P < 0.001). In the 2020 and 2021 exams, GPT-4o achieved accuracy rates of 84.2% and 88.2%, respectively, with the highest accuracy observed in questions related to the digestive system (Unit 3), reaching 94.75%. GPT-4 showed moderate performance, while GPT-3.5 had the lowest accuracy. Additionally, GPT-4o exhibited a clear advantage in complex question formats, such as case analysis questions (A3/A4 type) and standard matching questions (B1 type). GPT-4o outperformed its predecessors in the NMLE, demonstrating exceptional comprehension and problem-solving abilities in non-English medical examinations. This study provides important insights into the application and promotion of generative AI in medical education and clinical practice.
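One conventional way to test whether overall accuracy differs across the three models is a chi-squared test on counts of correct and incorrect answers. The table below is a rough reconstruction: the GPT-4o row approximates the reported ~84-88% accuracy over 600 questions, while the GPT-4 and GPT-3.5 rows are hypothetical, since their exact rates are not given in the abstract.

```python
# Chi-squared test of independence between model and answer correctness.
from scipy.stats import chi2_contingency

# rows: GPT-4o, GPT-4, GPT-3.5; columns: correct, incorrect (600 questions each)
table = [
    [517, 83],    # ~86% pooled accuracy for GPT-4o (approximate)
    [450, 150],   # hypothetical
    [330, 270],   # hypothetical
]
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.3g}")
```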
Affiliation(s)
- Dingyuan Luo
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Mengke Liu
- Department of Radiology, Affiliated Shandong Provincial Hospital, Shandong First Medical University, Jinan, 250021, Shandong, China
- Runyuan Yu
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Yulian Liu
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Wenjun Jiang
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Qi Fan
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Naifeng Kuang
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Qiang Gao
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China
- Tao Yin
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China.
- Zuncheng Zheng
- Department of Rehabilitation Medicine Center, Affiliated Tai'an Central Hospital, Qingdao University, No. 29, Longtan Road, Taishan District, Tai'an City, 271000, Shandong, China.
7
Mikhail D, Milad D, Antaki F, Milad J, Farah A, Khairy T, El-Khoury J, Bachour K, Szigiato AA, Nayman T, Mullie GA, Duval R. Multimodal Performance of GPT-4 in Complex Ophthalmology Cases. J Pers Med 2025; 15:160. [PMID: 40278339 PMCID: PMC12028970 DOI: 10.3390/jpm15040160]
Abstract
Objectives: The integration of multimodal capabilities into GPT-4 represents a transformative leap for artificial intelligence in ophthalmology, yet its utility in scenarios requiring advanced reasoning remains underexplored. This study evaluates GPT-4's multimodal performance on open-ended diagnostic and next-step reasoning tasks in complex ophthalmology cases, comparing it against human expertise. Methods: GPT-4 was assessed across three study arms: (1) text-based case details with figure descriptions, (2) cases with text and accompanying ophthalmic figures, and (3) cases with figures only (no figure descriptions). We compared GPT-4's diagnostic and next-step accuracy across arms and benchmarked its performance against three board-certified ophthalmologists. Results: GPT-4 achieved 38.4% (95% CI [33.9%, 43.1%]) diagnostic accuracy and 57.8% (95% CI [52.8%, 62.2%]) next-step accuracy when prompted with figures without descriptions. Diagnostic accuracy declined significantly compared to text-only prompts (p = 0.007), though the next-step performance was similar (p = 0.140). Adding figure descriptions restored diagnostic accuracy (49.3%) to near parity with text-only prompts (p = 0.684). Using figures without descriptions, GPT-4's diagnostic accuracy was comparable to two ophthalmologists (p = 0.30, p = 0.41) but fell short of the highest-performing ophthalmologist (p = 0.0004). For next-step accuracy, GPT-4 was similar to one ophthalmologist (p = 0.22) but underperformed relative to the other two (p = 0.0015, p = 0.0017). Conclusions: GPT-4's diagnostic performance diminishes when relying solely on ophthalmic images without textual context, highlighting limitations in its current multimodal capabilities. Despite this, GPT-4 demonstrated comparable performance to at least one ophthalmologist on both diagnostic and next-step reasoning tasks, emphasizing its potential as an assistive tool. Future research should refine multimodal prompts and explore iterative or sequential prompting strategies to optimize AI-driven interpretation of complex ophthalmic datasets.
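The bracketed 95% confidence intervals around GPT-4's accuracy are standard binomial intervals on a proportion. The sketch below uses a hypothetical correct/total split chosen only to land near the reported 38.4% (the exact case count and CI method are not stated in the abstract), and it yields an interval of roughly the reported width.

```python
# Wilson 95% confidence interval for a diagnostic-accuracy proportion.
from statsmodels.stats.proportion import proportion_confint

correct, total = 175, 456                      # hypothetical counts (~38.4% accuracy)
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy={correct/total:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```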
Affiliation(s)
- David Mikhail
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 1A1, Canada
- Daniel Milad
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Fares Antaki
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Cole Eye Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l’Université de Montréal (CHUM), Montreal, QC H2X 3E4, Canada
- Jason Milad
- Department of Software Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Andrew Farah
- Faculty of Medicine, McGill University, Montreal, QC H3A 0G4, Canada
- Thomas Khairy
- Faculty of Medicine, McGill University, Montreal, QC H3A 0G4, Canada
- Jonathan El-Khoury
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Kenan Bachour
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Taylor Nayman
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
- Guillaume A. Mullie
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, St. Mary’s Hospital Center, Montreal, QC H3T 1M5, Canada
- Renaud Duval
- Department of Ophthalmology, University of Montreal, Montreal, QC H3T 1J4, Canada (J.E.-K.); (K.B.)
- Department of Ophthalmology, Hôpital Maisonneuve-Rosemont, Montreal, QC H1T 2M4, Canada
8
Li S, Jiang J, Yang X. Preliminary assessment of large language models' performance in answering questions on developmental dysplasia of the hip. J Child Orthop 2025:18632521251331772. [PMID: 40248439 PMCID: PMC11999979 DOI: 10.1177/18632521251331772]
Abstract
Objective To evaluate the performance of three large language models in answering questions regarding pediatric developmental dysplasia of the hip. Methods We formulated 18 open-ended clinical questions in both Chinese and English and established a gold standard set of answers to benchmark the responses of the large language models. These questions were presented to ChatGPT-4o, Gemini, and Claude 3.5 Sonnet. The responses were evaluated by two independent reviewers using a 5-point scale. The average score, rounded to the nearest whole number, was taken as the final score. A final score of 4 or 5 indicated an accurate response, whereas a final score of 1, 2, or 3 indicated an inaccurate response. Results The raters demonstrated a high level of agreement in scoring the answers, with weighted Kappa coefficients of 0.865 for Chinese responses (p < 0.001) and 0.875 for English responses (p < 0.001). No significant differences were observed among the three large language models in terms of accuracy when answering questions, with rates of 83.3%, 77.8%, and 77.8% for Claude 3.5 Sonnet, ChatGPT-4o, and Gemini in the Chinese responses (p = 1), and 83.3%, 83.3%, and 72.2% for ChatGPT-4o, Claude 3.5 Sonnet, and Gemini in the English responses (p = 0.761). In addition, there was no significant difference in the performance of the same large language model between the Chinese and English settings. Conclusions Large language models demonstrate high accuracy in delivering information on dysplasia of the hip, maintaining consistent performance across both Chinese and English, which suggests their potential utility as medical support tools. Level of evidence Level II.
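Inter-rater agreement on the 1-5 response scores was summarized with a weighted kappa. The sketch below computes Cohen's kappa with quadratic weights on two hypothetical rating vectors for the 18 questions; the quadratic weighting scheme is an assumption here, since the abstract does not state which weights were used.

```python
# Weighted Cohen's kappa between two reviewers scoring 18 answers on a 1-5 scale.
from sklearn.metrics import cohen_kappa_score

rater1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 5, 4, 2, 5, 4, 3, 4, 5]  # hypothetical
rater2 = [5, 4, 5, 3, 5, 2, 4, 4, 3, 4, 5, 4, 3, 5, 4, 3, 4, 5]  # hypothetical
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"weighted kappa = {kappa:.3f}")
```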
Affiliation(s)
- Shiwei Li
- Department of Pediatric Surgery, West China Hospital, Sichuan University, Chengdu, China
- Jun Jiang
- Department of Pediatric Surgery, West China Hospital, Sichuan University, Chengdu, China
- Xiaodong Yang
- Department of Pediatric Surgery, West China Hospital, Sichuan University, Chengdu, China
9
Tang T, Li A, Tan X, Ji Q, Si L, Bao L. Bridging Data Gaps in Oncology: Large Language Models and Collaborative Filtering for Cancer Treatment Recommendations. medRxiv [Preprint] 2025:2025.04.07.25325243. [PMID: 40297440 PMCID: PMC12036386 DOI: 10.1101/2025.04.07.25325243]
Abstract
Background Patients with rare cancers face substantial challenges due to limited evidence-based treatment options, resulting from sparse clinical trials. Advances in large language models (LLMs) and recommendation algorithms offer new opportunities to utilize all clinical trial information to improve clinical decisions. Methods We used an LLM to systematically extract and standardize more than 100,000 cancer trials from ClinicalTrials.gov. Each trial was annotated using a customized scoring system reflecting cancer-treatment interactions based on clinical outcomes and trial attributes. Using this structured data set, we implemented three state-of-the-art collaborative filtering algorithms to recommend potentially effective treatments across different cancer types. Results The LLM-driven data extraction process successfully generated a comprehensive and rigorously curated database from fragmented clinical trial information, covering 78 cancer types and 5,315 distinct interventions. Recommendation models demonstrated high predictive accuracy (cross-validated RMSE: 0.49-0.62) and identified clinically meaningful new treatments for melanoma, independently validated by oncology experts. Conclusions Our study establishes a proof of concept demonstrating that the combination of LLMs with sophisticated recommendation algorithms can systematically identify novel and clinically plausible cancer treatments. This integrated approach may accelerate the identification of effective therapies for rare cancers, ultimately improving patient outcomes by generating evidence-based treatment recommendations where traditional data sources remain limited.
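Collaborative filtering over a cancer-by-treatment score matrix can be illustrated with a small matrix-factorization model trained by stochastic gradient descent. Everything below is hypothetical (grid size, observed scores, latent dimension, and hyperparameters); it sketches the general technique rather than the three state-of-the-art algorithms used in the study, and it reports only training RMSE, whereas the study cross-validated on held-out entries.

```python
# Matrix factorization for treatment recommendation: learn latent factors for
# cancers (P) and treatments (Q) so that P @ Q.T approximates observed scores.
import numpy as np

rng = np.random.default_rng(0)
n_cancers, n_treatments, k = 6, 8, 3
R = rng.uniform(1, 5, size=(n_cancers, n_treatments))   # hypothetical scores
mask = rng.random(R.shape) < 0.6                         # which pairs are observed

P = rng.normal(scale=0.1, size=(n_cancers, k))
Q = rng.normal(scale=0.1, size=(n_treatments, k))
lr, reg = 0.02, 0.02
for _ in range(2000):                                    # SGD over observed entries
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - P[i] @ Q[j]
        P[i] += lr * (err * Q[j] - reg * P[i])
        Q[j] += lr * (err * P[i] - reg * Q[j])

pred = P @ Q.T                                           # scores for unobserved pairs too
rmse = np.sqrt(np.mean((R[mask] - pred[mask]) ** 2))
print(f"training RMSE on observed pairs: {rmse:.2f}")
```

The unobserved entries of `pred` are the candidate recommendations: treatments predicted to score highly for a cancer type that lacks direct trial evidence.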
Affiliation(s)
- Tengjie Tang
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, U.S.A
- Angkai Li
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, U.S.A
- Xingye Tan
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, U.S.A
- Qingli Ji
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, U.S.A
- Lu Si
- Key laboratory of Carcinogenesis and Translational Research, Department of Melanoma and Sarcoma, Peking University, Cancer Hospital & Institute, Beijing 100142, China
- Le Bao
- Department of Statistics, Pennsylvania State University, University Park, Pennsylvania, U.S.A
10
Hasan SS, Fury MS, Woo JJ, Kunze KN, Ramkumar PN. Ethical Application of Generative Artificial Intelligence in Medicine. Arthroscopy 2025; 41:874-885. [PMID: 39689842 DOI: 10.1016/j.arthro.2024.12.011]
Abstract
Generative artificial intelligence (AI) may revolutionize health care, providing solutions that range from enhancing diagnostic accuracy to personalizing treatment plans. However, its rapid and largely unregulated integration into medicine raises ethical concerns related to data integrity, patient safety, and appropriate oversight. One of the primary ethical challenges lies in generative AI's potential to produce misleading or fabricated information, posing risks of misdiagnosis or inappropriate treatment recommendations, which underscore the necessity for robust physician oversight. Transparency also remains a critical concern, as the closed-source nature of many large-language models prevents both patients and health care providers from understanding the reasoning behind AI-generated outputs, potentially eroding trust. The lack of regulatory approval for AI as a medical device, combined with concerns around the security of patient-derived data and AI-generated synthetic data, further complicates its safe integration into clinical workflows. Furthermore, synthetic datasets generated by AI, although valuable for augmenting research in areas with scarce data, complicate questions of data ownership, patient consent, and scientific validity. In addition, generative AI's ability to streamline administrative tasks risks depersonalizing care, further distancing providers from patients. These challenges compound the deeper issues plaguing the health care system, including the emphasis on volume and speed over value and expertise. The use of generative AI in medicine brings about mass scaling of synthetic information, thereby necessitating careful adoption to protect patient care and medical advancement. Given these considerations, generative AI applications warrant regulatory and critical scrutiny. Key starting points include establishing strict standards for data security and transparency, implementing oversight akin to institutional review boards to govern data usage, and developing interdisciplinary guidelines that involve developers, clinicians, and ethicists. By addressing these concerns, we can better align generative AI adoption with the core foundations of humanistic health care, preserving patient safety, autonomy, and trust while harnessing AI's transformative potential. LEVEL OF EVIDENCE: Level V, expert opinion.
Affiliation(s)
- Matthew S Fury
- Baton Rouge Orthopaedic Clinic, Baton Rouge, Louisiana, U.S.A
- Joshua J Woo
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
- Kyle N Kunze
- Hospital for Special Surgery, New York, New York, U.S.A
11
Wu W, Guo Y, Li Q, Jia C. Exploring the potential of large language models in identifying metabolic dysfunction-associated steatotic liver disease: A comparative study of non-invasive tests and artificial intelligence-generated responses. Liver Int 2025; 45:e16112. [PMID: 39526465 DOI: 10.1111/liv.16112]
Abstract
BACKGROUND AND AIMS This study sought to assess the capabilities of large language models (LLMs) in identifying clinically significant metabolic dysfunction-associated steatotic liver disease (MASLD). METHODS We included individuals from NHANES 2017-2018. The validity and reliability of MASLD diagnosis by GPT-3.5 and GPT-4 were quantitatively examined and compared with those of the Fatty Liver Index (FLI) and United States FLI (USFLI). A receiver operating characteristic curve analysis was conducted to assess the accuracy of MASLD diagnosis via different scoring systems. Additionally, GPT-4V's potential in clinical diagnosis using ultrasound images from MASLD patients was evaluated to provide assessments of LLM capabilities in both textual and visual data interpretation. RESULTS GPT-4 demonstrated performance comparable to FLI and USFLI in MASLD diagnosis, with AUROC values of .831 (95% CI .796-.867), .817 (95% CI .797-.837) and .827 (95% CI .807-.848), respectively. GPT-4 exhibited a trend of enhanced accuracy, clinical relevance and efficiency compared to GPT-3.5 based on clinician evaluation. Additionally, Pearson's r values between GPT-4 and FLI, as well as USFLI, were .718 and .695, respectively, indicating robust and moderate correlations. Moreover, GPT-4V showed potential in understanding characteristics from hepatic ultrasound imaging but exhibited limited interpretive accuracy in diagnosing MASLD compared to skilled radiologists. CONCLUSIONS GPT-4 achieved performance comparable to traditional risk scores in diagnosing MASLD and exhibited improved convenience, versatility and the capacity to offer user-friendly outputs. The integration of GPT-4V highlights the capacities of LLMs in handling both textual and visual medical data, reinforcing their expansive utility in healthcare practice.
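The two headline statistics here, AUROC for discriminating MASLD and Pearson's r between scoring systems, are simple to compute once a binary label and continuous scores are available. The sketch below runs both on synthetic toy arrays (an assumption of this sketch, not NHANES data or the study's actual scores).

```python
# AUROC of a continuous risk score against a binary label, plus Pearson correlation
# between two scoring systems.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=200)                    # MASLD present / absent (toy)
gpt4_score = y_true * 0.8 + rng.normal(0, 0.5, 200)      # hypothetical model-derived score
fli_score = gpt4_score * 0.7 + rng.normal(0, 0.4, 200)   # hypothetical correlated index

print(f"AUROC (GPT-4 score): {roc_auc_score(y_true, gpt4_score):.3f}")
r, p = pearsonr(gpt4_score, fli_score)
print(f"Pearson r vs FLI: {r:.2f} (p = {p:.1e})")
```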
Affiliation(s)
- Wanying Wu
- Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Department of Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Yuhu Guo
- Faculty of Science and Engineering, The University of Manchester, Manchester, UK
- Qi Li
- Department of Neurology, The First Affiliated Hospital of Hebei North University, Zhangjiakou, China
- Congzhuo Jia
- Department of Cardiology, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
- Department of Guangdong Provincial Key Laboratory of Coronary Heart Disease Prevention, Guangdong Cardiovascular Institute, Guangdong Provincial People's Hospital, Guangdong Academy of Medical Sciences, Guangzhou, China
12
Succi MD, Chang BS, Rao AS. Building the AI-Enabled Medical School of the Future. JAMA 2025:2832147. [PMID: 40163081 DOI: 10.1001/jama.2025.2789]
Abstract
This Viewpoint discusses preparing medical students to succeed in AI-integrated medical schools.
Affiliation(s)
- Marc D Succi
- Harvard Medical School, Boston, Massachusetts
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston
- Bernard S Chang
- Harvard Medical School, Boston, Massachusetts
- Beth Israel Deaconess Medical Center, Boston, Massachusetts
- Arya S Rao
- Harvard Medical School, Boston, Massachusetts
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston
13
Pavlik EJ, Land Woodward J, Lawton F, Swiecki-Sikora AL, Ramaiah DD, Rives TA. Artificial Intelligence in Relation to Accurate Information and Tasks in Gynecologic Oncology and Clinical Medicine-Dunning-Kruger Effects and Ultracrepidarianism. Diagnostics (Basel) 2025; 15:735. [PMID: 40150078 PMCID: PMC11941301 DOI: 10.3390/diagnostics15060735]
Abstract
Publications created in 2023-2024 on the application of artificial intelligence (AI) to many situations, including those in clinical medicine, are reviewed here. Because of the short time frame covered, it is not possible to conduct the exhaustive analysis that would be expected of a meta-analysis or systematic review. Consequently, this literature review presents a narrative examination of AI's application to contemporary topics in clinical medicine. The findings reviewed here span 254 papers published in 2024 that topically report on AI in medicine; 83 of these articles are considered in the present review because they contain evidence-based findings. In particular, the types of cases considered deal with AI accuracy in initial differential diagnoses, cancer treatment recommendations, board-style exams, and performance in various clinical tasks, including clinical imaging. Importantly, summaries of the validation techniques used to evaluate AI findings are presented. This review focuses on AIs whose clinical relevance is evidenced by application and evaluation in clinical publications. This relevance speaks to both what has been promised and what has been delivered by various AI systems. Readers will be able to understand when generative AI may be expressing views without having the necessary information (ultracrepidarianism) or responding as if it had expert knowledge when it does not. A lack of awareness that AIs may deliver inadequate or confabulated information can result in incorrect medical decisions and inappropriate clinical applications (Dunning-Kruger effect). As a result, in certain cases, a generative AI system might underperform and provide results that greatly overestimate any medical or clinical validity.
Affiliation(s)
- Edward J. Pavlik
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Chandler Medical Center-Markey Cancer Center, University of Kentucky College of Medicine, Lexington, KY 40536-0293, USA (T.A.R.)
- Jamie Land Woodward
- University of Kentucky College of Medicine, Lexington, KY 40536-0293, USA; (J.L.W.); (D.D.R.)
- Frank Lawton
- SE London Gynecological Cancer Centre, Emeritus Surgeon, London SE5 9RS, UK
- Allison L. Swiecki-Sikora
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Chandler Medical Center-Markey Cancer Center, University of Kentucky College of Medicine, Lexington, KY 40536-0293, USA (T.A.R.)
- Dharani D. Ramaiah
- University of Kentucky College of Medicine, Lexington, KY 40536-0293, USA; (J.L.W.); (D.D.R.)
- Taylor A. Rives
- Division of Gynecologic Oncology, Department of Obstetrics and Gynecology, Chandler Medical Center-Markey Cancer Center, University of Kentucky College of Medicine, Lexington, KY 40536-0293, USA (T.A.R.)
14
Kunze KN, Nwachukwu BU, Cote MP, Ramkumar PN. Large Language Models Applied to Health Care Tasks May Improve Clinical Efficiency, Value of Care Rendered, Research, and Medical Education. Arthroscopy 2025; 41:547-556. [PMID: 39694303 DOI: 10.1016/j.arthro.2024.12.010]
Abstract
Large language models (LLMs) are generative artificial intelligence models that create content on the basis of the data on which they were trained. Processing capabilities have evolved from text only to multimodal, including text, images, audio, and video. In health care settings, LLMs are being applied to several clinically important areas, including patient care and workflow efficiency, communications, hospital operations and data management, medical education, practice management, and health care research. Under the umbrella of patient care, several core use cases of LLMs include simplifying documentation tasks, enhancing patient communication (interactive and written language), conveying medical knowledge, and performing medical triage and diagnosis. However, LLMs warrant scrutiny when applied to health care tasks, as errors may have negative implications for health care outcomes, specifically in the context of perpetuating bias, ethical considerations, and cost-effectiveness. Customized LLMs developed for narrower purposes may help overcome certain performance limitations, transparency challenges, and biases present in contemporary generalized LLMs by curating training data. Methods of customizing LLMs broadly fall under 4 categories: prompt engineering, retrieval augmented generation, fine-tuning, and agentic augmentation, with each approach conferring different information-retrieval properties for the LLM. LEVEL OF EVIDENCE: Level V, expert opinion.
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.
- Benedict U Nwachukwu
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A
- Mark P Cote
- Department of Orthopaedic Surgery, Massachusetts General Hospital, Boston, Massachusetts, U.S.A
15
Nasef H, Patel H, Amin Q, Baum S, Ratnasekera A, Ang D, Havron WS, Nakayama D, Elkbuli A. Evaluating the Accuracy, Comprehensiveness, and Validity of ChatGPT Compared to Evidence-Based Sources Regarding Common Surgical Conditions: Surgeons' Perspectives. Am Surg 2025; 91:325-335. [PMID: 38794965 DOI: 10.1177/00031348241256075]
Abstract
Background This study aims to assess the accuracy, comprehensiveness, and validity of ChatGPT compared to evidence-based sources regarding the diagnosis and management of common surgical conditions by surveying the perceptions of U.S. board-certified practicing surgeons. Methods An anonymous cross-sectional survey was distributed to U.S. practicing surgeons from June 2023 to March 2024. The survey comprised 94 multiple-choice questions evaluating diagnostic and management information for five common surgical conditions from evidence-based sources or generated by ChatGPT. Statistical analysis included descriptive statistics and paired-sample t-tests. Results Participating surgeons were primarily aged 40-50 years (43%), male (86%), White (57%), and had 5-10 years or >15 years of experience (86%). The majority of surgeons had no prior experience with ChatGPT in surgical practice (86%). For material discussing both acute cholecystitis and upper gastrointestinal hemorrhage, evidence-based sources were rated as significantly more comprehensive (3.57 (±.535) vs 2.00 (±1.16), P = .025) (4.14 (±.69) vs 2.43 (±.98), P < .001) and valid (3.71 (±.488) vs 2.86 (±1.07), P = .045) (3.71 (±.76) vs 2.71 (±.95) P = .038) than ChatGPT. However, there was no significant difference in accuracy between the two sources (3.71 vs 3.29, P = .289) (3.57 vs 2.71, P = .111). Conclusion Surveyed U.S. board-certified practicing surgeons rated evidence-based sources as significantly more comprehensive and valid compared to ChatGPT across the majority of surveyed surgical conditions. However, there was no significant difference in accuracy between the sources across the majority of surveyed conditions. While ChatGPT may offer potential benefits in surgical practice, further refinement and validation are necessary to enhance its utility and acceptance among surgeons.
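The comparisons above rest on paired-sample t-tests, pairing each surgeon's rating of the evidence-based material with the same surgeon's rating of the ChatGPT material. A minimal sketch with hypothetical 1-5 ratings from seven respondents shows the test itself:

```python
# Paired-sample t-test on per-surgeon comprehensiveness ratings for one condition.
from scipy.stats import ttest_rel

evidence_based = [4, 3, 4, 4, 3, 4, 4]   # hypothetical ratings, one per surgeon
chatgpt        = [2, 1, 3, 2, 2, 3, 1]   # same surgeons rating the ChatGPT material
t_stat, p_value = ttest_rel(evidence_based, chatgpt)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```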
Affiliation(s)
- Hazem Nasef
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Heli Patel
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Quratulain Amin
- NOVA Southeastern University, Kiran Patel College of Allopathic Medicine, Fort Lauderdale, FL, USA
- Samuel Baum
- Louisiana State University Health Science Center, College of Medicine, New Orleans, LA, USA
- Darwin Ang
- Department of Surgery, Ocala Regional Medical Center, Ocala, FL, USA
- William S Havron
- Department of Surgical Education, Orlando Regional Medical Center, Orlando, FL, USA
- Department of Surgery, Division of Trauma and Surgical Critical Care, Orlando Regional Medical Center, Orlando, FL, USA
- Don Nakayama
- Mercer University School of Medicine, Columbus, GA, USA
- Adel Elkbuli
- Department of Surgical Education, Orlando Regional Medical Center, Orlando, FL, USA
- Department of Surgery, Division of Trauma and Surgical Critical Care, Orlando Regional Medical Center, Orlando, FL, USA
16
Yu A, Li A, Ahmed W, Saturno M, Cho SK. Evaluating Artificial Intelligence in Spinal Cord Injury Management: A Comparative Analysis of ChatGPT-4o and Google Gemini Against American College of Surgeons Best Practices Guidelines for Spine Injury. Global Spine J 2025:21925682251321837. [PMID: 39959933 PMCID: PMC11833805 DOI: 10.1177/21925682251321837]
Abstract
STUDY DESIGN Comparative Analysis. OBJECTIVES The American College of Surgeons developed the 2022 Best Practice Guidelines to provide evidence-based recommendations for managing spinal injuries. This study aims to assess the concordance of ChatGPT-4o and Gemini Advanced with the 2022 ACS Best Practice Guidelines, offering the first expert evaluation of these models in managing spinal cord injuries. METHODS The 2022 ACS Trauma Quality Program Best Practices Guidelines for Spine Injury were used to create 52 questions based on key clinical recommendations. These were grouped into informational (8), diagnostic (14), and treatment (30) categories and posed to ChatGPT-4o and Google Gemini Advanced. Responses were graded for concordance with ACS guidelines and validated by a board-certified spine surgeon. RESULTS ChatGPT was concordant with ACS guidelines on 38 of 52 questions (73.07%) and Gemini on 36 (69.23%). Most non-concordant answers were due to insufficient information. The models disagreed on 8 questions, with ChatGPT concordant in 5 and Gemini in 3. Both achieved 75% concordance on clinical information; Gemini outperformed on diagnostics (78.57% vs 71.43%), while ChatGPT had higher concordance on treatment questions (73.33% vs 63.33%). CONCLUSIONS ChatGPT-4o and Gemini Advanced demonstrate potential as valuable assets in spinal injury management by providing responses aligned with current best practices. The marginal differences in concordance rates suggest that neither model exhibits a superior ability to deliver recommendations concordant with validated clinical guidelines. Despite LLMs' increasing sophistication and utility, existing limitations currently prevent them from being clinically safe and practical in trauma-based settings.
Affiliation(s)
- Alexander Yu
- Department of Orthopaedics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Albert Li
- Department of Orthopaedics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Wasil Ahmed
- Department of Orthopaedics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Michael Saturno
- Department of Orthopaedics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Samuel K. Cho
- Department of Orthopaedics, Icahn School of Medicine at Mount Sinai, New York, NY, USA
17
Alessandro L, Crema S, Castiglione JI, Dossi D, Eberbach F, Kohler A, Laffue A, Marone A, Nagel V, Pastor Rueda JM, Varela F, Fernandez Slezak D, Rodríguez Murúa S, Debasa C, Claudio P, Farez MF. Validation of an Artificial Intelligence-Powered Virtual Assistant for Emergency Triage in Neurology. Neurologist 2025:00127893-990000000-00177. [PMID: 39912331 DOI: 10.1097/nrl.0000000000000594]
Abstract
OBJECTIVES Neurological emergencies pose significant challenges in medical care in resource-limited countries. Artificial intelligence (AI), particularly health chatbots, offers a promising solution. Rigorous validation is required to ensure safety and accuracy. Our objective is to evaluate the diagnostic safety and effectiveness of an AI-powered virtual assistant (VA) designed for the triage of neurological pathologies. METHODS The performance of an AI-powered VA for emergency neurological triage was tested. Ten patients over 18 years old with urgent neurological pathologies were selected. In the first stage, 9 neurologists assessed the safety of the VA using their clinical records. In the second stage, the assistant's accuracy when used by patients was evaluated. Finally, VA performance was compared with ChatGPT 3.5 and 4. RESULTS In stage 1, neurologists agreed with the VA in 98.5% of the cases for syndromic diagnosis, and in all cases, the definitive diagnosis was among the top 5 differentials. In stage 2, neurologists agreed with all diagnostic parameters and recommendations suggested by the assistant to patients. The average use time was 5.5 minutes (average of 16.5 questions). VA showed superiority over both versions of ChatGPT in all evaluated diagnostic and safety aspects (P<0.0001). In 57.8% of the evaluations, neurologists rated the VA as "excellent" (suggesting adequate utility). CONCLUSIONS In this study, the VA showcased promising diagnostic accuracy and user satisfaction, bolstering confidence in further development. These outcomes encourage proceeding to a comprehensive phase 1/2 trial with 100 patients to thoroughly assess its "real-time" application in emergency neurological triage.
Affiliation(s)
- Diego Fernandez Slezak
- Entelai
- Department of Computing, Faculty of Exact and Natural Sciences, University of Buenos Aires (UBA)
- Institute of Research in Computer Science (ICC), CONICET-UBA, Buenos Aires, Argentina
- Mauricio F Farez
- Center for Research in Neuroimmunological Diseases (CIEN), Fleni
- Entelai
18
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879]
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice, to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian, yielding 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
19
|
Guirguis PG, Youssef MP, Punreddy A, Botros M, Raiford M, McDowell S. Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients? Clin Orthop Relat Res 2025; 483:306-315. [PMID: 39330944 PMCID: PMC11753740 DOI: 10.1097/corr.0000000000003263] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/06/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
BACKGROUND Patients and caregivers may experience immense distress when receiving the diagnosis of a primary musculoskeletal malignancy and subsequently turn to internet resources for more information. It is not clear whether these resources, including Google and ChatGPT, offer patients information that is readable, a measure of how easy text is to understand. Since many patients turn to Google and artificial intelligence resources for healthcare information, we thought it was important to ascertain whether the information they find is readable and easy to understand. The objective of this study was to compare readability of Google search results and ChatGPT answers to frequently asked questions and assess whether these sources meet NIH recommendations for readability. QUESTIONS/PURPOSES (1) What is the readability of ChatGPT-3.5 as a source of patient information for the three most common primary bone malignancies compared with top online resources from Google search? (2) Do ChatGPT-3.5 responses and online resources meet NIH readability guidelines for patient education materials? METHODS This was a cross-sectional analysis of the 12 most common online questions about osteosarcoma, chondrosarcoma, and Ewing sarcoma. To be consistent with other studies of similar design that utilized national society frequently asked questions lists, questions were selected from the American Cancer Society and categorized based on content, including diagnosis, treatment, and recovery and prognosis. Google was queried using all 36 questions, and top responses were recorded. Author types, such as hospital systems, national health organizations, or independent researchers, were recorded. ChatGPT-3.5 was provided each question in independent queries without further prompting. Responses were assessed with validated reading indices to determine readability by grade level. An independent t-test was performed with significance set at p < 0.05. RESULTS Google (n = 36) and ChatGPT-3.5 (n = 36) answers were recorded, 12 for each of the three cancer types. Reading grade levels based on mean readability scores were 11.0 ± 2.9 and 16.1 ± 3.6, respectively. This corresponds to the eleventh grade reading level for Google and a fourth-year undergraduate student level for ChatGPT-3.5. Google answers were more readable across all individual indices, without differences in word count. No difference in readability was present across author type, question category, or cancer type. Of 72 total responses across both search modalities, none met NIH readability criteria at the sixth-grade level. CONCLUSION Google material was presented at a high school reading level, whereas ChatGPT-3.5 was at an undergraduate reading level. The readability of both resources was inadequate based on NIH recommendations. Improving readability is crucial for better patient understanding during cancer treatment. Physicians should assess patients' needs, offer them tailored materials, and guide them to reliable resources to prevent reliance on online information that is hard to understand. LEVEL OF EVIDENCE Level III, prognostic study.
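For readers unfamiliar with the readability indices cited above, the sketch below shows the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It is an illustration only; the syllable counter is a crude heuristic, not the validated tooling the study relied on.
```python
# Minimal sketch of the two readability formulas referenced above.
# The naive syllable counter is a rough heuristic; published studies
# typically rely on validated tools rather than this approximation.
import re

def count_syllables(word: str) -> int:
    """Very rough vowel-group syllable estimate."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)

    # Flesch Reading Ease: higher scores mean easier text
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    # Flesch-Kincaid Grade Level: approximate U.S. school grade
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    return fre, fkgl

fre, fkgl = readability(
    "Osteosarcoma is a bone cancer. It is usually treated with chemotherapy and surgery."
)
print(f"Reading ease: {fre:.1f}, grade level: {fkgl:.1f}")
```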
Collapse
Affiliation(s)
- Paul G. Guirguis
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
| | | | - Ankit Punreddy
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
| | - Mina Botros
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
| | - Mattie Raiford
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
| | - Susan McDowell
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
| |
Collapse
|
20
|
Osborne MR, Bailey ER. Me vs. the machine? Subjective evaluations of human- and AI-generated advice. Sci Rep 2025; 15:3980. [PMID: 39893236 PMCID: PMC11787321 DOI: 10.1038/s41598-025-86623-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2024] [Accepted: 01/13/2025] [Indexed: 02/04/2025] Open
Abstract
Artificial intelligence ("AI") has the potential to vastly improve human decision-making. In line with this, researchers have increasingly sought to understand how people view AI, often documenting skepticism and even outright aversion to these tools. In the present research, we complement these findings by documenting the performance of LLMs in the personal advice domain. In addition, we shift the focus in a new direction-exploring how interacting with AI tools, specifically large language models, impacts the user's view of themselves. In five preregistered experiments (N = 1,722), we explore evaluations of human- and ChatGPT-generated advice along three dimensions: quality, effectiveness, and authenticity. We find that ChatGPT produces superior advice relative to the average online participant even in a domain in which people strongly prefer human-generated advice (dating and relationships). We also document a bias against ChatGPT-generated advice which is present only when participants are aware the advice was generated by ChatGPT. Novel to the present investigation, we then explore how interacting with these tools impacts self-evaluations. We manipulate the order in which people interact with these tools relative to self-generation and find that generating advice before interacting with ChatGPT advice boosts the quality ratings of the ChatGPT advice. At the same time, interacting with ChatGPT-generated advice before self-generating advice decreases self-ratings of authenticity. Taken together, we document a bias towards AI in the context of personal advice. Further, we identify an important externality in the use of these tools-they can invoke social comparisons of me vs. the machine.
Collapse
Affiliation(s)
| | - Erica R Bailey
- U.C. Berkeley, Haas School of Business, Berkeley, United States
| |
Collapse
|
21
|
Young CC, Enichen E, Rivera C, Auger CA, Grant N, Rao A, Succi MD. Diagnostic Accuracy of a Custom Large Language Model on Rare Pediatric Disease Case Reports. Am J Med Genet A 2025; 197:e63878. [PMID: 39268988 DOI: 10.1002/ajmg.a.63878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2024] [Revised: 08/10/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024]
Abstract
Accurately diagnosing rare pediatric diseases frequently represents a clinical challenge due to their complex and unusual clinical presentations. Here, we explore the capabilities of three large language models (LLMs), GPT-4, Gemini Pro, and a custom-built LLM (GPT-4 integrated with the Human Phenotype Ontology [GPT-4 HPO]), by evaluating their diagnostic performance on 61 rare pediatric disease case reports. The performance of the LLMs was assessed for accuracy in identifying the specific diagnosis, listing the correct diagnosis among a differential list, and identifying the broad disease category. In addition, GPT-4 HPO was tested on 100 general pediatrics case reports previously assessed with other LLMs to further validate its performance. The results indicated that GPT-4 was able to predict the correct diagnosis with a diagnostic accuracy of 13.1%, whereas both GPT-4 HPO and Gemini Pro had diagnostic accuracies of 8.2%. Further, GPT-4 HPO showed improved performance compared with the other two LLMs in identifying the correct diagnosis among its differential list and the broad disease category. Although these findings underscore the potential of LLMs for diagnostic support, particularly when enhanced with domain-specific ontologies, they also stress the need for further improvement prior to integration into clinical practice.
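The three accuracy levels described above (exact diagnosis, diagnosis within the differential list, and broad disease category) reduce to simple proportions over the case set. The sketch below illustrates the tallying with invented case records; the field names and diagnoses are placeholders, not data from the study.
```python
# Hypothetical sketch of the three accuracy levels described above.
# Records and field names are invented for illustration only.
cases = [
    {"truth": "Kabuki syndrome", "category": "genetic",
     "prediction": "Kabuki syndrome",
     "differential": ["Kabuki syndrome", "CHARGE syndrome"],
     "predicted_category": "genetic"},
    {"truth": "Kawasaki disease", "category": "inflammatory",
     "prediction": "scarlet fever",
     "differential": ["scarlet fever", "Kawasaki disease"],
     "predicted_category": "infectious"},
]

def rate(flags):
    """Proportion of cases for which the flag is true."""
    return sum(flags) / len(flags)

exact = rate([c["prediction"].lower() == c["truth"].lower() for c in cases])
in_differential = rate([c["truth"].lower() in (d.lower() for d in c["differential"]) for c in cases])
category = rate([c["predicted_category"] == c["category"] for c in cases])
print(f"exact: {exact:.2f}, in differential: {in_differential:.2f}, category: {category:.2f}")
```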
Collapse
Affiliation(s)
- Cameron C Young
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Ellie Enichen
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Christian Rivera
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Corinne A Auger
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Nathan Grant
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Arya Rao
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Marc D Succi
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
- Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts, USA
| |
Collapse
|
22
|
Badia JM, Casanova-Portoles D, Membrilla E, Rubiés C, Pujol M, Sancho J. Evaluation of ChatGPT-4 for the detection of surgical site infections from electronic health records after colorectal surgery: A pilot diagnostic accuracy study. J Infect Public Health 2025; 18:102627. [PMID: 39740340 DOI: 10.1016/j.jiph.2024.102627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2024] [Revised: 11/29/2024] [Accepted: 12/16/2024] [Indexed: 01/02/2025] Open
Abstract
BACKGROUND Surveillance of surgical site infection (SSI) relies on manual methods that are time-consuming and prone to subjectivity. This study evaluates the diagnostic accuracy of ChatGPT for detecting SSI from electronic health records after colorectal surgery via comparison with the results of a nationwide surveillance programme. METHODS This pilot, retrospective, multicentre analysis included 122 patients who underwent colorectal surgery. Patient records were reviewed by both manual surveillance and ChatGPT, which was tasked with identifying SSI and categorizing them as superficial, deep, or organ-space infections. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. Receiver operating characteristic (ROC) curve analysis determined the model's diagnostic performance. RESULTS ChatGPT achieved a sensitivity of 100%, correctly identifying all SSIs detected by manual methods. The specificity was 54%, indicating the presence of false positives. The PPV was 67%, and the NPV was 100%. The area under the ROC curve was 0.77, indicating good overall accuracy for distinguishing between SSI and non-SSI cases. Minor differences in outcomes were observed between colon and rectal surgeries, as well as between the hospitals participating in the study. CONCLUSIONS ChatGPT shows high sensitivity and good overall accuracy for detecting SSI. It appears to be a useful tool for initial screening and for reducing manual review workload. The moderate specificity suggests a need for further refinement to reduce the rate of false positives. Integrating ChatGPT with electronic medical records, antibiotic consumption data, and imaging results for real-time analysis may further improve SSI surveillance. ClinicalTrials.gov Identifier: NCT06556017.
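The screening metrics reported above follow directly from a 2x2 confusion matrix. The sketch below uses invented counts chosen only to roughly mirror the reported proportions; they are not figures taken from the study.
```python
# Sketch of the screening metrics reported above, computed from a 2x2
# confusion matrix. Counts are invented for illustration.
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Example: a perfectly sensitive screen that over-calls some non-infections.
# fn = 0 gives sensitivity and NPV of 1.0; the false positives pull the
# specificity and PPV down, as described in the abstract.
print(screening_metrics(tp=50, fp=25, fn=0, tn=30))
```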
Collapse
Affiliation(s)
- Josep M Badia
- Department of Surgery, Hospital General de Granollers, Granollers, Spain; Universitat Internacional de Catalunya. Sant Cugat del Vallès, Barcelona, Spain.
| | - Daniel Casanova-Portoles
- Department of Surgery, Hospital General de Granollers, Granollers, Spain; Universitat Internacional de Catalunya. Sant Cugat del Vallès, Barcelona, Spain.
| | | | - Carles Rubiés
- Department of Digital Transformation, Hospital General de Granollers, Granollers, Spain.
| | - Miquel Pujol
- VINCat Program, Servei Català de la Salut, Barcelona, Catalonia, Spain; Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain; Department of Infectious Diseases, Hospital Universitari de Bellvitge - IDIBELL, L'Hospitalet de Llobregat, Spain.
| | - Joan Sancho
- Department of Surgery, Hospital del Mar, Barcelona, Spain.
| |
Collapse
|
23
|
Tangsrivimol JA, Darzidehkalani E, Virk HUH, Wang Z, Egger J, Wang M, Hacking S, Glicksberg BS, Strauss M, Krittanawong C. Benefits, limits, and risks of ChatGPT in medicine. Front Artif Intell 2025; 8:1518049. [PMID: 39949509 PMCID: PMC11821943 DOI: 10.3389/frai.2025.1518049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2024] [Accepted: 01/15/2025] [Indexed: 02/16/2025] Open
Abstract
ChatGPT represents a transformative technology in healthcare, with demonstrated impacts across clinical practice, medical education, and research. Studies show significant efficiency gains, including a 70% reduction in administrative time for discharge summaries and achievement of medical professional-level performance on standardized tests (60% accuracy on USMLE, 78.2% on PubMedQA). In medical education, ChatGPT offers personalized learning platforms, automated scoring, and instant access to vast medical knowledge, addressing resource limitations and enhancing training efficiency. It streamlines clinical workflows by supporting triage processes, generating discharge summaries, and alleviating administrative burdens, allowing healthcare professionals to focus more on patient care. Additionally, ChatGPT facilitates remote monitoring and chronic disease management, providing personalized advice, medication reminders, and emotional support, thus bridging gaps between clinical visits. Its ability to process and synthesize vast amounts of data accelerates research workflows, aiding in literature reviews, hypothesis generation, and clinical trial design. This paper aims to gather and analyze published studies involving ChatGPT, focusing on its advantages and disadvantages within the healthcare context. To aid understanding and progress, our analysis is organized into six key areas: (1) Information and Education, (2) Triage and Symptom Assessment, (3) Remote Monitoring and Support, (4) Mental Healthcare Assistance, (5) Research and Decision Support, and (6) Language Translation. Realizing ChatGPT's full potential in healthcare requires addressing key limitations, such as its lack of clinical experience, inability to process visual data, and absence of emotional intelligence. Ethical, privacy, and regulatory challenges further complicate its integration. Future improvements should focus on enhancing accuracy, developing multimodal AI models, improving empathy through sentiment analysis, and safeguarding against artificial hallucination. While not a replacement for healthcare professionals, ChatGPT can serve as a powerful assistant, augmenting their expertise to improve efficiency, accessibility, and quality of care. This collaboration ensures responsible adoption of AI in transforming healthcare delivery. While ChatGPT demonstrates significant potential in healthcare transformation, systematic evaluation of its implementation across different healthcare settings reveals varying levels of evidence quality, from robust randomized trials in medical education to preliminary observational studies in clinical practice. This heterogeneity in evidence quality necessitates a structured approach to future research and implementation.
Collapse
Affiliation(s)
- Jonathan A. Tangsrivimol
- Department of Neurosurgery, and Neuroscience, Weill Cornell Medicine, NewYork-Presbyterian Hospital, New York, NY, United States
- Department of Neurosurgery, Chulabhorn Hospital, Chulabhorn Royal Academy, Bangkok, Thailand
| | - Erfan Darzidehkalani
- MIT Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Hafeez Ul Hassan Virk
- Harrington Heart & Vascular Institute, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH, United States
| | - Zhen Wang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, United States
- Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Essen, Germany
| | - Michelle Wang
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, United States
| | - Sean Hacking
- Department of Pathology, NYU Grossman School of Medicine, New York, NY, United States
| | - Benjamin S. Glicksberg
- Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Markus Strauss
- Department of Cardiology I, Coronary and Peripheral Vascular Disease, Heart Failure Medicine, University Hospital Muenster, Muenster, Germany
- Department of Cardiology, Sector Preventive Medicine, Health Promotion, Faculty of Health, School of Medicine, University Witten/Herdecke, Hagen, Germany
| | - Chayakrit Krittanawong
- Cardiology Division, New York University Langone Health, New York University School of Medicine, New York, NY, United States
- HumanX, Delaware, DE, United States
| |
Collapse
|
24
|
Wang S, Wang Y, Jiang L, Chang Y, Zhang S, Zhao K, Chen L, Gao C. Assessing the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in managing lumbar disc herniation. Eur J Med Res 2025; 30:45. [PMID: 39844276 PMCID: PMC11753088 DOI: 10.1186/s40001-025-02296-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2024] [Accepted: 01/13/2025] [Indexed: 01/24/2025] Open
Abstract
PURPOSE This study evaluated and compared the clinical support capabilities of ChatGPT 4o and ChatGPT 4o mini in diagnosing and treating lumbar disc herniation (LDH) with radiculopathy. METHODS Twenty-one questions (across 5 categories) from the NASS Clinical Guidelines were input into ChatGPT 4o and ChatGPT 4o mini. Five orthopedic surgeons assessed their responses using a 5-point Likert scale for accuracy and completeness, and a 7-point scale for reliability. Flesch Reading Ease scores were calculated to assess readability. Additionally, ChatGPT 4o analyzed lumbar images from 53 patients, and its diagnostic agreement with orthopedic surgeons was compared using kappa values. RESULTS Both models demonstrated strong clinical support capabilities with no significant differences in accuracy or reliability. However, ChatGPT 4o provided more comprehensive and consistent responses. The Flesch Reading Ease scores for both models indicated that their generated content was "very difficult to read," potentially limiting patient accessibility. In evaluating lumbar disc herniation images, ChatGPT 4o achieved an overall accuracy of 0.81, with LDH recognition precision, recall, and F1 scores exceeding 0.80. The AUC was 0.80, and the kappa value was 0.61, indicating moderate agreement between the model's predictions and actual diagnoses, though with room for improvement. CONCLUSION While both models are effective, ChatGPT 4o offers more comprehensive clinical responses, making it more suitable for high-integrity medical tasks. However, the difficulty of reading AI-generated content and the occasional use of misleading terms, such as "tumor," indicate a need for further improvements to reduce patient anxiety.
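The kappa statistic quoted above is Cohen's chance-corrected agreement measure, where 0.61 is conventionally read as moderate agreement. A minimal sketch with invented labels is shown below; the ratings are placeholders, not the study's image reads.
```python
# Minimal sketch of Cohen's kappa, the chance-corrected agreement
# statistic quoted above. Labels below are invented for illustration.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Agreement expected by chance from each rater's label frequencies
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)

model_reads = ["LDH", "LDH", "normal", "LDH", "normal", "LDH"]
surgeon_reads = ["LDH", "normal", "normal", "LDH", "normal", "LDH"]
print(round(cohens_kappa(model_reads, surgeon_reads), 2))  # 0.67 on this toy data
```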
Collapse
Affiliation(s)
- Suning Wang
- Department of Orthopedics, The Second Hospital of Shandong University, Qilu Hospital of Shandong University, Shandong University, Jinan, 250000, China
| | - Ying Wang
- Shandong University, NO 44, Wenhuaxi Road, Jinan, 250012, China
| | - Linlin Jiang
- Department of Orthopedics, Qilu Hospital of Shandong University, The Second Hospital of Shandong University, Jinan, 250012, China
- Shandong University, NO 44, Wenhuaxi Road, Jinan, 250012, China
| | - Yong Chang
- Department of Orthopedics, Qilu Hospital of Shandong University, The Second Hospital of Shandong University, Jinan, 250012, China
- Shandong University, NO 44, Wenhuaxi Road, Jinan, 250012, China
| | - Shiji Zhang
- Department of Orthopedics, Qilu Hospital of Shandong University, The Second Hospital of Shandong University, Jinan, 250012, China
- Shandong University, NO 44, Wenhuaxi Road, Jinan, 250012, China
| | - Kun Zhao
- Department of Orthopedics, The Second Hospital of Shandong University, Qilu Hospital of Shandong University, Shandong University, Jinan, 250000, China.
| | - Lu Chen
- Department of Orthopedics, The Second Hospital of Shandong University, Qilu Hospital of Shandong University, Shandong University, Jinan, 250000, China.
- Department of Orthopedics, Qilu Hospital of Shandong University, The Second Hospital of Shandong University, Jinan, 250012, China.
| | - Chunzheng Gao
- Department of Orthopedics, The Second Hospital of Shandong University, Qilu Hospital of Shandong University, Shandong University, Jinan, 250000, China.
| |
Collapse
|
25
|
Sequí-Sabater JM, Benavent D. Artificial intelligence in rheumatology research: what is it good for? RMD Open 2025; 11:e004309. [PMID: 39778924 PMCID: PMC11748787 DOI: 10.1136/rmdopen-2024-004309] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2024] [Accepted: 12/08/2024] [Indexed: 01/11/2025] Open
Abstract
Artificial intelligence (AI) is transforming rheumatology research, with a myriad of studies aiming to improve diagnosis, prognosis and treatment prediction, while also showing potential capability to optimise the research workflow, improve drug discovery and clinical trials. Machine learning, a key element of discriminative AI, has demonstrated the ability to accurately classify rheumatic diseases and predict therapeutic outcomes by using diverse data types, including structured databases, imaging and text. In parallel, generative AI, driven by large language models, is becoming a powerful tool for optimising the research workflow by supporting content generation, literature review automation and clinical decision support. This review explores the current applications and future potential of both discriminative and generative AI in rheumatology. It also highlights the challenges posed by these technologies, such as ethical concerns and the need for rigorous validation and regulatory oversight. The integration of AI in rheumatology promises substantial advancements but requires a balanced approach to optimise benefits and minimise potential downsides.
Collapse
Affiliation(s)
- José Miguel Sequí-Sabater
- Rheumatology Department, La Ribera University Hospital, Alzira, Spain
- Rheumatology Deparment, La Fe University and Polytechnic Hospital, Valencia, Spain
- Division of Rheumatology, Department of Medicine Solna, Karolinska Institutet and Karolinska University Hospital, Stockholm, Sweden
| | - Diego Benavent
- Rheumatology Department, Hospital Universitari de Bellvitge, L'Hospitalet de Llobregat, Barcelona, Spain
| |
Collapse
|
26
|
Hurt RT, Stephenson CR, Gilman EA, Aakre CA, Croghan IT, Mundi MS, Ghosh K, Edakkanambeth Varayil J. The Use of an Artificial Intelligence Platform OpenEvidence to Augment Clinical Decision-Making for Primary Care Physicians. J Prim Care Community Health 2025; 16:21501319251332215. [PMID: 40238861 PMCID: PMC12033599 DOI: 10.1177/21501319251332215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2025] [Revised: 02/06/2025] [Accepted: 02/10/2025] [Indexed: 04/18/2025] Open
Abstract
BACKGROUND Artificial intelligence (AI) platforms can potentially enhance clinical decision-making (CDM) in primary care settings. OpenEvidence (OE), an AI tool, draws from trusted sources to generate evidence-based medicine (EBM) recommendations that address clinical questions. However, its effectiveness in real-world primary care cases remains unknown. OBJECTIVE To evaluate the performance of OE in providing EBM recommendations for five common chronic conditions in primary care: hypertension, hyperlipidemia, diabetes mellitus type 2, depression, and obesity. METHODS Five patient cases were retrospectively analyzed. Physicians posed specific clinical questions, and OE responses were evaluated on clarity, relevance, evidence support, impact on CDM, and overall satisfaction. Four independent physicians provided ratings on a scale from 0 (very unclear) to 4 (very clear). RESULTS OE provided accurate, evidence-based recommendations in all cases, aligning with physician plans. Mean scores across cases were clarity (3.55 ± 0.60), relevance (3.75 ± 0.44), support (3.35 ± 0.49), and satisfaction (3.60 ± 0.60). However, the impact on CDM was limited (1.95 ± 1.05), as OE primarily reinforced rather than modified plans. CONCLUSION OE was rated highly for clarity, relevance, and evidence-based support, reinforcing physician decisions in common chronic conditions. While the impact on CDM was minimal, owing in part to the study's retrospective design, OE shows promise in augmenting primary care physicians' decision-making. Prospective trials are needed to evaluate its utility in complex cases and multidisciplinary settings.
Collapse
|
27
|
Chen JS, Reddy AJ, Al-Sharif E, Shoji MK, Kalaw FGP, Eslani M, Lang PZ, Arya M, Koretz ZA, Bolo KA, Arnett JJ, Roginiel AC, Do JL, Robbins SL, Camp AS, Scott NL, Rudell JC, Weinreb RN, Baxter SL, Granet DB. Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist? OPHTHALMOLOGY SCIENCE 2025; 5:100600. [PMID: 39346575 PMCID: PMC11437840 DOI: 10.1016/j.xops.2024.100600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/20/2024] [Revised: 08/09/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024]
Abstract
Objective Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating its ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessment and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians versus ChatGPT. Design Cross-sectional mixed-methods study. Subjects Sixteen ophthalmologists from a single academic center, of which 10 were board-eligible and 6 were board-certified, were recruited to participate in this study. Methods Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed. Main Outcome Measures Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions. Results Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of nonuser-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to have more generic responses, irrelevant information, hallucinated more frequently, and had distinct syntactic patterns (all P < 0.01). Conclusions Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment. Financial Disclosures The author(s) have no proprietary or commercial interest in any materials discussed in this article.
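As a rough illustration of the style-conditioning prompt engineering the study describes, the sketch below shows one way such a prompt might be sent through the OpenAI chat API. The system prompt wording, model name, and case vignette are assumptions for illustration, not the authors' actual materials.
```python
# Hedged sketch of a style-conditioned case-discussion prompt.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY in the
# environment; the prompt text and vignette are invented placeholders.
from openai import OpenAI

client = OpenAI()

case_vignette = "A 67-year-old presents with sudden painless monocular vision loss..."

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": ("You are an academic ophthalmologist writing a teaching case "
                     "discussion: give an assessment, a ranked differential, and a "
                     "plan, in the concise style of a published case series.")},
        {"role": "user", "content": case_vignette},
    ],
)
print(response.choices[0].message.content)
```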
Collapse
Affiliation(s)
- Jimmy S. Chen
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Akshay J. Reddy
- School of Medicine, California University of Science and Medicine, Colton, California
| | - Eman Al-Sharif
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Surgery Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| | - Marissa K. Shoji
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Fritz Gerald P. Kalaw
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Medi Eslani
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Paul Z. Lang
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Malvika Arya
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Zachary A. Koretz
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Kyle A. Bolo
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Justin J. Arnett
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Aliya C. Roginiel
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Jiun L. Do
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Shira L. Robbins
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Andrew S. Camp
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Nathan L. Scott
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Jolene C. Rudell
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Robert N. Weinreb
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Sally L. Baxter
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - David B. Granet
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| |
Collapse
|
28
|
Nguyen D, Rao A, Mazumder A, Succi MD. Exploring the accuracy of embedded ChatGPT-4 and ChatGPT-4o in generating BI-RADS scores: a pilot study in radiologic clinical support. Clin Imaging 2025; 117:110335. [PMID: 39549561 DOI: 10.1016/j.clinimag.2024.110335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Revised: 10/17/2024] [Accepted: 10/24/2024] [Indexed: 11/18/2024]
Abstract
This study evaluates the accuracy of ChatGPT-4 and ChatGPT-4o in generating Breast Imaging Reporting and Data System (BI-RADS) scores from radiographic images. We tested both models using 77 breast cancer images from radiopaedia.org, including mammograms and ultrasounds. Images were analyzed in separate sessions to avoid bias. ChatGPT-4 and ChatGPT-4o achieved 66.2% accuracy across all BI-RADS cases. Performance was highest in BI-RADS 5 cases, with ChatGPT-4 and ChatGPT-4o scoring 84.4% and 88.9%, respectively. However, both models struggled with BI-RADS 1-3 cases, often assigning higher severity ratings. This study highlights the limitations of current LLMs in accurately grading these images and emphasizes the need for further research into these technologies before clinical integration.
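A per-category accuracy tabulation of the kind reported above can be computed as follows; the records here are invented placeholders, not the study's 77 cases.
```python
# Sketch of per-BI-RADS-category accuracy tabulation. Records are
# invented placeholders for illustration only.
from collections import defaultdict

cases = [
    {"true_birads": 5, "model_birads": 5},
    {"true_birads": 2, "model_birads": 4},  # over-called, as noted for BI-RADS 1-3
    {"true_birads": 5, "model_birads": 5},
    {"true_birads": 3, "model_birads": 4},
]

totals, correct = defaultdict(int), defaultdict(int)
for case in cases:
    totals[case["true_birads"]] += 1
    correct[case["true_birads"]] += int(case["model_birads"] == case["true_birads"])

for score in sorted(totals):
    print(f"BI-RADS {score}: {correct[score] / totals[score]:.0%} ({correct[score]}/{totals[score]})")
print(f"Overall: {sum(correct.values()) / sum(totals.values()):.0%}")
```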
Collapse
Affiliation(s)
- Dan Nguyen
- University of Massachusetts Chan Medical School, Worcester, MA, United States
| | - Arya Rao
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States; Harvard Medical School, Boston, MA, United States; Department of Radiology, Mass General Brigham, Boston, MA, United States
| | - Aneesh Mazumder
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| | - Marc D Succi
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States; Harvard Medical School, Boston, MA, United States; Department of Radiology, Mass General Brigham, Boston, MA, United States; Mass General Brigham Innovation, Mass General Brigham, Boston, MA, United States.
| |
Collapse
|
29
|
Van Meter AR, Wheaton MG, Cosgrove VE, Andreadis K, Robertson RE. The Goldilocks Zone: Finding the right balance of user and institutional risk for suicide-related generative AI queries. PLOS DIGITAL HEALTH 2025; 4:e0000711. [PMID: 39774367 PMCID: PMC11709298 DOI: 10.1371/journal.pdig.0000711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/06/2024] [Accepted: 11/25/2024] [Indexed: 01/11/2025]
Abstract
Generative artificial intelligence (genAI) has the potential to improve healthcare by reducing clinician burden and expanding services, among other uses. There is a significant gap between the need for mental health care and the availability of clinicians in the United States, which makes mental health care an attractive target for improved efficiency through genAI. Among the most sensitive mental health topics is suicide, and demand for crisis intervention has grown in recent years. We aimed to evaluate the quality of genAI tool responses to suicide-related queries. We entered 10 suicide-related queries into five genAI tools: ChatGPT 3.5, GPT-4, a version of GPT-4 safe for protected health information, Gemini, and Bing Copilot. The response to each query was coded on seven metrics, including the presence of a suicide hotline number, content related to evidence-based suicide interventions, supportive content, and harmful content. Pooling across tools, most of the responses (79%) were supportive. Only 24% of responses included a crisis hotline number and only 4% included content consistent with evidence-based suicide prevention interventions. Harmful content was rare (5%); all such instances were delivered by Bing Copilot. Our results suggest that genAI developers have taken a very conservative approach to suicide-related content and constrained their models' responses to suggest support-seeking, but little else. Finding a balance between providing much-needed evidence-based mental health information and not introducing excessive risk is within the capabilities of genAI developers. At this nascent stage of integrating genAI tools into healthcare systems, ensuring mental health parity should be the goal of genAI developers and healthcare organizations.
Collapse
Affiliation(s)
- Anna R. Van Meter
- Department of Child and Adolescent Psychiatry, NYU Grossman School of Medicine, New York, New York, United States of America
| | - Michael G. Wheaton
- Department of Psychology, Barnard College, New York, New York, United States of America
| | - Victoria E. Cosgrove
- Division of Child and Adolescent Psychiatry, Stanford University School of Medicine, Palo Alto, California, United States of America
| | - Katerina Andreadis
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, United States of America
| | - Ronald E. Robertson
- Stanford Internet Observatory, Stanford University, Stanford, California, United States of America
| |
Collapse
|
30
|
Chang Y, Yin JM, Li JM, Liu C, Cao LY, Lin SY. Applications and Future Prospects of Medical LLMs: A Survey Based on the M-KAT Conceptual Framework. J Med Syst 2024; 48:112. [PMID: 39725770 DOI: 10.1007/s10916-024-02132-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024]
Abstract
The success of large language models (LLMs) in general domains has sparked a wave of research into their applications in the medical field. However, enhancing the medical professionalism of these models remains a major challenge. This study proposed a novel model training theoretical framework, the M-KAT framework, which integrated domain-specific training methods for LLMs with the unique characteristics of the medical discipline. This framework aimed to improve the medical professionalism of the models from three perspectives: general knowledge acquisition, specialized skill development, and alignment with clinical thinking. This study summarized the outcomes of medical LLMs across four tasks: clinical diagnosis and treatment, medical question answering, medical research, and health management. Using the M-KAT framework, we analyzed how each training stage contributes to the models' professionalism. At the same time, some of the potential risks associated with medical LLMs can be addressed through pre-training, supervised fine-tuning (SFT), and model alignment built on these cultivated professional capabilities. Additionally, this study identified the main directions for future research on medical LLMs: advancing professional evaluation datasets and metrics tailored to the needs of medical tasks, conducting in-depth studies on medical multimodal large language models (MLLMs) capable of integrating diverse data types, and exploring the forms of medical agents and multi-agent frameworks that can interact with real healthcare environments and support clinical decision-making. It is hoped that this work can provide a reference for subsequent research.
Collapse
Affiliation(s)
- Ying Chang
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
| | - Jian-Ming Yin
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
| | - Jian-Min Li
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China
| | - Chang Liu
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China
- Breast Disease Specialist Hospital of Guangdong Provincial Hospital of Chinese Medicine, Guangdong Provincial Hospital of Chinese Medicine, Guangzhou, China
| | - Ling-Yong Cao
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China.
| | - Shu-Yuan Lin
- School of Basic Medical Sciences, Zhejiang Chinese Medical University, 548 Binwen Road, Binjiang District, Hangzhou, 310053, China.
- Gancao Doctor Chinese Medicine Artificial Intelligence Joint Engineering Center, Zhejiang Chinese Medical University, Zhejiang Chinese Medical University, Hangzhou, China.
| |
Collapse
|
31
|
Selim R, Basu A, Anto A, Foscht T, Eisingerich AB. Effects of Large Language Model-Based Offerings on the Well-Being of Students: Qualitative Study. JMIR Form Res 2024; 8:e64081. [PMID: 39729617 PMCID: PMC11724218 DOI: 10.2196/64081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2024] [Revised: 08/21/2024] [Accepted: 11/20/2024] [Indexed: 12/29/2024] Open
Abstract
BACKGROUND In recent years, the adoption of large language model (LLM) applications, such as ChatGPT, has seen a significant surge, particularly among students. These artificial intelligence-driven tools offer unprecedented access to information and conversational assistance, which is reshaping the way students engage with academic content and manage the learning process. Despite the growing prevalence of LLMs and reliance on these technologies, there remains a notable gap in qualitative in-depth research examining the emotional and psychological effects of LLMs on users' mental well-being. OBJECTIVE In order to address these emerging and critical issues, this study explores the role of LLM-based offerings, such as ChatGPT, in students' lives, namely, how postgraduate students use such offerings and how they make students feel, and examines the impact on students' well-being. METHODS To address the aims of this study, we employed an exploratory approach, using in-depth, semistructured, qualitative, face-to-face interviews with 23 users (13 female and 10 male users; mean age 23 years, SD 1.55 years) of ChatGPT-4o, who were also university students at the time (inclusion criteria). Interviewees were invited to reflect upon how they use ChatGPT, how it makes them feel, and how it may influence their lives. RESULTS The current findings from the exploratory qualitative interviews showed that users appreciate the functional support (8/23, 35%), escapism (8/23, 35%), and fantasy fulfillment (7/23, 30%) they receive from LLM-based offerings, such as ChatGPT, but at the same time, such usage is seen as a "double-edged sword," with respondents indicating anxiety (8/23, 35%), dependence (11/23, 48%), concerns about deskilling (12/23, 52%), and angst or pessimism about the future (11/23, 48%). CONCLUSIONS This study employed exploratory in-depth interviews to examine how the usage of LLM-based offerings, such as ChatGPT, makes users feel and assess the effects of using LLM-based offerings on mental well-being. The findings of this study show that students used ChatGPT to make their lives easier and felt a sense of cognitive escapism and even fantasy fulfillment, but this came at the cost of feeling anxious and pessimistic about the future.
Collapse
Affiliation(s)
- Rania Selim
- Faculty of Medicine, Imperial College London, London, United Kingdom
| | - Arunima Basu
- Faculty of Medicine, Imperial College London, London, United Kingdom
| | - Ailin Anto
- Faculty of Medicine, Imperial College London, London, United Kingdom
| | - Thomas Foscht
- Department of Marketing, University of Graz, Styria, Austria
| | | |
Collapse
|
32
|
Ramasubramanian S, Balaji S, Kannan T, Jeyaraman N, Sharma S, Migliorini F, Balasubramaniam S, Jeyaraman M. Comparative evaluation of artificial intelligence systems' accuracy in providing medical drug dosages: A methodological study. World J Methodol 2024; 14:92802. [PMID: 39712564 PMCID: PMC11287534 DOI: 10.5662/wjm.v14.i4.92802] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/06/2024] [Revised: 05/29/2024] [Accepted: 06/25/2024] [Indexed: 07/26/2024] Open
Abstract
BACKGROUND Medication errors, especially in dosage calculation, pose risks in healthcare. Artificial intelligence (AI) systems like ChatGPT and Google Bard may help reduce errors, but their accuracy in providing medication information remains to be evaluated. AIM To evaluate the accuracy of AI systems (ChatGPT 3.5, ChatGPT 4, Google Bard) in providing drug dosage information per Harrison's Principles of Internal Medicine. METHODS A set of natural language queries mimicking real-world medical dosage inquiries was presented to the AI systems. Responses were analyzed using a 3-point Likert scale. The analysis, conducted with Python and its libraries, focused on basic statistics, overall system accuracy, and disease-specific and organ system accuracies. RESULTS ChatGPT 4 outperformed the other systems, showing the highest rate of correct responses (83.77%) and the best overall weighted accuracy (0.6775). Disease-specific accuracy varied notably across systems, with some diseases being accurately recognized, while others demonstrated significant discrepancies. Organ system accuracy also showed variable results, underscoring system-specific strengths and weaknesses. CONCLUSION ChatGPT 4 demonstrates superior reliability in medical dosage information, yet variations across diseases emphasize the need for ongoing improvements. These results highlight AI's potential in aiding healthcare professionals, urging continuous development for dependable accuracy in critical medical situations.
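One plausible way to turn 3-point Likert ratings into the kind of weighted accuracy reported above is sketched below. The 1.0/0.5/0.0 weighting is an assumption for illustration and may differ from the paper's exact scheme.
```python
# Hypothetical weighted-accuracy calculation from 3-point Likert ratings.
# The weight mapping is an assumption, not the paper's documented scheme.
LIKERT_WEIGHTS = {"correct": 1.0, "partially correct": 0.5, "incorrect": 0.0}

def weighted_accuracy(ratings):
    """Average the weights assigned to each rated response."""
    return sum(LIKERT_WEIGHTS[r] for r in ratings) / len(ratings)

ratings = ["correct", "correct", "partially correct", "incorrect", "correct"]
print(weighted_accuracy(ratings))  # 0.7 on this toy data
```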
Collapse
Affiliation(s)
- Swaminathan Ramasubramanian
- Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
| | - Sangeetha Balaji
- Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
| | - Tejashri Kannan
- Department of Orthopaedics, Government Medical College, Omandurar Government Estate, Chennai 600002, Tamil Nadu, India
| | - Naveen Jeyaraman
- Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
| | - Shilpa Sharma
- Department of Paediatric Surgery, All India Institute of Medical Sciences, New Delhi 110029, India
| | - Filippo Migliorini
- Department of Life Sciences, Health, Link Campus University, Rome 00165, Italy
- Department of Orthopaedic and Trauma Surgery, Academic Hospital of Bolzano (SABES-ASDAA), Teaching Hospital of the Paracelsus Medical University, Bolzano 39100, Italy
| | - Suhasini Balasubramaniam
- Department of Radio-Diagnosis, Government Stanley Medical College and Hospital, Chennai 600001, Tamil Nadu, India
| | - Madhan Jeyaraman
- Department of Orthopaedics, ACS Medical College and Hospital, Dr MGR Educational and Research Institute, Chennai 600077, Tamil Nadu, India
| |
Collapse
|
33
|
Succi MD, Rao AS. Beyond the AJR: Towards Large Language Models for Radiology Decision-Making in the Emergency Department. AJR Am J Roentgenol 2024. [PMID: 39660831 DOI: 10.2214/ajr.24.32465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2024]
Affiliation(s)
- Marc D Succi
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
- Innovation Office, Mass General Brigham, Boston, MA, United States
- Enterprise Radiology, Mass General Brigham, Boston, MA, United States
| | - Arya S Rao
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| |
Collapse
|
34
|
Jin HK, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study. JMIR MEDICAL EDUCATION 2024; 10:e57451. [PMID: 39630413 PMCID: PMC11633516 DOI: 10.2196/57451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Revised: 08/28/2024] [Accepted: 10/09/2024] [Indexed: 12/13/2024]
Abstract
Background ChatGPT, a recently developed artificial intelligence chatbot and a notable large language model, has demonstrated improved performance on medical field examinations. However, there is currently little research on its efficacy in languages other than English or in pharmacy-related examinations. Objective This study aimed to evaluate the performance of GPT models on the Korean Pharmacist Licensing Examination (KPLE). Methods We evaluated the percentage of correct answers provided by 2 different versions of ChatGPT (GPT-3.5 and GPT-4) for all multiple-choice single-answer KPLE questions, excluding image-based questions. In total, 320, 317, and 323 questions from the 2021, 2022, and 2023 KPLEs, respectively, were included in the final analysis, which consisted of 4 units: Biopharmacy, Industrial Pharmacy, Clinical and Practical Pharmacy, and Medical Health Legislation. Results The 3-year average percentage of correct answers was 86.5% (830/960) for GPT-4 and 60.7% (583/960) for GPT-3.5. GPT model accuracy was highest in Biopharmacy (GPT-3.5 77/96, 80.2% in 2022; GPT-4 87/90, 96.7% in 2021) and lowest in Medical Health Legislation (GPT-3.5 8/20, 40% in 2022; GPT-4 12/20, 60% in 2022). Additionally, when comparing the performance of artificial intelligence with that of human participants, pharmacy students outperformed GPT-3.5 but not GPT-4. Conclusions In the last 3 years, GPT models have performed very close to or exceeded the passing threshold for the KPLE. This study demonstrates the potential of large language models in the pharmacy domain; however, extensive research is needed to evaluate their reliability and ensure their secure application in pharmacy contexts due to several inherent challenges. Addressing these limitations could make GPT models more effective auxiliary tools for pharmacy education.
Collapse
Affiliation(s)
- Hye Kyung Jin
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea
| | - EunYoung Kim
- Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea
- Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea
- Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-Ro, Dongjak-gu, Seoul, 06974, Republic of Korea, 82 2-820-5791, 82 2-816-7338
35
El Gharib K, Jundi B, Furfaro D, Abdulnour REE. AI-assisted human clinical reasoning in the ICU: beyond "to err is human". Front Artif Intell 2024; 7:1506676. [PMID: 39712469 PMCID: PMC11659639 DOI: 10.3389/frai.2024.1506676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2024] [Accepted: 11/19/2024] [Indexed: 12/24/2024] Open
Abstract
Diagnostic errors pose a significant public health challenge, affecting nearly 800,000 Americans annually, with even higher rates globally. In the ICU, these errors are particularly prevalent, leading to substantial morbidity and mortality. The clinical reasoning process aims to reduce diagnostic uncertainty and establish a plausible differential diagnosis but is often hindered by cognitive load, patient complexity, and clinician burnout. These factors contribute to cognitive biases that compromise diagnostic accuracy. Emerging technologies like large language models (LLMs) offer potential solutions to enhance clinical reasoning and improve diagnostic precision. In this perspective article, we explore the roles of LLMs, such as GPT-4, in addressing diagnostic challenges in critical care settings through a case study of a critically ill patient managed with LLM assistance.
Affiliation(s)
- Khalil El Gharib
- Division of Pulmonary and Critical Care Medicine, Rutgers Robert Wood Johnson Medical School, New Brunswick, NJ, United States
| | - Bakr Jundi
- Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
| | - David Furfaro
- Division of Pulmonary and Critical Care Medicine, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, United States
| | - Raja-Elie E. Abdulnour
- Division of Pulmonary and Critical Care Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, United States
36
Abou Karam G. Revolutionizing Medical Education: ChatGPT3.5 Ability to Behave as a Virtual Patient. MEDICAL SCIENCE EDUCATOR 2024; 34:1559-1564. [PMID: 39758479 PMCID: PMC11699164 DOI: 10.1007/s40670-024-02121-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 07/16/2024] [Indexed: 01/07/2025]
Abstract
ChatGPT3.5 is a promising tool for medical education. It provides an affordable, accessible platform for medical students to practice their clinical skills across various medical scenarios. Additionally, it complements rather than replaces actual patient-doctor interactions, because it cannot perform a real physical examination or convey emotional and non-verbal communication. Finally, effective use of ChatGPT requires careful prompt crafting and clinician guidance to navigate its limitations. Supplementary Information The online version contains supplementary material available at 10.1007/s40670-024-02121-w.
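The paper itself does not reproduce its prompts, so purely as an illustration of the kind of prompt crafting described above, a minimal virtual-patient setup might look like the sketch below; the case details, rules, and model choice are assumptions for the example, and it assumes the v1-style openai Python client.

```python
# Hypothetical illustration only: a system prompt that asks ChatGPT to role-play
# a standardized patient, in the spirit of the prompt crafting described above.
# The case details below are invented for the example, not taken from the paper.
from openai import OpenAI  # assumes the v1-style OpenAI Python client

VIRTUAL_PATIENT_PROMPT = """You are playing a standardized patient for a medical student.
Case (fictional): 58-year-old with 2 hours of crushing chest pain radiating to the left arm.
Rules:
- Answer only what the student asks; do not volunteer the diagnosis.
- Stay in character; express worry the way a real patient might.
- If asked for physical exam or test results, reply that they must be requested explicitly.
- After the student commits to a diagnosis, break character and give brief feedback."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system", "content": VIRTUAL_PATIENT_PROMPT}]

def student_turn(question: str) -> str:
    """Send one history-taking question and return the simulated patient's reply."""
    history.append({"role": "user", "content": question})
    reply = client.chat.completions.create(model="gpt-3.5-turbo", messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(student_turn("What brings you in today?"))
```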
Affiliation(s)
- Gaby Abou Karam
- Department of Radiology and Biomedical Imaging, Yale School of Medicine, 333 Cedar St, New Haven, CT 06510 USA
37
Munir F, Gehres A, Wai D, Song L. Evaluation of ChatGPT as a Tool for Answering Clinical Questions in Pharmacy Practice. J Pharm Pract 2024; 37:1303-1310. [PMID: 38775367 DOI: 10.1177/08971900241256731] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/12/2024]
Abstract
Background: In the healthcare field, there has been a growing interest in using artificial intelligence (AI)-powered tools to assist healthcare professionals, including pharmacists, in their daily tasks. Objectives: To provide commentary and insight into the potential for generative AI language models such as ChatGPT as a tool for answering practice-based, clinical questions and the challenges that need to be addressed before implementation in pharmacy practice settings. Methods: To assess ChatGPT, pharmacy-based questions were prompted to ChatGPT (Version 3.5; free version) and responses were recorded. Question types included 6 drug information questions, 6 enhanced prompt drug information questions, 5 patient case questions, 5 calculations questions, and 10 drug knowledge questions (e.g., top 200 drugs). After all responses were collected, ChatGPT responses were assessed for appropriateness. Results: ChatGPT responses were generated from 32 questions in 5 categories and evaluated on a total of 44 possible points. Among all ChatGPT responses and categories, the overall score was 21 of 44 points (47.73%). ChatGPT scored higher in pharmacy calculation (100%), drug information (83%), and top 200 drugs (80%) categories and lower in drug information enhanced prompt (33%) and patient case (20%) categories. Conclusion: This study suggests that ChatGPT has limited success as a tool to answer pharmacy-based questions. ChatGPT scored higher in calculation and multiple-choice questions but scored lower in drug information and patient case questions, generating misleading or fictional answers and citations.
Affiliation(s)
- Faria Munir
- University of Illinois Chicago College of Pharmacy, Chicago, IL, USA
| | - Anna Gehres
- College of Pharmacy, The Ohio State University, Columbus, OH, USA
| | - David Wai
- Department of Pharmacy, Ohio State University Wexner Medical Center, Columbus, OH, USA
| | - Leah Song
- University of Illinois Chicago College of Pharmacy, Chicago, IL, USA
38
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
39
Brügge E, Ricchizzi S, Arenbeck M, Keller MN, Schur L, Stummer W, Holling M, Lu MH, Darici D. Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial. BMC MEDICAL EDUCATION 2024; 24:1391. [PMID: 39609823 PMCID: PMC11605890 DOI: 10.1186/s12909-024-06399-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/18/2024] [Accepted: 11/25/2024] [Indexed: 11/30/2024]
Abstract
BACKGROUND Clinical decision-making (CDM) refers to physicians' ability to gather, evaluate, and interpret relevant diagnostic information. An integral component of CDM is the medical history conversation, traditionally practiced on real or simulated patients. In this study, we explored the potential of using large language models (LLMs) to simulate patient-doctor interactions and provide structured feedback. METHODS We developed AI prompts to simulate patients with different symptoms, engaging in realistic medical history conversations. In our double-blind randomized design, the control group participated in simulated medical history conversations with AI patients, while the intervention group, in addition to the simulated conversations, also received AI-generated feedback on their performance (feedback group). We examined the influence of this feedback on the students' CDM performance, which was evaluated by two raters (ICC = 0.924) using the Clinical Reasoning Indicator - History Taking Inventory (CRI-HTI). The data were analyzed using a repeated-measures ANOVA. RESULTS Our final sample included 21 medical students (mean age = 22.10 years, mean semester = 4, 14 females). At baseline, the feedback group (mean = 3.28 ± 0.09 [standard deviation]) and the control group (3.21 ± 0.08) achieved similar CRI-HTI scores, indicating successful randomization. After only four training sessions, the feedback group (3.60 ± 0.13) outperformed the control group (3.02 ± 0.12), F(1,18) = 4.44, p = .049, with a strong effect size (partial η² = 0.198). Specifically, the feedback group showed improvements in the CDM subdomains of creating context (p = .046) and securing information (p = .018), while their ability to focus questions did not improve significantly (p = .265). CONCLUSION The results suggest that AI-simulated medical history conversations can support CDM training, especially when combined with structured feedback. Such a training format may serve as a cost-effective supplement to existing training methods, better preparing students for real medical history conversations.
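For readers who want to compute a comparable inter-rater statistic on their own data, the sketch below implements a two-way random-effects, absolute-agreement, single-rater ICC (ICC(2,1)) in plain numpy; the abstract does not state which ICC form was used, and the example ratings are simulated placeholders rather than study data.

```python
# Minimal numpy sketch of ICC(2,1) (two-way random effects, absolute agreement,
# single rater) for an n_subjects x n_raters score matrix. The abstract reports
# ICC = 0.924 for two raters but not the ICC form, so this is one plausible choice;
# the simulated ratings below are placeholders, not study data.
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    n, k = ratings.shape                      # subjects, raters
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)          # per-subject means
    col_means = ratings.mean(axis=0)          # per-rater means

    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((ratings - grand) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

rng = np.random.default_rng(0)
true_score = rng.normal(3.3, 0.4, size=30)                        # 30 rated performances
ratings = np.column_stack([true_score + rng.normal(0, 0.15, 30)   # rater 1 and rater 2
                           for _ in range(2)])
print(f"ICC(2,1) = {icc2_1(ratings):.3f}")
```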
Affiliation(s)
- Emilia Brügge
- Connectome - Student Association for Neurosurgery, Neurology and Neurosciences, Berlin, Germany
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Sarah Ricchizzi
- Connectome - Student Association for Neurosurgery, Neurology and Neurosciences, Berlin, Germany
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Malin Arenbeck
- Connectome - Student Association for Neurosurgery, Neurology and Neurosciences, Berlin, Germany
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Marius Niklas Keller
- Connectome - Student Association for Neurosurgery, Neurology and Neurosciences, Berlin, Germany
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Lina Schur
- Connectome - Student Association for Neurosurgery, Neurology and Neurosciences, Berlin, Germany
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Walter Stummer
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Markus Holling
- Department of Neurosurgery, University Hospital of Münster, Münster, Germany
| | - Max Hao Lu
- Harvard Graduate School of Education, Cambridge, USA
| | - Dogus Darici
- Institute of Anatomy and Neurobiology, University of Münster, Vesaliusweg 2-4, 48149, Münster, Germany.
40
Ho CN, Tian T, Ayers AT, Aaron RE, Phillips V, Wolf RM, Mathioudakis N, Dai T, Klonoff DC. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Med Inform Decis Mak 2024; 24:357. [PMID: 39593074 PMCID: PMC11590327 DOI: 10.1186/s12911-024-02757-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2024] [Accepted: 11/08/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND Large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have shifted attention toward their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". CONCLUSIONS The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess the quality of LLM outputs should be developed to facilitate research studies on LLMs in healthcare.
Affiliation(s)
- Cindy N Ho
- Diabetes Technology Society, Burlingame, CA, USA
| | - Tiffany Tian
- Diabetes Technology Society, Burlingame, CA, USA
| | | | | | - Vidith Phillips
- School of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Risa M Wolf
- Division of Pediatric Endocrinology, The Johns Hopkins Hospital, Baltimore, MD, USA
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
| | | | - Tinglong Dai
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
- Carey Business School, Johns Hopkins University, Baltimore, MD, USA
- School of Nursing, Johns Hopkins University, Baltimore, MD, USA
| | - David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, 100 South San Mateo Drive, Room 1165, San Mateo, CA, 94401, USA.
41
Dagli MM, Ghenbot Y, Ahmad HS, Chauhan D, Turlip R, Wang P, Welch WC, Ozturk AK, Yoon JW. Development and validation of a novel AI framework using NLP with LLM integration for relevant clinical data extraction through automated chart review. Sci Rep 2024; 14:26783. [PMID: 39500759 PMCID: PMC11538412 DOI: 10.1038/s41598-024-77535-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2024] [Accepted: 10/23/2024] [Indexed: 11/08/2024] Open
Abstract
The accurate extraction of surgical data from electronic health records (EHRs), particularly from operative notes via manual chart review (MCR), is complex, crucial, and time-intensive, and it is limited by human error arising from fatigue and variable levels of training. This study aimed to develop and validate a novel Natural Language Processing (NLP) algorithm integrated with a Large Language Model (LLM; GPT-4 Turbo) to automate the extraction of spinal surgery data from EHRs. The algorithm employed a two-stage approach. Initially, a rule-based NLP framework reviewed and classified candidate segments from the text, preserving their reference segments. These segments were then verified in the second stage by the LLM. The primary outcomes of this study were the accurate extraction of surgical data, including the type of surgery, levels operated, number of disks removed, and presence of intraoperative incidental durotomies. Secondary objectives explored time efficiency, tokenization lengths, and costs. The performance of the algorithm was assessed across two validation databases using metrics such as accuracy, sensitivity, discrimination, F1-score, and precision, with 95% confidence intervals calculated using percentile-based bootstrapping. The NLP + LLM algorithm performed strongly across all of these metrics and demonstrated significant improvements in time and cost efficiency. These results suggest the potential for widespread adoption of this technology.
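The percentile-based bootstrapping mentioned above is straightforward to sketch; the snippet below shows one minimal version for an accuracy metric, with fabricated labels standing in for the study's validation data.

```python
# Minimal sketch of a percentile bootstrap 95% CI for accuracy, as named above.
# The prediction/label arrays are fabricated placeholders, not study data.
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)                            # e.g. durotomy present / absent
y_pred = np.where(rng.random(200) < 0.9, y_true, 1 - y_true)     # ~90% accurate extractor

def bootstrap_ci(y_true, y_pred, n_boot=10_000, alpha=0.05):
    n = len(y_true)
    accs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                         # resample cases with replacement
        accs[b] = np.mean(y_true[idx] == y_pred[idx])
    lo, hi = np.percentile(accs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return np.mean(y_true == y_pred), (lo, hi)

acc, (lo, hi) = bootstrap_ci(y_true, y_pred)
print(f"accuracy = {acc:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")
```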
Affiliation(s)
- Mert Marcel Dagli
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA.
| | - Yohannes Ghenbot
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Hasan S Ahmad
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Daksh Chauhan
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Ryan Turlip
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Patrick Wang
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - William C Welch
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Ali K Ozturk
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA
| | - Jang W Yoon
- Department of Neurosurgery, Perelman School of Medicine, University of Pennsylvania, 801 Spruce Street, Philadelphia, PA, 19107, USA.
42
Zhang C, Chen X. Letter to the editor, "Evaluating the accuracy of ChatGPT-4 in predicting ASA scores: A prospective multicentric study ChatGPT-4 in ASA score prediction". J Clin Anesth 2024; 98:111571. [PMID: 39180866 DOI: 10.1016/j.jclinane.2024.111571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2024] [Accepted: 07/29/2024] [Indexed: 08/27/2024]
Affiliation(s)
- Chenghong Zhang
- Department of Anesthesia, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China
| | - Xinzhong Chen
- Department of Anesthesia, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang, China.
43
Sudri K, Motro-Feingold I, Ramon-Gonen R, Barda N, Klang E, Fefer P, Amunts S, Attia ZI, Alkhouli M, Segev A, Cohen-Shelly M, Barbash IM. Enhancing Coronary Revascularization Decisions: The Promising Role of Large Language Models as a Decision-Support Tool for Multidisciplinary Heart Team. Circ Cardiovasc Interv 2024; 17:e014201. [PMID: 39502077 DOI: 10.1161/circinterventions.124.014201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/24/2024] [Accepted: 09/03/2024] [Indexed: 11/21/2024]
Abstract
BACKGROUND While clinical practice guidelines advocate for multidisciplinary heart team (MDHT) discussions in coronary revascularization, variability in implementation across health care settings remains a challenge. This variability could potentially be addressed by large language models such as ChatGPT, which may offer decision-making support in diverse health care environments. Our study aims to critically evaluate the concordance between recommendations made by an MDHT and those generated by large language models in coronary revascularization decision-making. METHODS From March 2023 to July 2023, consecutive coronary angiography cases (n=86) that were referred for revascularization (either percutaneous or surgical) were analyzed using both ChatGPT-3.5 and ChatGPT-4. Each case included demographics, medical background, a detailed description of angiographic findings, and SYNTAX scores (Synergy Between Percutaneous Coronary Intervention With Taxus and Cardiac Surgery; I and II), presented in 3 different formats. The recommendations of the models were compared with those of an MDHT. RESULTS ChatGPT-4 showed high concordance with decisions made by the MDHT (accuracy 0.82, sensitivity 0.8, specificity 0.83, and kappa 0.59), while ChatGPT-3.5 (0.67, 0.27, 0.84, and 0.12, respectively) showed lower concordance. Entropy and Fleiss kappa of ChatGPT-4 were 0.09 and 0.9, respectively, indicating high reliability and repeatability. The best correlation between ChatGPT-4 and the MDHT was achieved when clinical cases were presented in a detailed context. Specific subgroups of patients yielded high accuracy (>0.9) of ChatGPT-4, including those with left main disease, 3-vessel disease, and diabetes. CONCLUSIONS The present study demonstrates that advanced large language models such as ChatGPT-4 may be able to predict clinical recommendations for coronary artery disease revascularization with reasonable accuracy, especially in specific patient groups, underscoring their potential role as a supportive tool in clinical decision-making.
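To make the concordance statistics quoted above concrete, the sketch below computes accuracy, sensitivity, specificity, and Cohen's kappa between two sets of binary revascularization decisions; the label vectors are fabricated placeholders, and scikit-learn is assumed to be available.

```python
# Minimal sketch of the concordance statistics named above (accuracy, sensitivity,
# specificity, Cohen's kappa) between heart-team and model recommendations.
# The label vectors are fabricated placeholders (1 = CABG, 0 = PCI), not study data.
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

rng = np.random.default_rng(7)
heart_team = rng.integers(0, 2, size=86)                              # MDHT decision per case
model = np.where(rng.random(86) < 0.82, heart_team, 1 - heart_team)   # ~82% concordant model

tn, fp, fn, tp = confusion_matrix(heart_team, model, labels=[0, 1]).ravel()
print(f"accuracy    = {(tp + tn) / (tp + tn + fp + fn):.2f}")
print(f"sensitivity = {tp / (tp + fn):.2f}")                          # recall for CABG
print(f"specificity = {tn / (tn + fp):.2f}")
print(f"kappa       = {cohen_kappa_score(heart_team, model):.2f}")
```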
Affiliation(s)
- Karin Sudri
- ARC Innovation Center, Sagol Big Data and AI Hub (K.S., M.C.-S.), Sheba Medical Center, Tel Hashomer, Israel
| | - Iris Motro-Feingold
- Sheba Education Authority (I.M.-F.), Sheba Medical Center, Tel Hashomer, Israel
| | - Roni Ramon-Gonen
- The Graduate School of Business Administration (R.R.-G.), Bar-Ilan University, Ramat-Gan, Israel
- Data Science Institute (R.R.-G.), Bar-Ilan University, Ramat-Gan, Israel
| | - Noam Barda
- ARC Innovation Center (N.B.), Sheba Medical Center, Tel Hashomer, Israel
- Software and Information Systems Engineering (N.B.), Ben-Gurion University of the Negev, Be'er Sheva, Israel
- Epidemiology, Biostatistics and Community Health Sciences (N.B.), Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Eyal Klang
- The Division of Data Driven and Digital Medicine, Icahn School of Medicine at Mount Sinai, New York, NY (E.K.)
| | - Paul Fefer
- Interventional Cardiology Unit, Leviev Heart Institute (P.F., A.S., I.M.B.), Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel Aviv University, Israel (P.F., S.A., A.S., I.M.B.)
| | - Sergei Amunts
- Department of Cardiac Surgery, Leviev Cardiothoracic and Vascular Center (S.A.), Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel Aviv University, Israel (P.F., S.A., A.S., I.M.B.)
| | - Zachi Itzhak Attia
- Department of Cardiovascular Medicine (Z.I.A., M.A.), Mayo Clinic, Rochester, MN
- Department of Artificial Intelligence and Informatics (Z.I.A.), Mayo Clinic, Rochester, MN
| | - Mohamad Alkhouli
- Department of Cardiovascular Medicine (Z.I.A., M.A.), Mayo Clinic, Rochester, MN
| | - Amitai Segev
- Interventional Cardiology Unit, Leviev Heart Institute (P.F., A.S., I.M.B.), Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel Aviv University, Israel (P.F., S.A., A.S., I.M.B.)
| | - Michal Cohen-Shelly
- ARC Innovation Center, Sagol Big Data and AI Hub (K.S., M.C.-S.), Sheba Medical Center, Tel Hashomer, Israel
- The Olga and Lev Leviev Heart Center (M.C.-S.), Sheba Medical Center, Tel Hashomer, Israel
| | - Israel Moshe Barbash
- Interventional Cardiology Unit, Leviev Heart Institute (P.F., A.S., I.M.B.), Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel Aviv University, Israel (P.F., S.A., A.S., I.M.B.)
44
Vaira LA, Lechien JR, Abbate V, Allevi F, Audino G, Beltramini GA, Bergonzani M, Boscolo-Rizzo P, Califano G, Cammaroto G, Chiesa-Estomba CM, Committeri U, Crimi S, Curran NR, di Bello F, di Stadio A, Frosolini A, Gabriele G, Gengler IM, Lonardi F, Maglitto F, Mayo-Yáñez M, Petrocelli M, Pucci R, Saibene AM, Saponaro G, Tel A, Trabalzini F, Trecca EMC, Vellone V, Salzano G, De Riu G. Validation of the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool: a new tool to assess the quality of health information provided by AI platforms. Eur Arch Otorhinolaryngol 2024; 281:6123-6131. [PMID: 38703195 PMCID: PMC11512889 DOI: 10.1007/s00405-024-08710-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/03/2024] [Accepted: 04/27/2024] [Indexed: 05/06/2024]
Abstract
BACKGROUND The widespread diffusion of Artificial Intelligence (AI) platforms is revolutionizing how health-related information is disseminated, thereby highlighting the need for tools to evaluate the quality of such information. This study aimed to propose and validate the Quality Assessment of Medical Artificial Intelligence (QAMAI) tool, an instrument specifically designed to assess the quality of health information provided by AI platforms. METHODS The QAMAI tool was developed by a panel of experts following guidelines for the development of new questionnaires. A total of 30 responses from ChatGPT-4, addressing patient queries, theoretical questions, and clinical head and neck surgery scenarios, were assessed by 27 reviewers from 25 academic centers worldwide. Construct validity, internal consistency, inter-rater reliability, and test-retest reliability were assessed to validate the tool. RESULTS The validation was conducted on the basis of 792 assessments of the 30 responses given by ChatGPT-4. The results of the exploratory factor analysis revealed a unidimensional structure of the QAMAI, with a single factor comprising all the items that explained 51.1% of the variance, with factor loadings ranging from 0.449 to 0.856. Overall internal consistency was high (Cronbach's alpha = 0.837). The intraclass correlation coefficient was 0.983 (95% CI 0.973-0.991; F(29,542) = 68.3; p < 0.001), indicating excellent reliability. Test-retest reliability analysis revealed a moderate-to-strong correlation, with a Pearson's coefficient of 0.876 (95% CI 0.859-0.891; p < 0.001). CONCLUSIONS The QAMAI tool demonstrated significant reliability and validity in assessing the quality of health information provided by AI platforms. Such a tool might become particularly useful for physicians as patients increasingly seek medical information on AI platforms.
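As an illustration of the internal-consistency figure reported above, Cronbach's alpha can be computed from an assessments-by-items score matrix in a few lines of numpy; the simulated 5-point scores and the 6-item scale size below are assumptions for the example, not QAMAI data.

```python
# Minimal numpy sketch of Cronbach's alpha for an (n_assessments x n_items) score
# matrix, the internal-consistency statistic reported above (alpha = 0.837).
# The simulated 5-point scores and the 6-item scale size are illustrative only.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = assessments, columns = questionnaire items."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)          # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)      # variance of the summed scale
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
latent = rng.normal(0, 1, size=(200, 1))                              # one underlying factor
scores = np.clip(np.rint(3 + latent + rng.normal(0, 0.8, (200, 6))), 1, 5)
print(f"Cronbach's alpha = {cronbach_alpha(scores):.3f}")
```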
Affiliation(s)
- Luigi Angelo Vaira
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Viale San Pietro 43/B, 07100, Sassari, Italy.
- PhD School of Biomedical Science, Biomedical Sciences Department, University of Sassari, Sassari, Italy.
| | - Jerome R Lechien
- Department of Laryngology and Bronchoesophagology, EpiCURA Hospital, Mons School of Medicine, UMONS. Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otolaryngology-Head Neck Surgery, Elsan Polyclinic of Poitiers, Poitiers, France
| | - Vincenzo Abbate
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Fabiana Allevi
- Maxillofacial Surgery Department, ASSt Santi Paolo e Carlo, University of Milan, Milan, Italy
| | - Giovanni Audino
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Giada Anna Beltramini
- Department of Biomedical, Surgical and Dental Sciences, University of Milan, Milan, Italy
- Maxillofacial and Dental Unit, Fondazione IRCCS Cà Granda Ospedale Maggiore Policlinico, Milan, Italy
| | - Michela Bergonzani
- Maxillo-Facial Surgery Division, Head and Neck Department, University Hospital of Parma, Parma, USA
| | - Paolo Boscolo-Rizzo
- Department of Medical, Surgical and Health Sciences, Section of Otolaryngology, University of Trieste, Trieste, Italy
| | - Gianluigi Califano
- Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Giovanni Cammaroto
- ENT Department, Morgagni Pierantoni Hospital, AUSL Romagna, Forlì, Italy
| | - Carlos M Chiesa-Estomba
- Department of Otorhinolaryngology-Head and Neck Surgery, Hospital Universitario Donostia, San Sebastian, Spain
| | - Umberto Committeri
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Salvatore Crimi
- Operative Unit of Maxillofacial Surgery, Policlinico San Marco, University of Catania, Catania, Italy
| | - Nicholas R Curran
- Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati Medical Center, Cincinnati, OH, USA
| | - Francesco di Bello
- Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Arianna di Stadio
- Otolaryngology Unit, GF Ingrassia Department, University of Catania, Catania, Italy
| | - Andrea Frosolini
- Department of Maxillofacial Surgery, University of Siena, Siena, Italy
| | - Guido Gabriele
- Department of Maxillofacial Surgery, University of Siena, Siena, Italy
| | - Isabelle M Gengler
- Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati Medical Center, Cincinnati, OH, USA
| | - Fabio Lonardi
- Department of Maxillofacial Surgery, University of Verona, Verona, Italy
| | - Fabio Maglitto
- Maxillo-Facial Surgery Unit, University of Bari "Aldo Moro", Bari, Italy
| | - Miguel Mayo-Yáñez
- Otorhinolaryngology, Head and Neck Surgery Department, Complexo Hospitalario Universitario A Coruña (CHUAC), A Coruña, Galicia, Spain
| | - Marzia Petrocelli
- Maxillofacial Surgery Operative Unit, Bellaria and Maggiore Hospital, Bologna, Italy
| | - Resi Pucci
- Maxillofacial Surgery Unit, San Camillo-Forlanini Hospital, Rome, Italy
| | - Alberto Maria Saibene
- Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, University of Milan, Milan, Italy
| | - Gianmarco Saponaro
- Maxillo-Facial Surgery Unit, IRCSS "A. Gemelli" Foundation-Catholic University of the Sacred Heart, Rome, Italy
| | - Alessandro Tel
- Clinic of Maxillofacial Surgery, Department of Head and Neck Surgery and Neuroscience, University Hospital of Udine, Udine, Italy
| | - Franco Trabalzini
- Department of Otorhinolaryngology, Head and Neck Surgery, Meyer Children's Hospital, Florence, Italy
| | - Eleonora M C Trecca
- Department of Otorhinolaryngology and Maxillofacial Surgery, IRCCS Hospital Casa Sollievo Della Sofferenza, San Giovanni Rotondo, Foggia, Italy
- Department of Otorhinolaryngology, University Hospital of Foggia, Foggia, Italy
| | | | - Giovanni Salzano
- Head and Neck Section, Department of Neurosciences, Reproductive and Odontostomatological Science, Federico II University of Naples, Naples, Italy
| | - Giacomo De Riu
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Viale San Pietro 43/B, 07100, Sassari, Italy
45
Hirosawa T, Shimizu T. The potential, limitations, and future of diagnostics enhanced by generative artificial intelligence. Diagnosis (Berl) 2024; 11:446-449. [PMID: 38987215 DOI: 10.1515/dx-2024-0095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Accepted: 06/06/2024] [Indexed: 07/12/2024]
Abstract
OBJECTIVES This short communication explores the potential, limitations, and future directions of generative artificial intelligence (GAI) in enhancing diagnostics. METHODS This commentary reviews current applications and advancements in GAI, particularly focusing on its integration into medical diagnostics. It examines the role of GAI in supporting medical interviews, assisting in differential diagnosis, and aiding clinical reasoning through the lens of dual-process theory. The discussion is supported by recent examples and theoretical frameworks to illustrate the practical and potential uses of GAI in medicine. RESULTS GAI shows significant promise in enhancing diagnostic processes by supporting the translation of patient descriptions into visual formats, providing differential diagnoses, and facilitating complex clinical reasoning. However, limitations such as the potential for generating medical misinformation, known as hallucinations, exist. Furthermore, the commentary highlights the integration of GAI with both intuitive and analytical decision-making processes in clinical diagnostics, demonstrating potential improvements in both the speed and accuracy of diagnoses. CONCLUSIONS While GAI presents transformative potential for medical diagnostics, it also introduces risks that must be carefully managed. Future advancements should focus on refining GAI technologies to better align with human diagnostic reasoning, ensuring GAI enhances rather than replaces the medical professionals' expertise.
Affiliation(s)
- Takanobu Hirosawa
- Department of Diagnostic and Generalist Medicine, 12756 Dokkyo Medical University , Tochigi, Japan
| | - Taro Shimizu
- Department of Diagnostic and Generalist Medicine, 12756 Dokkyo Medical University , Tochigi, Japan
46
Reading Turchioe M, Kisselev S, Van Bulck L, Bakken S. Increasing Generative Artificial Intelligence Competency among Students Enrolled in Doctoral Nursing Research Coursework. Appl Clin Inform 2024; 15:842-851. [PMID: 39053615 PMCID: PMC11483171 DOI: 10.1055/a-2373-3151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2024] [Accepted: 07/24/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Generative artificial intelligence (AI) tools may soon be integrated into health care practice and research. Nurses in leadership roles, many of whom are doctorally prepared, will need to determine whether and how to integrate them in a safe and useful way. OBJECTIVE This study aimed to develop and evaluate a brief intervention to increase PhD nursing students' knowledge of appropriate applications for using generative AI tools in health care. METHODS We created didactic lectures and laboratory-based activities to introduce generative AI to students enrolled in a nursing PhD data science and visualization course. Students were provided with a subscription to Chat Generative Pretrained Transformer (ChatGPT) 4.0, a general-purpose generative AI tool, for use in and outside the class. During the didactic portion, we described generative AI and its current and potential future applications in health care, including examples of appropriate and inappropriate applications. In the laboratory sessions, students were given three tasks representing different use cases of generative AI in health care practice and research (clinical decision support, patient decision support, and scientific communication) and asked to engage with ChatGPT on each. Students (n = 10) independently wrote a brief reflection for each task evaluating safety (accuracy, hallucinations) and usability (ease of use, usefulness, and intention to use in the future). Reflections were analyzed using directed content analysis. RESULTS Students were able to identify the strengths and limitations of ChatGPT in completing all three tasks and developed opinions on whether they would feel comfortable using ChatGPT for similar tasks in the future. All of them reported increasing their self-rated competency in generative AI by one to two points on a five-point rating scale. CONCLUSION This brief educational intervention supported doctoral nursing students in understanding the appropriate uses of ChatGPT, which may support their ability to appraise and use these tools in their future work.
Affiliation(s)
| | - Sergey Kisselev
- Columbia University School of Nursing, New York, New York, United States
| | - Liesbet Van Bulck
- Department of Public Health and Primary Care, KU Leuven - University of Leuven, Leuven, Belgium
| | - Suzanne Bakken
- Columbia University School of Nursing, New York, New York, United States
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Data Science Institute, Columbia University, New York, New York, United States
47
Davis NM, El-Said E, Fortune P, Shen A, Succi MD. Transforming Health Care Landscapes: The Lever of Radiology Research and Innovation on Emerging Markets Poised for Aggressive Growth. J Am Coll Radiol 2024; 21:1552-1556. [PMID: 39096946 DOI: 10.1016/j.jacr.2024.07.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Accepted: 07/30/2024] [Indexed: 08/05/2024]
Abstract
Advances in radiology are crucial not only to the future of the field but to medicine as a whole. Here, we present three emerging areas of medicine that are poised to change how health care is delivered-hospital at home, artificial intelligence, and precision medicine-and illustrate how advances in radiological tools and technologies are helping to fuel the growth of these markets in the United States and across the globe.
Affiliation(s)
- Nicole M Davis
- Innovation Office, Mass General Brigham, Somerville, Massachusetts
| | - Ezat El-Said
- Medically Engineered Solutions in Healthcare Incubator, Innovations in Operations Research Center, Massachusetts General Hospital, Boston, Massachusetts; Harvard Medical School, Boston, Massachusetts; Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts
| | - Patrick Fortune
- Vice President, Strategic Innovation Leaders at Mass General Brigham, Innovation Office, Mass General Brigham, Somerville, Massachusetts
| | - Angela Shen
- Innovation Office, Mass General Brigham, Somerville, Massachusetts; Vice President, Strategic Innovation Leaders at Mass General Brigham
| | - Marc D Succi
- Innovation Office, Mass General Brigham, Somerville, Massachusetts; Harvard Medical School, Boston, Massachusetts; Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts; Medically Engineered Solutions in Healthcare Incubator, Innovations in Operations Research Center, Massachusetts General Hospital, Boston, Massachusetts. MDS is the Associate Chair of Innovation and Commercialization at Mass General Brigham Enterprise Radiology; Strategic Innovation Leader at Mass General Brigham Innovation; Founder and Executive Director of the MESH Incubator at Mass General Brigham.
48
Hu J, Fu J, Zhao W, Lou P, Feng M, Ren H, Feng S, Li Y, Fang A. Characterizing pituitary adenomas in clinical notes: Corpus construction and its application in LLMs. Health Informatics J 2024; 30:14604582241291442. [PMID: 39379071 DOI: 10.1177/14604582241291442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/10/2024]
Abstract
Objective: Faced with the challenges of differential diagnosis caused by the complex clinical manifestations and high pathological heterogeneity of pituitary adenomas, this study aims to construct a high-quality annotated corpus to characterize pituitary adenomas in clinical notes containing rich diagnosis and treatment information. Methods: A dataset from a pituitary adenoma neurosurgery treatment center of a tertiary first-class hospital in China was retrospectively collected. A semi-automatic corpus construction framework was designed. A total of 2000 documents containing 9430 sentences and 524,232 words were annotated, and the text corpus of pituitary adenomas (TCPA) was constructed and analyzed. Its potential application in large language models (LLMs) was explored through fine-tuning and prompting experiments. Results: TCPA had 4782 medical entities and 28,998 tokens, achieving good quality with inter-annotator agreement values of 0.862-0.986. The LLM experiments showed that TCPA can be used to automatically identify clinical information from free text, and that introducing instances with clinical characteristics can effectively reduce the need for training data, thereby reducing labor costs. Conclusion: This study characterized pituitary adenomas in clinical notes, and the proposed method can serve as a reference for related research on medical natural-language scenarios with highly specialized language structures and terminology.
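Inter-annotator agreement for entity annotations of this kind is often reported at the span level; the sketch below shows one common formulation, pairwise entity-level F1 over exact (start, end, label) matches, with fabricated spans, and it is not claimed to be the exact measure used by the authors.

```python
# Minimal sketch of entity-level agreement between two annotators, expressed as
# pairwise F1 over (start, end, label) spans. The spans and labels are fabricated
# examples; treat this as one common way such agreement figures are obtained,
# not necessarily the measure used in the study.
def entity_f1(ann_a: set, ann_b: set) -> float:
    """Treat annotator A as 'gold' and B as 'prediction'; F1 is symmetric here."""
    if not ann_a and not ann_b:
        return 1.0
    tp = len(ann_a & ann_b)                  # exact span + label matches
    precision = tp / len(ann_b) if ann_b else 0.0
    recall = tp / len(ann_a) if ann_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

annotator_a = {(12, 19, "TUMOR_SIZE"), (34, 42, "HORMONE"), (60, 71, "SYMPTOM")}
annotator_b = {(12, 19, "TUMOR_SIZE"), (34, 42, "HORMONE"), (88, 95, "SYMPTOM")}
print(f"entity-level F1 = {entity_f1(annotator_a, annotator_b):.3f}")  # 2 of 3 match -> 0.667
```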
Affiliation(s)
- Jiahui Hu
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Jin Fu
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Wanqing Zhao
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Pei Lou
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Ming Feng
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Huiling Ren
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Shanshan Feng
- Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
| | - Yansheng Li
- DHC Mediway Technology Co., Ltd., Beijing, China
| | - An Fang
- Institute of Medical Information, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
49
Drouaud A, Stocchi C, Tang J, Gonsalves G, Cheung Z, Szatkowski J, Forsh D. Exploring the Performance of ChatGPT in an Orthopaedic Setting and Its Potential Use as an Educational Tool. JB JS Open Access 2024; 9:e24.00081. [PMID: 39600798 PMCID: PMC11584220 DOI: 10.2106/jbjs.oa.24.00081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/29/2024] Open
Abstract
Introduction We assessed the performance of ChatGPT-4 Vision (GPT-4V) in image interpretation, diagnosis formulation, and patient management. We aim to shed light on its potential as an educational tool for medical students working through real-life cases. Methods Ten of the most popular orthopaedic trauma cases from OrthoBullets were selected. GPT-4V interpreted medical imaging and patient information, providing diagnoses and guiding responses to OrthoBullets questions. Four fellowship-trained orthopaedic trauma surgeons rated GPT-4V's responses using a 5-point Likert scale (strongly disagree to strongly agree). Each of GPT-4V's answers was assessed for alignment with current medical knowledge (accuracy), whether its rationale was logical (rationale), relevance to the specific case (relevance), and whether surgeons would trust the answer (trustworthiness). Mean scores from the surgeon ratings were calculated. Results In total, 10 clinical cases, comprising 97 questions, were analyzed (10 imaging, 35 management, and 52 treatment). The surgeons assigned a mean overall rating of 3.46/5.00 to GPT-4V's imaging responses (accuracy 3.28, rationale 3.68, relevance 3.75, and trustworthiness 3.15). Management questions received an overall score of 3.76 (accuracy 3.61, rationale 3.84, relevance 4.01, and trustworthiness 3.58), while treatment questions had an average overall score of 4.04 (accuracy 3.99, rationale 4.08, relevance 4.15, and trustworthiness 3.93). Conclusion This is the first study evaluating GPT-4V's imaging interpretation, personalized management, and treatment approaches as a medical educational tool. Surgeon ratings indicated overall fair agreement with GPT-4V's reasoning behind its decision-making. GPT-4V performed less favorably on imaging interpretation than on management and treatment questions. The performance of GPT-4V falls below our fellowship-trained orthopaedic trauma surgeons' standards as a standalone tool for medical education.
Affiliation(s)
- Arthur Drouaud
- George Washington University School of Medicine, Washington, District of Columbia
| | - Carolina Stocchi
- Department of Orthopaedic Surgery, Mount Sinai, New York, New York
| | - Justin Tang
- Department of Orthopaedic Surgery, Mount Sinai, New York, New York
| | - Grant Gonsalves
- Department of Orthopaedic Surgery, Mount Sinai, New York, New York
| | - Zoe Cheung
- Department of Orthopaedic Surgery, Staten Island University Hospital, Staten Island, New York
| | - Jan Szatkowski
- Department of Orthopaedic Surgery, Indiana University Health Methodist Hospital, Indianapolis, Indiana
| | - David Forsh
- Department of Orthopaedic Surgery, Mount Sinai, New York, New York
50
Milad D, Antaki F, Milad J, Farah A, Khairy T, Mikhail D, Giguère CÉ, Touma S, Bernstein A, Szigiato AA, Nayman T, Mullie GA, Duval R. Assessing the medical reasoning skills of GPT-4 in complex ophthalmology cases. Br J Ophthalmol 2024; 108:1398-1405. [PMID: 38365427 DOI: 10.1136/bjo-2023-325053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Accepted: 02/07/2024] [Indexed: 02/18/2024]
Abstract
BACKGROUND/AIMS This study assesses the proficiency of Generative Pre-trained Transformer (GPT)-4 in answering questions about complex clinical ophthalmology cases. METHODS We tested GPT-4 on 422 Journal of the American Medical Association Ophthalmology Clinical Challenges, and prompted the model to determine the diagnosis (open-ended question) and identify the next-step (multiple-choice question). We generated responses using two zero-shot prompting strategies, including zero-shot plan-and-solve+ (PS+), to improve the reasoning of the model. We compared the best-performing model to human graders in a benchmarking effort. RESULTS Using PS+ prompting, GPT-4 achieved mean accuracies of 48.0% (95% CI (43.1% to 52.9%)) and 63.0% (95% CI (58.2% to 67.6%)) in diagnosis and next step, respectively. Next-step accuracy did not significantly differ by subspecialty (p=0.44). However, diagnostic accuracy in pathology and tumours was significantly higher than in uveitis (p=0.027). When the diagnosis was accurate, 75.2% (95% CI (68.6% to 80.9%)) of the next steps were correct. Conversely, when the diagnosis was incorrect, 50.2% (95% CI (43.8% to 56.6%)) of the next steps were accurate. The next step was three times more likely to be accurate when the initial diagnosis was correct (p<0.001). No significant differences were observed in diagnostic accuracy and decision-making between board-certified ophthalmologists and GPT-4. Among trainees, senior residents outperformed GPT-4 in diagnostic accuracy (p≤0.001 and 0.049) and in accuracy of next step (p=0.002 and 0.020). CONCLUSION Improved prompting enhances GPT-4's performance in complex clinical situations, although it does not surpass ophthalmology trainees in our context. Specialised large language models hold promise for future assistance in medical decision-making and diagnosis.
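The zero-shot plan-and-solve+ (PS+) strategy mentioned above works by appending a planning-and-reasoning trigger to the question; the sketch below shows how such a prompt might be assembled, with the trigger wording adapted from the PS+ prompting literature and a fabricated vignette in place of a JAMA Ophthalmology challenge, so it should not be read as the authors' exact prompt.

```python
# Illustrative sketch of assembling a zero-shot plan-and-solve-style prompt for an
# open-ended diagnosis question. The trigger phrase is adapted from the PS+ prompting
# literature and the clinical vignette is a fabricated placeholder; this is not the
# authors' exact prompt.
PS_PLUS_TRIGGER = (
    "Let's first understand the problem and extract the relevant clinical variables. "
    "Then let's devise a plan, carry it out step by step (paying attention to the key "
    "findings), and state the single most likely diagnosis."
)

def build_prompt(case_description: str) -> str:
    return (
        "You are an ophthalmology clinical reasoning assistant.\n\n"
        f"Case: {case_description}\n\n"
        f"{PS_PLUS_TRIGGER}"
    )

vignette = ("A 62-year-old with painless vision loss in one eye over two weeks and "
            "a relative afferent pupillary defect.")  # placeholder, not a real challenge case
print(build_prompt(vignette))
```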
Affiliation(s)
- Daniel Milad
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
| | - Fares Antaki
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Institute of Ophthalmology, University College London, London, UK
- CHUM School of Artificial Intelligence in Healthcare (SAIH), Centre Hospitalier de l'Université de Montréal (CHUM), Montreal, Quebec, Canada
| | - Jason Milad
- Department of Software Engineering, University of Waterloo, Waterloo, Ontario, Canada
| | - Andrew Farah
- Faculty of Medicine, McGill University, Montreal, Quebec, Canada
| | - Thomas Khairy
- Faculty of Medicine, McGill University, Montreal, Quebec, Canada
| | - David Mikhail
- Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Charles-Édouard Giguère
- Centre de recherche de l'Institut universitaire en santé mentale de Montréal, Montréal, Quebec, Canada
| | - Samir Touma
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
| | - Allison Bernstein
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
| | - Andrei-Alexandru Szigiato
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital du Sacré-Coeur de Montréal, Montreal, Quebec, Canada
| | - Taylor Nayman
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada
| | - Guillaume A Mullie
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Cité-de-la-Santé Hospital, Laval, Quebec, Canada
| | - Renaud Duval
- Department of Ophthalmology, University of Montreal, Montreal, Quebec, Canada
- Department of Ophthalmology, Hôpital Maisonneuve-Rosement, Montreal, Quebec, Canada