1
Kunze KN, Varady NH, Mazzucco M, Lu AZ, Chahla J, Martin RK, Ranawat AS, Pearle AD, Williams RJ. The Large Language Model ChatGPT-4 Exhibits Excellent Triage Capabilities and Diagnostic Performance for Patients Presenting With Various Causes of Knee Pain. Arthroscopy 2025; 41:1438-1447.e14. [PMID: 38925234] [DOI: 10.1016/j.arthro.2024.06.021]
Abstract
PURPOSE To provide a proof-of-concept analysis of the appropriateness and performance of ChatGPT-4 to triage, synthesize differential diagnoses, and generate treatment plans concerning common presentations of knee pain. METHODS Twenty inputs (10 knee complaints warranting triage and 10 expanded clinical scenarios) were entered into ChatGPT-4, with memory cleared prior to each new input to mitigate bias. For the 10 triage complaints, ChatGPT-4 was asked to generate a differential diagnosis that was graded for accuracy and suitability against a differential created by 2 orthopaedic sports medicine physicians. For the 10 clinical scenarios, ChatGPT-4 was prompted to provide treatment guidance for the patient, which was again graded. To test the higher-order capabilities of ChatGPT-4, further inquiry into these specific management recommendations was performed and graded. RESULTS All ChatGPT-4 diagnoses were deemed appropriate within the spectrum of potential pathologies on a differential. The top diagnosis on the differential was identical between surgeons and ChatGPT-4 for 70% of scenarios, and the top diagnosis provided by the surgeon appeared as either the first or second diagnosis in 90% of scenarios. Overall, 16 of 30 diagnoses (53.3%) in the differential were identical. When provided with 10 expanded vignettes with a single diagnosis, the accuracy of ChatGPT-4 increased to 100%, with the suitability of management graded as appropriate in 90% of cases. Specific information pertaining to conservative management, surgical approaches, and related treatments was appropriate and accurate in 100% of cases. CONCLUSIONS ChatGPT-4 provided clinically reasonable diagnoses to triage patient complaints of knee pain due to various underlying conditions, and these were generally consistent with the differentials provided by sports medicine physicians. Diagnostic performance was enhanced when additional information was provided, allowing ChatGPT-4 to reach high predictive accuracy for recommendations concerning management and treatment options. However, ChatGPT-4 may show clinically important error rates for diagnosis depending on the prompting strategy and information provided; therefore, further refinements are necessary prior to implementation into clinical workflows. CLINICAL RELEVANCE Although ChatGPT-4 is increasingly being used by patients for health information, its potential to serve as a clinical support tool is unclear. In this study, we found that ChatGPT-4 was frequently able to diagnose and triage knee complaints appropriately, as rated by sports medicine surgeons, suggesting that it may eventually be a useful clinical support tool.
Affiliation(s)
- Kyle N Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.; Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Nathan H Varady
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.; Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Amy Z Lu
- Weill Cornell College of Medicine, New York, New York, U.S.A.
- Jorge Chahla
- Department of Orthopaedic Surgery, Rush University Medical Center, Chicago, Illinois, U.S.A.
- R Kyle Martin
- Department of Orthopedic Surgery, University of Minnesota, Minneapolis, Minnesota, U.S.A.
- Anil S Ranawat
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.; Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Andrew D Pearle
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.; Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
- Riley J Williams
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, U.S.A.; Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, U.S.A.
2
Ganjavi C, Melamed S, Biedermann B, Eppler MB, Rodler S, Layne E, Cei F, Gill I, Cacciamani GE. Generative artificial intelligence in oncology. Curr Opin Urol 2025; 35:205-213. [PMID: 40026054] [DOI: 10.1097/mou.0000000000001272]
Abstract
PURPOSE OF REVIEW By leveraging models such as large language models (LLMs) and generative computer vision tools, generative artificial intelligence (GAI) is reshaping cancer research and oncologic practice from diagnosis to treatment to follow-up. This timely review provides a comprehensive overview of the current applications and future potential of GAI in oncology, including in urologic malignancies. RECENT FINDINGS GAI has demonstrated significant potential in improving cancer diagnosis by integrating multimodal data, streamlining diagnostic workflows, and assisting in imaging interpretation. In treatment, GAI shows promise in aligning clinical decisions with guidelines, optimizing systemic therapy choices, and aiding patient education. Posttreatment, GAI applications include streamlining administrative tasks, improving follow-up care, and monitoring adverse events. In urologic oncology, GAI shows promise in image analysis, clinical data extraction, and outcomes research. Future developments in GAI could stimulate oncologic discovery, improve clinical efficiency, and enhance the patient-physician relationship. SUMMARY Integration of GAI into oncology has shown some ability to enhance diagnostic accuracy, optimize treatment decisions, and improve clinical efficiency, ultimately strengthening the patient-physician relationship. Despite these advancements, the inherent stochasticity of GAI's performance necessitates human oversight, more specialized models, proper physician training, and robust guidelines to ensure its safe and effective integration into oncologic practice.
Affiliation(s)
- Conner Ganjavi
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Sam Melamed
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Brett Biedermann
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Michael B Eppler
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Severin Rodler
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Ethan Layne
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Francesco Cei
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Inderbir Gill
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
- Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine
- AI Center at USC Urology, USC Institute of Urology, University of Southern California, Los Angeles, California, USA
3
Zou Y, Ye R, Gao Y, Zhou J, Li Y, Chen W, Zha F, Wang Y. Comparison of triage performance among DRP tool, ChatGPT, and outpatient rehabilitation doctors. Sci Rep 2025; 15:14084. [PMID: 40269240] [PMCID: PMC12019411] [DOI: 10.1038/s41598-025-99216-0]
Abstract
With the increasing need for rehabilitation, efficient triage is crucial. This study aims to explore the performance of the distributing rehabilitation patients (DRP) tool, ChatGPT, and outpatient doctors in rehabilitation patient triage, and to compare their strengths and limitations. This is a multicenter cross-sectional study. A total of 300 rehabilitation patients were selected from 27 medical institutions in 15 cities. Patients were assessed by three methods: doctor assessment, ChatGPT, and the DRP tool. Three groups were formed according to these methods: the Doctor Group, the ChatGPT Group, and the Tool Group. Triage outcomes were outpatient rehabilitation, inpatient care at a primary healthcare institution, inpatient care at a secondary hospital, inpatient care at a tertiary hospital, and inpatient care at a nursing home or long-term care institution. The consistency of triage across methods was analyzed. Significant differences were observed between the Doctor Group and both the ChatGPT Group and the Tool Group (P < 0.01; P < 0.01), whereas no significant difference was observed between the ChatGPT Group and the Tool Group (P = 0.29). Consistency analysis revealed fair consistency between the Doctor Group and both the Tool Group and the ChatGPT Group, whereas the Tool Group and the ChatGPT Group showed good consistency. The Tool Group and the ChatGPT Group also achieved the highest percentage agreement, at 63.67%. This study revealed differences between the DRP tool, ChatGPT, and traditional triage methods. The DRP tool may be more suitable for rehabilitation triage.
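For orientation, the consistency analysis described in this abstract reduces to two quantities per rater pair: raw percent agreement and Cohen's kappa, which discounts agreement expected by chance. A minimal Python sketch follows (not the authors' code; the triage destinations and ratings are invented for illustration):

```python
# Minimal sketch: percent agreement and Cohen's kappa between two triage
# raters, e.g. the DRP tool vs. ChatGPT. Labels are hypothetical destinations.
from collections import Counter

tool    = ["outpatient", "secondary", "tertiary", "primary", "tertiary", "outpatient"]
chatgpt = ["outpatient", "secondary", "secondary", "primary", "tertiary", "outpatient"]

n = len(tool)
# Observed agreement: fraction of cases triaged to the same destination.
p_o = sum(a == b for a, b in zip(tool, chatgpt)) / n

# Expected chance agreement: product of each rater's marginal label frequencies.
f_tool, f_gpt = Counter(tool), Counter(chatgpt)
p_e = sum((f_tool[lbl] / n) * (f_gpt[lbl] / n) for lbl in set(tool) | set(chatgpt))

kappa = (p_o - p_e) / (1 - p_e)
print(f"percent agreement = {p_o:.2%}, Cohen's kappa = {kappa:.2f}")
```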
Affiliation(s)
- Yucong Zou
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Ruixue Ye
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Yan Gao
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Jing Zhou
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Yawei Li
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Wenshi Chen
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Fubing Zha
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China.
- Yulong Wang
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China.
4
Alsharawneh A, Elshatarat RA, Alsulami GS, Alrabab'a MH, Al-Za'areer MS, Alhumaidi BN, Almagharbeh WT, Al Niarat TF, Al-Sayaghi KM, Saleh ZT. Triage decisions and health outcomes among oncology patients: a comparative study of medical and surgical cancer cases in emergency departments. BMC Emerg Med 2025; 25:69. [PMID: 40254595] [PMCID: PMC12010577] [DOI: 10.1186/s12873-025-01191-2]
Abstract
BACKGROUND Cancer-related emergencies are a significant challenge for healthcare systems globally, including in Jordan. Effective triage is critical in ensuring timely and accurate prioritization of care, especially for surgical cancer patients requiring urgent intervention. However, under-triage (the misclassification of high-acuity patients into lower-urgency categories) can lead to significant delays and worsened outcomes. Despite the recognized importance of accurate triage, limited research has evaluated its impact on cancer patients in Jordan, particularly those requiring surgical care. OBJECTIVES This study aimed to evaluate the timeliness and prioritization of care for cancer patients admitted through the emergency department (ED) in Jordan. The specific objectives were to examine the association between under-triage and treatment delays and to assess its impact on key outcomes, including time to physician assessment, time to treatment, and hospital length of stay. METHODS A retrospective cohort design was used to analyze data from 481 cancer patients admitted through the ED in four governmental hospitals across Jordan. Two cohorts were established: surgical cancer patients requiring emergency interventions and non-surgical cancer patients presenting with other oncological emergencies. Triage accuracy was assessed using the Canadian Triage and Acuity Scale (CTAS), and under-triage was identified when patients requiring high-urgency care (CTAS I-III) were misclassified into lower-urgency categories (CTAS IV-V). Data were collected from electronic health records and analyzed using multiple linear regression to evaluate the association between under-triage and treatment outcomes. RESULTS The majority of patients were elderly, with a mean age of 62.6 years (± 10.7), and a significant proportion presented with advanced-stage cancer (83.4% in stages III and IV). Surgical patients frequently exhibited severe symptoms such as acute pain (51.6%) and respiratory discomfort (41.1%). Under-triage rates were 44.1% for surgical patients and 39.4% for non-surgical patients. Among surgical patients, under-triage significantly delayed time to physician assessment (β = 34.9 min, p < 0.001) and time to treatment (β = 68.0 min, p < 0.001). For non-surgical patients, under-triage delays were even greater, with prolonged physician assessment times (β = 48.6 min, p < 0.001) and ED length of stay (β = 7.3 h, p < 0.001). Both cohorts experienced significant increases in hospital length of stay (surgical: β = 3.2 days, p = 0.008; non-surgical: β = 3.2 days, p < 0.001). CONCLUSION Under-triage in Jordanian EDs is strongly associated with significant delays in care for both surgical and non-surgical cancer patients, highlighting systemic gaps in acuity recognition and triage processes. These findings underscore the need for targeted interventions to improve triage accuracy, particularly through oncology-specific training and the integration of evidence-based tools such as the SIRS criteria. Enhancing ED processes for cancer patients is crucial to reducing delays, optimizing resource allocation, and improving clinical outcomes in this vulnerable population. CLINICAL TRIAL NUMBER Not applicable.
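The β coefficients reported above come from multiple linear regression of each delay outcome on an under-triage indicator. A minimal Python sketch of that kind of model follows (not the authors' analysis; the column names and data are hypothetical):

```python
# Minimal sketch: estimating the adjusted delay associated with under-triage
# with ordinary least squares, analogous to the reported beta coefficients.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "time_to_physician_min": [25, 70, 40, 95, 30, 80],
    "under_triaged":         [0,  1,  0,  1,  0,  1],   # 1 = CTAS I-III case assigned CTAS IV-V
    "age":                   [58, 66, 71, 63, 55, 69],
    "surgical_cohort":       [1,  1,  0,  0,  1,  0],   # 1 = surgical cancer patient
})

# The coefficient on `under_triaged` is the adjusted difference in minutes,
# i.e. the quantity reported as beta = 34.9 min for surgical patients.
model = smf.ols("time_to_physician_min ~ under_triaged + age + surgical_cohort", data=df).fit()
print(model.summary())
```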
Affiliation(s)
- Anas Alsharawneh
- Department of Adult Health Nursing, Faculty of Nursing, The Hashemite University, Zarqa, Jordan.
- Rami A Elshatarat
- Department of Medical and Surgical Nursing, College of Nursing, Taibah University, Madinah, Saudi Arabia
- Ghaida Shujayyi Alsulami
- Department of Clinical Nursing Practices, Faculty of Nursing, Umm Al-Qura University, Makkah, Saudi Arabia
- Mahmoud H Alrabab'a
- Prince Al-Hussein Bin Abdullah II Academy for Civil Protection, Al-Balqa Applied University, Salt, Jordan
- Majed S Al-Za'areer
- College of Health Science and Nursing, Al-Rayan Colleges, Madinah, Saudi Arabia
- Bandar Naffaa Alhumaidi
- Department of Community Health Nursing, College of Nursing, Taibah University, Madinah, Saudi Arabia
- Wesam T Almagharbeh
- Medical Surgical Nursing Department, Faculty of Nursing, University of Tabuk, Tabuk, Saudi Arabia
- Khaled M Al-Sayaghi
- Department of Medical and Surgical Nursing, College of Nursing, Taibah University, Madinah, Saudi Arabia
- Nursing Division, Faculty of Medicine and Health Sciences, Sana'a University, Sana'a, Yemen
- Zyad T Saleh
- Department of Clinical Nursing, School of Nursing, The University of Jordan, Amman, Jordan
- Department of Nursing, Vision College, Riyadh, Saudi Arabia
5
Pasli S, Yadigaroğlu M, Kirimli EN, Beşer MF, Unutmaz İ, Ayhan AÖ, Karakurt B, Şahin AS, Hiçyilmaz Hİ, Imamoğlu M. ChatGPT-supported patient triage with voice commands in the emergency department: A prospective multicenter study. Am J Emerg Med 2025; 94:63-70. [PMID: 40273640] [DOI: 10.1016/j.ajem.2025.04.040]
Abstract
BACKGROUND Triage aims to prioritize patients according to their medical urgency by accurately evaluating their clinical conditions, managing waiting times efficiently, and improving the overall effectiveness of emergency care. This study aims to assess ChatGPT's performance in patient triage across four emergency departments with varying dynamics and to provide a detailed analysis of its strengths and weaknesses. METHODS In this multicenter, prospective study, we compared the triage decisions made by ChatGPT-4o and by triage personnel with the gold standard decisions determined by an emergency medicine (EM) specialist. In the hospitals where we conducted the study, triage teams routinely direct patients to the appropriate ED areas based on the Emergency Severity Index (ESI) system and the hospital's local triage protocols. During the study period, the triage team collected patient data, including chief complaints, comorbidities, and vital signs, and used this information to make the initial triage decisions. An independent physician simultaneously entered the same data into ChatGPT using voice commands. At the same time, an EM specialist, present in the triage room throughout the study period, reviewed the same patient data and determined the gold standard triage decisions, strictly adhering to both the hospital's local protocols and the ESI system. Before initiating the study, we customized ChatGPT for each hospital by designing prompts that incorporated both the general principles of the ESI triage system and the specific triage rules of each hospital. The model's overall, hospital-based, and area-based performance was evaluated, with Cohen's kappa, F1 score, and performance analyses conducted. RESULTS This study included 6657 patients. Overall agreement with the gold standard was nearly perfect for both triage personnel and GPT-4o (Cohen's kappa = 0.782 and 0.833, respectively). The overall F1 score was 0.863 for the triage team, while GPT-4o achieved an F1 score of 0.897, demonstrating superior performance. ROC curve analysis showed the lowest performance in the yellow zone of a tertiary hospital (AUC = 0.75) and in the red zone of another tertiary hospital (AUC = 0.78). Overall, however, AUC values greater than 0.90 were observed, indicating high accuracy. CONCLUSION ChatGPT generally outperformed triage personnel in patient triage across emergency departments with varying conditions, demonstrating high agreement with the gold standard decisions. However, in tertiary hospitals, its performance was relatively lower in triaging patients with more complex symptoms, particularly those requiring triage to the yellow and red zones.
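The kappa and F1 statistics reported here can be computed from paired triage assignments against the gold standard. A minimal Python sketch follows (not the authors' code; the zone labels and assignments are invented):

```python
# Minimal sketch: scoring triage assignments against a gold standard with
# Cohen's kappa and a weighted F1, the two headline metrics in this study.
from sklearn.metrics import cohen_kappa_score, f1_score

gold  = ["green", "yellow", "red", "green", "yellow", "red", "green"]
gpt4o = ["green", "yellow", "red", "green", "red",    "red", "green"]

kappa = cohen_kappa_score(gold, gpt4o)
# A weighted F1 averages per-zone F1 scores by each zone's prevalence, which
# matters when zones are imbalanced (most ED visits are low acuity).
f1 = f1_score(gold, gpt4o, average="weighted")

print(f"kappa = {kappa:.3f}, weighted F1 = {f1:.3f}")
```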
Affiliation(s)
- Sinan Pasli
- Karadeniz Technical University, School of Medicine, Department of Emergency Medicine, Trabzon, Turkey.
- Metin Yadigaroğlu
- Samsun University, School of Medicine, Department of Emergency Medicine, Samsun, Turkey
- İhsan Unutmaz
- Samsun University, School of Medicine, Department of Emergency Medicine, Samsun, Turkey
- Asu Özden Ayhan
- Karadeniz Technical University, School of Medicine, Department of Emergency Medicine, Trabzon, Turkey
- Büşra Karakurt
- Samsun University, School of Medicine, Department of Emergency Medicine, Samsun, Turkey
- Abdul Samet Şahin
- Karadeniz Technical University, School of Medicine, Department of Emergency Medicine, Trabzon, Turkey
- Halil İbrahim Hiçyilmaz
- Deggendorf Institute of Technology, Artificial Intelligence for Smart Sensors and Actuators, Deggendorf, Germany
- Melih Imamoğlu
- Karadeniz Technical University, School of Medicine, Department of Emergency Medicine, Trabzon, Turkey
6
Alsumait A, Deshmukh S, Wang C, Leffler CT. Triage of Patient Messages Sent to the Eye Clinic via the Electronic Medical Record: A Comparative Study on AI and Human Triage Performance. J Clin Med 2025; 14:2395. [PMID: 40217845] [PMCID: PMC11989310] [DOI: 10.3390/jcm14072395]
Abstract
Background/Objectives: To assess the ability of ChatGPT-4 (GPT-4) to effectively triage patient messages sent to the general eye clinic at our institution. Methods: Patient messages sent to the general eye clinic via MyChart were de-identified and then triaged by an ophthalmologist-in-training (MD) as well as by GPT-4, with two main objectives. Both the MD and GPT-4 were asked to direct patients to either general or specialty eye clinics, urgently or nonurgently, depending on the severity of the condition. Main Outcomes: GPT-4's ability to (1) accurately direct patient messages to a general or specialty eye clinic and (2) determine the time frame within which the patient needed to be seen (triage acuity). Accuracy was determined by the percent agreement between the recommendations given by GPT-4 and those given by the MD. Results: The study included 139 messages. Percent agreement between the ophthalmologist-in-training and GPT-4 was 64.7% for the general/specialty clinic recommendation and 60.4% for triage acuity. Cohen's kappa was 0.33 and 0.67 for specialty clinic and triage urgency, respectively. GPT-4 recommended a triage acuity equal to or sooner than the ophthalmologist-in-training's for 93.5% of cases and recommended a less urgent triage acuity in 6.5% of cases. Conclusions: Our study indicates that an AI system such as GPT-4 should complement rather than replace physician judgment in triaging ophthalmic complaints. These systems may assist providers and reduce the workload of ophthalmologists and ophthalmic technicians as GPT-4 becomes more adept at triaging ophthalmic issues. Additionally, the integration of AI into ophthalmic triage could have therapeutic implications by ensuring timely and appropriate care, potentially improving patient outcomes by reducing delays in treatment. Combining GPT-4 with human expertise can improve service delivery speeds and patient outcomes while safeguarding against potential AI pitfalls.
Affiliation(s)
- Abdulaziz Alsumait
- Department of Ophthalmology, Henry Ford Hospital, Detroit, MI 48202, USA
- Sharanya Deshmukh
- Virginia Commonwealth University School of Medicine, Richmond, VA 23298, USA
- Christine Wang
- Wilmer Eye Institute, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Christopher T. Leffler
- Department of Ophthalmology, Virginia Commonwealth University School of Medicine, 401 N. 11th St., Richmond, VA 23298, USA
7
Takita H, Kabata D, Walston SL, Tatekawa H, Saito K, Tsujimoto Y, Miki Y, Ueda D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med 2025; 8:175. [PMID: 40121370] [PMCID: PMC11929846] [DOI: 10.1038/s41746-025-01543-z]
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between AI models and physicians overall (p = 0.10) or non-expert physicians (p = 0.93). However, AI models performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher performance compared to non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with appropriate understanding of its limitations.
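Pooled estimates such as the overall diagnostic accuracy reported here are commonly obtained by a random-effects meta-analysis of per-study proportions. A minimal Python sketch of a DerSimonian-Laird pooling on the logit scale follows (not the authors' code, and not necessarily their exact model; the study counts are hypothetical):

```python
# Minimal sketch: DerSimonian-Laird random-effects pooling of per-study
# diagnostic accuracies on the logit scale.
import numpy as np

correct = np.array([30, 55, 12, 80])     # correct diagnoses per study (hypothetical)
total   = np.array([60, 90, 30, 170])    # cases per study (hypothetical)

p = correct / total
y = np.log(p / (1 - p))                  # logit-transformed accuracy
v = 1 / correct + 1 / (total - correct)  # approximate variance of the logit

w = 1 / v                                # inverse-variance (fixed-effect) weights
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)       # Cochran's Q heterogeneity statistic
tau2 = max(0.0, (Q - (len(y) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))

w_re = 1 / (v + tau2)                    # random-effects weights
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))

pooled = 1 / (1 + np.exp(-y_re))         # back-transform to a proportion
ci = 1 / (1 + np.exp(-(y_re + np.array([-1.96, 1.96]) * se_re)))
print(f"pooled accuracy = {pooled:.1%} (95% CI {ci[0]:.1%}-{ci[1]:.1%})")
```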
Affiliation(s)
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daijiro Kabata
- Center for Mathematical and Data Science, Kobe University, Kobe, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Kenichi Saito
- Center for Digital Transformation of Health Care, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Yasushi Tsujimoto
- Oku Medical Clinic, Osaka, Japan
- Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto University, Kyoto, Japan
- Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan
8
Birol NY, Çiftci HB, Yılmaz A, Çağlayan A, Alkan F. Is there any room for ChatGPT AI bot in speech-language pathology? Eur Arch Otorhinolaryngol 2025. [PMID: 40025183] [DOI: 10.1007/s00405-025-09295-y]
Abstract
PURPOSE This study investigates the potential of the ChatGPT-4.0 artificial intelligence bot to assist speech-language pathologists (SLPs) by assessing its accuracy, comprehensiveness, and relevance in various tasks related to speech, language, and swallowing disorders. METHOD In this cross-sectional descriptive study, 15 practicing SLPs evaluated ChatGPT-4.0's responses to task-specific queries across six core areas: report writing, assessment material generation, clinical decision support, therapy stimulus generation, therapy planning, and client/family training material generation. English prompts were created in seven areas: speech sound disorders, motor speech disorders, aphasia, stuttering, childhood language disorders, voice disorders, and swallowing disorders. These prompts were entered into ChatGPT-4.0, and its responses were evaluated. Using a three-point Likert-type scale, participants rated each response for accuracy, relevance, and comprehensiveness based on clinical expectations and their professional judgment. RESULTS The study revealed that ChatGPT-4.0 performed with predominantly high accuracy, comprehensiveness, and relevance in tasks related to speech and language disorders. High accuracy, comprehensiveness, and relevance levels were observed in report writing, clinical decision support, and creating education material. However, tasks such as creating therapy stimuli and therapy planning showed more variation with medium and high accuracy levels. CONCLUSIONS ChatGPT-4.0 shows promise in assisting SLPs with various professional tasks, particularly report writing, clinical decision support, and education material creation. However, further research is needed to address its limitations in therapy stimulus generation and therapy planning to improve its usability in clinical practice. Integrating AI technologies such as ChatGPT could improve the efficiency and effectiveness of therapeutic processes in speech-language pathology.
Affiliation(s)
- Namık Yücel Birol
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye.
- Hilal Berber Çiftci
- Department of Speech and Language Therapy, Faculty of Health Sciences, Tarsus University, Mersin, Türkiye
- Ayşegül Yılmaz
- Department of Speech and Language Therapy, Graduate School of Health Sciences, İstanbul Medipol University, İstanbul, Türkiye
- Ayhan Çağlayan
- Çağlayan Speech and Language Therapy Center, İzmir, Türkiye
- Ferhat Alkan
- Department of Speech and Language Therapy, Institute of Graduate Education, İstinye University, İstanbul, Türkiye
9
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463] [PMCID: PMC11795331] [DOI: 10.1001/jamanetworkopen.2024.57879]
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice, to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian, yielding 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while fewer than one third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
- Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
- Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
10
Nadarajasundaram A, Harrow S. The Role of Artificial Intelligence in Triaging Patients in Eye Casualty Departments: A Systematic Review. Cureus 2025; 17:e78144. [PMID: 39877053] [PMCID: PMC11774558] [DOI: 10.7759/cureus.78144]
Abstract
Visual impairment and eye disease remain a significant burden, highlighting the need for further support of eye care services. Artificial intelligence (AI) and its rapid advancements provide a potential avenue for transforming healthcare and addressing the growing challenges in eye health, and could assist in settings such as eye casualty departments. This review aims to evaluate current studies on AI implementation in eye casualty triage to understand its potential application in the future. A systematic review was conducted across a range of sources and databases, initially identifying 77 records, of which four studies were included in the final analysis. The findings demonstrated that AI tools can triage patients consistently and accurately and can improve work efficiency without compromising safety. However, we note limitations of the included studies, such as limited external validation of results and limited general applicability at present. Additionally, all the studies highlight the need for further research and testing to allow better understanding and validation of AI tools in eye casualty triaging.
Affiliation(s)
- Simeon Harrow
- Emergency Medicine Department, Maidstone and Tunbridge Wells NHS Trust, Maidstone, GBR
11
Cao JJ, Kwon DH, Ghaziani TT, Kwo P, Tse G, Kesselman A, Kamaya A, Tse JR. Large language models' responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol (NY) 2024; 49:4286-4294. [PMID: 39088019] [DOI: 10.1007/s00261-024-04501-7]
Abstract
PURPOSE To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management. METHODS Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Responses were categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true, but does not fully answer the question or provides irrelevant information), or inaccurate (score -1; any information is false). Means with standard deviations were recorded. A response was considered accurate overall if its mean score was > 0, and reliable if the mean score was > 0 across all responses to the single question. Responses were also quantified for readability using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests. RESULTS Of the twenty questions, ChatGPT answered 9 (45%), Gemini answered 12 (60%), and Bing answered 6 (30%) accurately; however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were both accurate and reliable. There were no significant differences in accuracy between any chatbot. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate), followed by Gemini (30; college) and Bing (40; college; p < 0.001). CONCLUSION Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
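The readability figures cited here come from standard formulas such as the Flesch Reading Ease and Flesch-Kincaid Grade Level. A minimal Python sketch using the textstat package follows (not the authors' code; the sample response is invented):

```python
# Minimal sketch: scoring a chatbot response with the Flesch Reading Ease and
# Flesch-Kincaid Grade Level readability formulas.
import textstat

response = (
    "Hepatocellular carcinoma surveillance is usually performed with liver "
    "ultrasound every six months, sometimes together with an AFP blood test."
)

# Flesch Reading Ease: higher is easier (90-100 ~ 5th grade; below 30 ~ college graduate).
ease = textstat.flesch_reading_ease(response)
# Flesch-Kincaid Grade Level: approximate US school grade needed to understand the text.
grade = textstat.flesch_kincaid_grade(response)

print(f"Flesch Reading Ease = {ease:.0f}, Flesch-Kincaid Grade Level = {grade:.1f}")
```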
Affiliation(s)
- Jennie J Cao
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Daniel H Kwon
- Department of Medicine, School of Medicine, University of California, San Francisco, 505 Parnassus Ave, MC1286C, San Francisco, CA, 94144, USA
- Tara T Ghaziani
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Paul Kwo
- Department of Medicine, Stanford University School of Medicine, 430 Broadway St MC 6341, Redwood City, CA, 94063, USA
- Gary Tse
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles, 757 Westwood Plaza, Los Angeles, CA, 90095, USA
- Andrew Kesselman
- Department of Radiology, Stanford University School of Medicine, 875 Blake Wilbur Drive, Palo Alto, CA, 94304, USA
- Aya Kamaya
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA
- Justin R Tse
- Department of Radiology, Stanford University School of Medicine, 300 Pasteur Drive, Room H-1307, Stanford, CA, 94305, USA.
12
Schmidl B, Hütten T, Pigorsch S, Stögbauer F, Hoch CC, Hussain T, Wollenberg B, Wirth M. Evaluation of artificial intelligence in the therapy of oropharyngeal squamous cell carcinoma: De-escalation via Claude 3 Opus, Vertex AI and ChatGPT 4.0? - an experimental study. Int J Surg 2024; 110:8256-8260. [PMID: 39806758] [PMCID: PMC11634083] [DOI: 10.1097/js9.0000000000002139]
Affiliation(s)
- Benedikt Schmidl
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Tobias Hütten
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Steffi Pigorsch
- Department of Radio Oncology, Technical University Munich, Munich, Germany
- Fabian Stögbauer
- Institute of Pathology, Technical University Munich, Munich, Germany
- Cosima C. Hoch
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Timon Hussain
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Barbara Wollenberg
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Markus Wirth
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
13
Lim B, Seth I, Cuomo R, Kenney PS, Ross RJ, Sofiadellis F, Pentangelo P, Ceccaroni A, Alfano C, Rozen WM. Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients. Aesthetic Plast Surg 2024; 48:4712-4724. [PMID: 38898239] [PMCID: PMC11645314] [DOI: 10.1007/s00266-024-04157-0]
Abstract
BACKGROUND Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries. METHODS This study assessed the efficacy of four leading LLMs, OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot, using fifteen unique prompts. All outputs were evaluated for readability using the Flesch-Kincaid Grade Level, Flesch Reading Ease score, and Coleman-Liau index. The DISCERN score and a Likert scale were used to evaluate quality. Scores were assigned by two plastic surgery residents and then reviewed and discussed by five specialist plastic surgeons until consensus was reached. RESULTS ChatGPT-3.5 required the highest reading level for comprehension, followed by Gemini, Claude, and then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although these were of limited helpfulness and acceptability, and it faced limitations in responding to certain queries. CONCLUSION ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education. LEVEL OF EVIDENCE V.
Affiliation(s)
- Bryan Lim
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Ishith Seth
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Roberto Cuomo
- Plastic Surgery Unit, Department of Medicine, Surgery and Neuroscience, University of Siena, Siena, Italy
- Peter Sinkjær Kenney
- Department of Plastic Surgery, Vejle Hospital, Beriderbakken 4, 7100, Vejle, Denmark
- Department of Plastic and Breast Surgery, Aarhus University Hospital, Aarhus, Denmark
- Richard J Ross
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Foti Sofiadellis
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
- Warren Matthew Rozen
- Department of Plastic Surgery, Peninsula Health, Melbourne, Victoria, 3199, Australia
14
Schmidl B, Hütten T, Pigorsch S, Stögbauer F, Hoch CC, Hussain T, Wollenberg B, Wirth M. Assessing the use of the novel tool Claude 3 in comparison to ChatGPT 4.0 as an artificial intelligence tool in the diagnosis and therapy of primary head and neck cancer cases. Eur Arch Otorhinolaryngol 2024; 281:6099-6109. [PMID: 39112556] [PMCID: PMC11512878] [DOI: 10.1007/s00405-024-08828-1]
Abstract
OBJECTIVES Head and neck squamous cell carcinoma (HNSCC) is a complex malignancy that requires a multidisciplinary tumor board approach for individual treatment planning. In recent years, artificial intelligence tools have emerged to assist healthcare professionals in making informed treatment decisions. This study investigates the application of the newly published LLM Claude 3 Opus, compared with the currently most advanced LLM ChatGPT 4.0, for the diagnosis and therapy planning of primary HNSCC. The results were compared with those of a conventional multidisciplinary tumor board (MDT). MATERIALS AND METHODS We conducted a study in March 2024 on 50 consecutive primary head and neck cancer cases. The diagnostic workup and MDT recommendations were compared with the Claude 3 Opus and ChatGPT 4.0 recommendations for each patient and rated by two independent reviewers for the following parameters: clinical recommendation, explanation, and summarization, in addition to the Artificial Intelligence Performance Instrument (AIPI). RESULTS Claude 3 achieved better scores for the diagnostic workup of patients than ChatGPT 4.0 and provided treatment recommendations involving surgery, chemotherapy, and radiation therapy. In terms of clinical recommendations, explanation, and summarization, Claude 3 scored similarly to ChatGPT 4.0, listing treatment recommendations that were congruent with the MDT, but it failed to cite the source of the information. CONCLUSION This study is the first analysis of Claude 3 for primary head and neck cancer cases and demonstrates superior performance in the diagnosis of HNSCC compared with ChatGPT 4.0, with similar results for therapy recommendations. This marks the advent of a newly launched advanced AI model that may be superior to ChatGPT 4.0 for the assessment of primary head and neck cancer cases and may assist in the clinical diagnostic and MDT setting.
Affiliation(s)
- Benedikt Schmidl
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany.
- Tobias Hütten
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Steffi Pigorsch
- Department of Radio Oncology, Technical University Munich, Munich, Germany
- Fabian Stögbauer
- Institute of Pathology, Technical University Munich, Munich, Germany
- Cosima C Hoch
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Timon Hussain
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Barbara Wollenberg
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
- Markus Wirth
- Department of Otolaryngology Head and Neck Surgery, Technical University Munich, Munich, Germany
15
Carl N, Schramm F, Haggenmüller S, Kather JN, Hetz MJ, Wies C, Michel MS, Wessels F, Brinker TJ. Large language model use in clinical oncology. NPJ Precis Oncol 2024; 8:240. [PMID: 39443582] [PMCID: PMC11499929] [DOI: 10.1038/s41698-024-00733-4]
Abstract
Large language models (LLMs) are undergoing intensive research for various healthcare domains. This systematic review and meta-analysis assesses current applications, methodologies, and the performance of LLMs in clinical oncology. A mixed-methods approach was used to extract, summarize, and compare methodological approaches and outcomes. This review includes 34 studies. LLMs are primarily evaluated on their ability to answer oncologic questions across various domains. The meta-analysis highlights a significant performance variance, influenced by diverse methodologies and evaluation criteria. Furthermore, differences in inherent model capabilities, prompting strategies, and oncological subdomains contribute to heterogeneity. The lack of standardized, LLM-specific reporting protocols leads to methodological disparities, which must be addressed to ensure comparability in LLM research and, ultimately, the reliable integration of LLM technologies into clinical practice.
Affiliation(s)
- Nicolas Carl
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
- Franziska Schramm
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Sarah Haggenmüller
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
- Martin J Hetz
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Christoph Wies
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Medical Faculty, Ruprecht-Karls University Heidelberg, Heidelberg, Germany
- Maurice Stephan Michel
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
- Frederik Wessels
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
- Titus J Brinker
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany.
16
Swinburne NC, Jackson CB, Pagano AM, Stember JN, Schefflein J, Marinelli B, Panyam PK, Autz A, Chopra MS, Holodny AI, Ginsberg MS. Foundational Segmentation Models and Clinical Data Mining Enable Accurate Computer Vision for Lung Cancer. J Imaging Inform Med 2024. [PMID: 39438365] [DOI: 10.1007/s10278-024-01304-6]
Abstract
This study aims to assess the effectiveness of integrating Segment Anything Model (SAM) and its variant MedSAM into the automated mining, object detection, and segmentation (MODS) methodology for developing robust lung cancer detection and segmentation models without post hoc labeling of training images. In a retrospective analysis, 10,000 chest computed tomography scans from patients with lung cancer were mined. Line measurement annotations were converted to bounding boxes, excluding boxes < 1 cm or > 7 cm. The You Only Look Once object detection architecture was used for teacher-student learning to label unannotated lesions on the training images. Subsequently, a final tumor detection model was trained and employed with SAM and MedSAM for tumor segmentation. Model performance was assessed on a manually annotated test dataset, with additional evaluations conducted on an external lung cancer dataset before and after detection model fine-tuning. Bootstrap resampling was used to calculate 95% confidence intervals. Data mining yielded 10,789 line annotations, resulting in 5403 training boxes. The baseline detection model achieved an internal F1 score of 0.847, improving to 0.860 after self-labeling. Tumor segmentation using the final detection model attained internal Dice similarity coefficients (DSCs) of 0.842 (SAM) and 0.822 (MedSAM). After fine-tuning, external validation showed an F1 of 0.832 and DSCs of 0.802 (SAM) and 0.804 (MedSAM). Integrating foundational segmentation models into the MODS framework results in high-performing lung cancer detection and segmentation models using only mined clinical data. Both SAM and MedSAM hold promise as foundational segmentation models for radiology images.
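Segmentation quality above is reported as the Dice similarity coefficient (DSC), an overlap measure equal to twice the intersection divided by the sum of the two mask areas. A minimal Python sketch follows (not the authors' pipeline; the masks are toy examples):

```python
# Minimal sketch: Dice similarity coefficient between a predicted and a
# ground-truth binary tumor mask.
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """DSC = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)

truth = np.zeros((64, 64), dtype=np.uint8)
truth[20:40, 20:40] = 1                     # ground-truth tumor mask
pred = np.zeros_like(truth)
pred[22:42, 22:42] = 1                      # model prediction, slightly shifted

print(f"DSC = {dice(pred, truth):.3f}")     # ~0.81 for this toy overlap
```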
Affiliation(s)
- Nathaniel C Swinburne
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA.
- Christopher B Jackson
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Andrew M Pagano
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Joseph N Stember
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Javin Schefflein
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Brett Marinelli
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Prashanth Kumar Panyam
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Arthur Autz
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Mohapar S Chopra
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Andrei I Holodny
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
- Michelle S Ginsberg
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Ave, New York, NY, 10065, USA
17
|
Sanduleanu S, Ersahin K, Bremm J, Talibova N, Damer T, Erdogan M, Kottlors J, Goertz L, Bruns C, Maintz D, Abdullayev N. Feasibility of GPT-3.5 versus Machine Learning for Automated Surgical Decision-Making Determination: A Multicenter Study on Suspected Appendicitis. AI 2024; 5:1942-1954. [DOI: 10.3390/ai5040096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2025] Open
Abstract
Background: Nonsurgical treatment of uncomplicated appendicitis is a reasonable option in many cases despite the sparsity of robust, easily accessible, externally validated, and multimodally informed clinical decision support systems (CDSSs). Developed by OpenAI, the Generative Pre-trained Transformer 3.5 model (GPT-3.5) may provide enhanced decision support for surgeons in less certain appendicitis cases or those posing a higher risk for (relative) operative contra-indications. Our objective was to determine whether GPT-3.5, when provided with high-throughput clinical, laboratory, and radiological text-based information, would reach clinical decisions similar to those of a machine learning model and a board-certified surgeon (reference standard) in decision-making for appendectomy versus conservative treatment. Methods: In this cohort study, we randomly selected patients presenting at the emergency department (ED) of two German hospitals (GFO, Troisdorf, and University Hospital Cologne) with right abdominal pain between October 2022 and October 2023. Statistical analysis was performed using R, version 3.6.2, on RStudio, version 2023.03.0 + 386. Overall agreement between the GPT-3.5 output and the reference standard was assessed by means of inter-observer kappa values as well as accuracy, sensitivity, specificity, and positive and negative predictive values with the “Caret” and “irr” packages. Statistical significance was defined as p < 0.05. Results: The surgeon’s decision and GPT-3.5 agreed in 102 of 113 cases, and all cases in which the surgeon decided upon conservative treatment were correctly classified by GPT-3.5. The estimated machine learning model training accuracy was 83.3% (95% CI: 74.0, 90.4), and its validation accuracy was 87.0% (95% CI: 66.4, 97.2), compared with a GPT-3.5 accuracy of 90.3% (95% CI: 83.2, 95.0); GPT-3.5 did not perform significantly better than the machine learning model (p = 0.21). Conclusions: This study, to our knowledge the first to examine the “intended use” of GPT-3.5 for surgical treatment decisions, compared GPT-3.5’s surgical decision-making with that of board-certified surgeons and a machine learning algorithm and found a high degree of agreement between board-certified surgeons and GPT-3.5 for patients presenting to the emergency department with lower abdominal pain.
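The agreement statistics described in the Methods were computed in R; the following is a rough Python analogue, not the study's code, with invented labels and "surgery" assumed as the positive class.

```python
# Rough Python analogue (assumption: "surgery" is the positive class):
# Cohen's kappa plus accuracy, sensitivity, specificity, PPV and NPV for
# GPT-3.5 recommendations against the surgeon reference standard.
from sklearn.metrics import cohen_kappa_score, confusion_matrix

surgeon = ["surgery", "conservative", "surgery", "surgery", "conservative"]
gpt35   = ["surgery", "conservative", "surgery", "conservative", "conservative"]

kappa = cohen_kappa_score(surgeon, gpt35)
tn, fp, fn, tp = confusion_matrix(surgeon, gpt35,
                                  labels=["conservative", "surgery"]).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # surgery cases GPT-3.5 also sent to surgery
specificity = tn / (tn + fp)   # conservative cases GPT-3.5 kept conservative
ppv         = tp / (tp + fp)
npv         = tn / (tn + fn)
print(kappa, accuracy, sensitivity, specificity, ppv, npv)
```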
Collapse
Affiliation(s)
| | - Koray Ersahin
- Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
| | - Johannes Bremm
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
| | - Narmin Talibova
- Department of Internal Medicine III, University Hospital, 89081 Ulm, Germany
| | - Tim Damer
- Department of General and Visceral Surgery, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 50937 Troisdorf, Germany
| | - Merve Erdogan
- Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
| | - Jonathan Kottlors
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
| | - Lukas Goertz
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
| | - Christiane Bruns
- Department of General, Visceral, Tumor and Transplantation Surgery, University Hospital of Cologne, Kerpener Straße 62, 50937 Cologne, Germany
- Center for Integrated Oncology (CIO) Aachen, Bonn, Cologne and Düsseldorf, 50937 Cologne, Germany
| | - David Maintz
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, 50937 Cologne, Germany
| | - Nuran Abdullayev
- Department of Radiology and Neuroradiology, GFO Clinics Troisdorf, Academic Hospital of the Friedrich-Wilhelms-University Bonn, 53840 Troisdorf, Germany
| |
Collapse
|
18
|
Sorin V, Klang E, Sobeh T, Konen E, Shrot S, Livne A, Weissbuch Y, Hoffmann C, Barash Y. Generative pre-trained transformer (GPT)-4 support for differential diagnosis in neuroradiology. Quant Imaging Med Surg 2024; 14:7551-7560. [PMID: 39429611 PMCID: PMC11485343 DOI: 10.21037/qims-24-200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2024] [Accepted: 07/25/2024] [Indexed: 10/22/2024]
Abstract
Background Differential diagnosis in radiology relies on the accurate identification of imaging patterns. The use of large language models (LLMs) in radiology holds promise, with many potential applications that may enhance the efficiency of radiologists' workflow. The study aimed to evaluate the efficacy of generative pre-trained transformer (GPT)-4, an LLM, in providing differential diagnoses in neuroradiology, comparing its performance with that of board-certified neuroradiologists. Methods Sixty neuroradiology reports with variable diagnoses were inserted into GPT-4, which was tasked with generating a top-3 differential diagnosis for each case. The results were compared to the true diagnoses and to the differential diagnoses provided by three blinded neuroradiologists. Diagnostic accuracy and agreement between readers were assessed. Results Of the 60 patients (mean age 47.8 years, 65% female), GPT-4 correctly included the true diagnosis in its differentials in 61.7% (37/60) of cases, while the neuroradiologists' accuracy ranged from 63.3% (38/60) to 73.3% (44/60). Agreement between GPT-4 and the neuroradiologists, and among the neuroradiologists, was fair to moderate [Cohen's kappa (kw) 0.34-0.44 and kw 0.39-0.54, respectively]. Conclusions GPT-4 shows potential as a support tool for differential diagnosis in neuroradiology, though it was outperformed by human experts. Radiologists should remain mindful of the limitations of LLMs while harnessing their potential to enhance educational and clinical work.
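The accuracy criterion used here is whether the true diagnosis appears anywhere in the model's top-3 differential; a minimal sketch of that calculation follows, with invented case data.

```python
# Minimal sketch of "true diagnosis included in the top-3 differential"
# accuracy; the diagnoses below are invented placeholders.
def top_k_inclusion_accuracy(true_dx, differentials):
    hits = sum(t.lower() in (d.lower() for d in diff)
               for t, diff in zip(true_dx, differentials))
    return hits / len(true_dx)

true_dx = ["glioblastoma", "vestibular schwannoma", "acute ischemic stroke"]
gpt4_differentials = [
    ["glioblastoma", "brain metastasis", "primary CNS lymphoma"],
    ["meningioma", "pituitary adenoma", "epidermoid cyst"],
    ["acute ischemic stroke", "venous thrombosis", "tumefactive demyelination"],
]
print(top_k_inclusion_accuracy(true_dx, gpt4_differentials))  # 2/3, about 0.67
```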
Collapse
Affiliation(s)
- Vera Sorin
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
- Sami Sagol AI Hub, ARC, Sheba Medical Center, Ramat Gan, Israel
| | - Eyal Klang
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
- Sami Sagol AI Hub, ARC, Sheba Medical Center, Ramat Gan, Israel
| | - Tamer Sobeh
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
| | - Eli Konen
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
| | - Shai Shrot
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
| | - Adva Livne
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
| | - Yulian Weissbuch
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
| | - Chen Hoffmann
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
| | - Yiftach Barash
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Ramat Gan, Israel
- The Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Ramat Gan, Israel
| |
Collapse
|
19
|
Glicksberg BS, Timsina P, Patel D, Sawant A, Vaid A, Raut G, Charney AW, Apakama D, Carr BG, Freeman R, Nadkarni GN, Klang E. Evaluating the accuracy of a state-of-the-art large language model for prediction of admissions from the emergency room. J Am Med Inform Assoc 2024; 31:1921-1928. [PMID: 38771093 PMCID: PMC11339523 DOI: 10.1093/jamia/ocae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 04/22/2024] [Indexed: 05/22/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI) and large language models (LLMs) can play a critical role in emergency room operations by augmenting decision-making about patient admission. However, no studies have evaluated LLMs on real-world data and scenarios in comparison with, or informed by, traditional supervised machine learning (ML) models. We evaluated the performance of GPT-4 for predicting patient admissions from emergency department (ED) visits. We compared its performance to traditional ML models both naively and when informed by few-shot examples and/or numerical probabilities. METHODS We conducted a retrospective study using electronic health records across 7 NYC hospitals. We trained Bio-Clinical-BERT and XGBoost (XGB) models on unstructured and structured data, respectively, and created an ensemble model reflecting ML performance. We then assessed GPT-4's capabilities in several scenarios: zero-shot, few-shot with and without retrieval-augmented generation (RAG), and with and without ML numerical probabilities. RESULTS The ensemble ML model achieved an area under the receiver operating characteristic curve (AUC) of 0.88, an area under the precision-recall curve (AUPRC) of 0.72, and an accuracy of 82.9%. Naïve GPT-4's performance (0.79 AUC, 0.48 AUPRC, and 77.5% accuracy) showed substantial improvement when given limited, relevant data to learn from (ie, RAG) and underlying ML probabilities (0.87 AUC, 0.71 AUPRC, and 83.1% accuracy). Interestingly, RAG alone boosted performance to near peak levels (0.82 AUC, 0.56 AUPRC, and 81.3% accuracy). CONCLUSIONS The naïve LLM had limited performance but showed significant improvement in predicting ED admissions when supplemented with real-world examples to learn from, particularly through RAG, and/or numerical probabilities from traditional ML models. Its peak performance, although slightly lower than that of the pure ML model, is noteworthy given its potential for providing reasoning behind predictions. Further refinement of LLMs with real-world data is necessary for successful integration as decision-support tools in care settings.
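A minimal sketch, assumed rather than taken from the study, of how the reported discrimination metrics can be computed for admission predictions; the labels and probabilities are toy values.

```python
# Minimal sketch: AUC, AUPRC and accuracy for emergency-department admission
# predictions, as reported for the ensemble ML model and GPT-4 configurations.
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # 1 = admitted from the ED
y_prob = [0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3]  # model probabilities
y_pred = [int(p >= 0.5) for p in y_prob]           # thresholded decisions

print("AUC  ", roc_auc_score(y_true, y_prob))
print("AUPRC", average_precision_score(y_true, y_prob))
print("Acc  ", accuracy_score(y_true, y_pred))
```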
Collapse
Affiliation(s)
- Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Prem Timsina
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Dhaval Patel
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Ashwin Sawant
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Akhil Vaid
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Ganesh Raut
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Alexander W Charney
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Donald Apakama
- Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Brendan G Carr
- Department of Emergency Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Robert Freeman
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Girish N Nadkarni
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| | - Eyal Klang
- Division of Data-Driven and Digital Medicine, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- The Charles Bronfman Department of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
| |
Collapse
|
20
|
Dal E, Srivastava A, Chigarira B, Hage Chehade C, Matthew Thomas V, Galarza Fortuna GM, Garg D, Ji R, Gebrael G, Agarwal N, Swami U, Li H. Effectiveness of ChatGPT 4.0 in Telemedicine-Based Management of Metastatic Prostate Carcinoma. Diagnostics (Basel) 2024; 14:1899. [PMID: 39272684 PMCID: PMC11394468 DOI: 10.3390/diagnostics14171899] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2024] [Revised: 07/29/2024] [Accepted: 08/22/2024] [Indexed: 09/15/2024] Open
Abstract
The recent rise in telemedicine, notably during the COVID-19 pandemic, highlights the potential of integrating artificial intelligence tools in healthcare. This study assessed the effectiveness of ChatGPT versus medical oncologists in the telemedicine-based management of metastatic prostate cancer. In this retrospective study, 102 patients who met inclusion criteria were analyzed to compare the competencies of ChatGPT and oncologists in telemedicine consultations. ChatGPT's role in pre-charting and determining the need for in-person consultations was evaluated. The primary outcome was the concordance between ChatGPT and oncologists in treatment decisions. Results showed a moderate concordance (Cohen's Kappa = 0.43, p < 0.001). The number of diagnoses made by both parties was not significantly different (median number of diagnoses: 5 vs. 5, p = 0.12). In conclusion, ChatGPT exhibited moderate agreement with oncologists in management via telemedicine, indicating the need for further research to explore its healthcare applications.
Collapse
Affiliation(s)
- Emre Dal
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Ayana Srivastava
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Beverly Chigarira
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Chadi Hage Chehade
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | | | | | - Diya Garg
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Richard Ji
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Georges Gebrael
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Neeraj Agarwal
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Umang Swami
- Huntsman Cancer Institute, University of Utah, Salt Lake City, UT 84112, USA
| | - Haoran Li
- Department of Medical Oncology, University of Kansas Cancer Center, Westwood, KS 66205, USA
| |
Collapse
|
21
|
Alasker A, Alsalamah S, Alshathri N, Almansour N, Alsalamah F, Alghafees M, AlKhamees M, Alsaikhan B. Performance of large language models (LLMs) in providing prostate cancer information. BMC Urol 2024; 24:177. [PMID: 39180045 PMCID: PMC11342655 DOI: 10.1186/s12894-024-01570-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2024] [Accepted: 08/16/2024] [Indexed: 08/26/2024] Open
Abstract
PURPOSE The diagnosis and management of prostate cancer (PCa), the second most common cancer in men worldwide, are highly complex. Hence, patients often seek knowledge through additional resources, including AI chatbots such as ChatGPT and Google Bard. This study aimed to evaluate the performance of LLMs in providing education on PCa. METHODS Common patient questions about PCa were collected from reliable educational websites and evaluated for accuracy, comprehensiveness, readability, and stability by two independent board-certified urologists, with a third resolving discrepancies. Accuracy was measured on a 3-point scale, comprehensiveness was measured on a 5-point Likert scale, and readability was measured using the Flesch Reading Ease (FRE) score and Flesch-Kincaid (FK) Grade Level. RESULTS A total of 52 questions on general knowledge, diagnosis, treatment, and prevention of PCa were provided to three LLMs. Although there was no significant difference in the overall accuracy of the LLMs, ChatGPT-3.5 demonstrated superiority over the other LLMs in terms of general knowledge of PCa (p = 0.018). ChatGPT-4 achieved greater overall comprehensiveness than ChatGPT-3.5 and Bard (p = 0.028). For readability, Bard generated simpler sentences with the highest FRE score (54.7, p < 0.001) and lowest FK reading level (10.2, p < 0.001). CONCLUSION ChatGPT-3.5, ChatGPT-4, and Bard generate accurate, comprehensive, and easily readable PCa material. These AI models might not replace healthcare professionals but can assist in patient education and guidance.
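The two readability measures named here follow standard published formulas; the sketch below implements them with a deliberately naive syllable counter, which is an illustrative assumption rather than the tool the authors used.

```python
# Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) from their
# standard formulas; the vowel-group syllable counter is a crude approximation.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    w, s = len(words), sentences
    fre  = 206.835 - 1.015 * (w / s) - 84.6 * (syllables / w)
    fkgl = 0.39 * (w / s) + 11.8 * (syllables / w) - 15.59
    return fre, fkgl

print(readability("Prostate cancer screening uses a PSA blood test. Ask your doctor."))
```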
Collapse
Affiliation(s)
- Ahmed Alasker
- Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs, Riyadh, Saudi Arabia
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| | - Seham Alsalamah
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia.
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia.
| | - Nada Alshathri
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| | - Nura Almansour
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| | - Faris Alsalamah
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| | - Mohammad Alghafees
- Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs, Riyadh, Saudi Arabia
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| | - Mohammad AlKhamees
- Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs, Riyadh, Saudi Arabia
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Department of Surgical Specialities, College of Medicine, Majmaah University, Majmaah, Saudi Arabia
| | - Bader Alsaikhan
- Division of Urology, Department of Surgery, Ministry of National Guard - Health Affairs, Riyadh, Saudi Arabia
- King Abdullah International Medical Research Center (KAIMRC), Riyadh, Saudi Arabia
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
| |
Collapse
|
22
|
Du X, Zhou Z, Wang Y, Chuang YW, Yang R, Zhang W, Wang X, Zhang R, Hong P, Bates DW, Zhou L. Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.08.11.24311828. [PMID: 39228726 PMCID: PMC11370524 DOI: 10.1101/2024.08.11.24311828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
Background Generative large language models (LLMs) represent a significant advancement in natural language processing, achieving state-of-the-art performance across various tasks. However, their application in clinical settings using real electronic health records (EHRs) is still rare and presents numerous challenges. Objective This study aims to systematically review the use of generative LLMs and the effectiveness of relevant techniques in patient care-related topics involving EHRs, summarize the challenges faced, and suggest future directions. Methods A Boolean search for peer-reviewed articles was conducted on May 19th, 2024, using PubMed and Web of Science to include research articles published since 2023, which was one month after the release of ChatGPT. The search results were deduplicated. Multiple reviewers, including biomedical informaticians, computer scientists, and a physician, screened the publications for eligibility and conducted data extraction. Only studies utilizing generative LLMs to analyze real EHR data were included. We summarized the use of prompt engineering, fine-tuning, multimodal EHR data, and evaluation metrics. Additionally, we identified current challenges in applying LLMs in clinical settings as reported by the included studies and proposed future directions. Results The initial search identified 6,328 unique studies, with 76 studies included after eligibility screening. Of these, 67 studies (88.2%) employed zero-shot prompting; five of them reported 100% accuracy on five specific clinical tasks. Nine studies used advanced prompting strategies; four tested these strategies experimentally, finding that prompt engineering improved performance, with one study noting a non-linear relationship between the number of examples in a prompt and performance improvement. Eight studies explored fine-tuning generative LLMs; all reported performance improvements on specific tasks, but three noted potential performance degradation after fine-tuning on certain tasks. Only two studies utilized multimodal data, which improved LLM-based decision-making and enabled accurate rare disease diagnosis and prognosis. The studies employed 55 different evaluation metrics for 22 purposes, such as correctness, completeness, and conciseness. Two studies investigated LLM bias, with one detecting no bias and the other finding that male patients received more appropriate clinical decision-making suggestions. Six studies identified hallucinations, such as fabricating patient names in structured thyroid ultrasound reports. Additional challenges included but were not limited to the impersonal tone of LLM consultations, which made patients uncomfortable, and the difficulty patients had in understanding LLM responses. Conclusion Our review indicates that few studies have employed advanced computational techniques to enhance LLM performance. The diverse evaluation metrics used highlight the need for standardization. LLMs currently cannot replace physicians due to challenges such as bias, hallucinations, and impersonal responses.
Collapse
Affiliation(s)
- Xinsong Du
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Zhengyang Zhou
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Yifei Wang
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - Ya-Wen Chuang
- Division of Nephrology, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan, 407219
- Department of Post-Baccalaureate Medicine, College of Medicine, National Chung Hsing University, Taichung, Taiwan, 402202
- School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan, 404328
| | - Richard Yang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
| | - Wenyu Zhang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Xinyi Wang
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| | - Rui Zhang
- Division of Computational Health Sciences, University of Minnesota, Minneapolis, MN 55455
| | - Pengyu Hong
- Department of Computer Science, Brandeis University, Waltham, MA 02453
| | - David W. Bates
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Health Policy and Management, Harvard T.H. Chan School of Public Health, Boston, MA 02115
| | - Li Zhou
- Division of General Internal Medicine and Primary Care, Brigham and Women’s Hospital, Boston, Massachusetts 02115
- Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts 02115
| |
Collapse
|
23
|
Benson R, Elia M, Hyams B, Chang JH, Hong JC. A Narrative Review on the Application of Large Language Models to Support Cancer Care and Research. Yearb Med Inform 2024; 33:90-98. [PMID: 40199294 PMCID: PMC12020524 DOI: 10.1055/s-0044-1800726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/10/2025] Open
Abstract
OBJECTIVES The emergence of large language models has resulted in a significant shift in informatics research and carries promise in clinical cancer care. Here we provide a narrative review of the recent use of large language models (LLMs) to support cancer care, prevention, and research. METHODS We performed a search of the Scopus database for studies on the application of bidirectional encoder representations from transformers (BERT) and generative-pretrained transformer (GPT) LLMs in cancer care published between the start of 2021 and the end of 2023. We present salient and impactful papers related to each of these themes. RESULTS Studies identified focused on aspects of clinical decision support (CDS), cancer education, and support for research activities. The use of LLMs for CDS primarily focused on aspects of treatment and screening planning, treatment response, and the management of adverse events. Studies using LLMs for cancer education typically focused on question-answering, assessing cancer myths and misconceptions, and text summarization and simplification. Finally, studies using LLMs to support research activities focused on scientific writing and idea generation, cohort identification and extraction, clinical data processing, and NLP-centric tasks. CONCLUSIONS The application of LLMs in cancer care has shown promise across a variety of diverse use cases. Future research should utilize quantitative metrics, qualitative insights, and user insights in the development and evaluation of LLM-based cancer care tools. The development of open-source LLMs for use in cancer care research and activities should also be a priority.
Collapse
Affiliation(s)
- Ryzen Benson
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
| | - Marianna Elia
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
| | - Benjamin Hyams
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- School of Medicine, University of California, San Francisco, San Francisco, California
| | - Ji Hyun Chang
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Department of Radiation Oncology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
| | - Julian C. Hong
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- UCSF UC Berkeley Joint Program in Computational Precision Health (CPH), San Francisco, CA
| |
Collapse
|
24
|
Kaboudi N, Firouzbakht S, Shahir Eftekhar M, Fayazbakhsh F, Joharivarnoosfaderani N, Ghaderi S, Dehdashti M, Mohtasham Kia Y, Afshari M, Vasaghi-Gharamaleki M, Haghani L, Moradzadeh Z, Khalaj F, Mohammadi Z, Hasanabadi Z, Shahidi R. Diagnostic Accuracy of ChatGPT for Patients' Triage; a Systematic Review and Meta-Analysis. ARCHIVES OF ACADEMIC EMERGENCY MEDICINE 2024; 12:e60. [PMID: 39290765 PMCID: PMC11407534 DOI: 10.22037/aaem.v12i1.2384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 09/19/2024]
Abstract
Introduction Artificial intelligence (AI), particularly ChatGPT developed by OpenAI, has shown the potential to improve diagnostic accuracy and efficiency in emergency department (ED) triage. This study aims to evaluate the diagnostic performance and safety of ChatGPT in prioritizing patients based on urgency in ED settings. Methods A systematic review and meta-analysis were conducted following PRISMA guidelines. Comprehensive literature searches were performed in Scopus, Web of Science, PubMed, and Embase. Studies evaluating ChatGPT's diagnostic performance in ED triage were included. Quality assessment was conducted using the QUADAS-2 tool. Pooled accuracy estimates were calculated using a random-effects model, and heterogeneity was assessed with the I² statistic. Results Fourteen studies with a total of 1,412 patients or scenarios were included. ChatGPT 4.0 demonstrated a pooled accuracy of 0.86 (95% CI: 0.64-0.98) with substantial heterogeneity (I² = 93%). ChatGPT 3.5 showed a pooled accuracy of 0.63 (95% CI: 0.43-0.81) with significant heterogeneity (I² = 84%). Funnel plots indicated potential publication bias, particularly for ChatGPT 3.5. Quality assessments revealed varying levels of risk of bias and applicability concerns. Conclusion ChatGPT, especially version 4.0, shows promise in improving ED triage accuracy. However, significant variability and potential biases highlight the need for further evaluation and enhancement.
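A simplified sketch of DerSimonian-Laird random-effects pooling of study-level accuracies with the I² heterogeneity statistic is shown below; real meta-analyses typically transform proportions (for example, logit or Freeman-Tukey) before pooling, and the study counts here are invented.

```python
# Simplified DerSimonian-Laird random-effects pooling of proportions with I².
# Untransformed proportions are used here only for brevity.
import numpy as np

events = np.array([40, 33, 25, 18])     # correct triage decisions per study
totals = np.array([50, 40, 35, 20])     # cases or scenarios per study

p  = events / totals
v  = p * (1 - p) / totals               # within-study variance
w  = 1 / v
p_fixed = np.sum(w * p) / np.sum(w)
Q  = np.sum(w * (p - p_fixed) ** 2)     # Cochran's Q
df = len(p) - 1
C  = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)           # between-study variance
w_re = 1 / (v + tau2)
p_pooled = np.sum(w_re * p) / np.sum(w_re)
I2 = max(0.0, (Q - df) / Q) * 100 if Q > 0 else 0.0
print(round(float(p_pooled), 3), round(float(I2), 1))
```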
Collapse
Affiliation(s)
- Navid Kaboudi
- Faculty of Pharmacy, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Saeedeh Firouzbakht
- Department of Pediatrics, School of Medicine, Bushehr University of Medical Sciences, Bushehr, Iran
| | | | | | | | - Salar Ghaderi
- Research Center for Evidence-based Medicine, Faculty of Medicine, Tabriz University of Medical Sciences, Tabriz, Iran
| | | | | | - Maryam Afshari
- School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran
| | | | - Leila Haghani
- Memorial Sloan Kettering Cancer Center, New York, United States
| | - Zahra Moradzadeh
- School of Medicine, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Fattaneh Khalaj
- Liver and Pancreatobiliary Diseases Research Center, Digestive Diseases Research Institute, Tehran University of Medical Sciences, Tehran, IR, Iran
| | - Zahra Mohammadi
- School of Medicine, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Zahra Hasanabadi
- School of Medicine, Qazvin University of Medical Sciences, Qazvin, Iran
| | - Ramin Shahidi
- School of Medicine, Bushehr University of Medical Sciences, Bushehr, Iran
| |
Collapse
|
25
|
Levin C, Kagan T, Rosen S, Saban M. An evaluation of the capabilities of language models and nurses in providing neonatal clinical decision support. Int J Nurs Stud 2024; 155:104771. [PMID: 38688103 DOI: 10.1016/j.ijnurstu.2024.104771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Revised: 03/26/2024] [Accepted: 04/03/2024] [Indexed: 05/02/2024]
Abstract
AIM To assess the clinical reasoning capabilities of two large language models, ChatGPT-4 and Claude-2.0, compared to those of neonatal nurses during neonatal care scenarios. DESIGN A cross-sectional study with a comparative evaluation using a survey instrument that included six neonatal intensive care unit clinical scenarios. PARTICIPANTS 32 neonatal intensive care nurses with 5-10 years of experience working in the neonatal intensive care units of three medical centers. METHODS Participants responded to 6 written clinical scenarios. Simultaneously, we asked ChatGPT-4 and Claude-2.0 to provide initial assessments and treatment recommendations for the same scenarios. The responses from ChatGPT-4 and Claude-2.0 were then scored by certified neonatal nurse practitioners for accuracy, completeness, and response time. RESULTS Both models demonstrated capabilities in clinical reasoning for neonatal care, with Claude-2.0 significantly outperforming ChatGPT-4 in clinical accuracy and speed. However, limitations were identified across the cases in diagnostic precision, treatment specificity, and response lag. CONCLUSIONS While showing promise, current limitations reinforce the need for deep refinement before ChatGPT-4 and Claude-2.0 can be considered for integration into clinical practice. Additional validation of these tools is important to safely leverage this Artificial Intelligence technology for enhancing clinical decision-making. IMPACT The study provides an understanding of the reasoning accuracy of new Artificial Intelligence models in neonatal clinical care. The current accuracy gaps of ChatGPT-4 and Claude-2.0 need to be addressed prior to clinical usage.
Collapse
Affiliation(s)
- Chedva Levin
- Faculty of School of Life and Health Sciences, Nursing Department, The Jerusalem College of Technology-Lev Academic Center, Jerusalem, Israel; The Department of Vascular Surgery, The Chaim Sheba Medical Center, Tel Hashomer, Ramat Gan, Tel Aviv, Israel
| | | | - Shani Rosen
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel
| | - Mor Saban
- Department of Nursing, School of Health Professions, Faculty of Medical and Health Sciences, Tel Aviv University, Israel.
| |
Collapse
|
26
|
Varady NH, Lu AZ, Mazzucco M, Dines JS, Altchek DW, Williams RJ, Kunze KN. Understanding How ChatGPT May Become a Clinical Administrative Tool Through an Investigation on the Ability to Answer Common Patient Questions Concerning Ulnar Collateral Ligament Injuries. Orthop J Sports Med 2024; 12:23259671241257516. [PMID: 39139744 PMCID: PMC11320692 DOI: 10.1177/23259671241257516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 01/10/2024] [Indexed: 08/15/2024] Open
Abstract
Background The consumer availability and automated response functions of Chat Generative Pre-trained Transformer 4 (ChatGPT-4), a large language model, poise this application to be utilized for patient health queries and to serve as an adjunct to minimize administrative and clinical burden. Purpose To evaluate the ability of ChatGPT-4 to respond to patient inquiries concerning ulnar collateral ligament (UCL) injuries and compare these results with the performance of Google. Study Design Cross-sectional study. Methods Google Web Search was used as a benchmark, as it is the most widely used search engine worldwide and the only search engine that generates frequently asked questions (FAQs) when prompted with a query, allowing comparisons through a systematic approach. The query "ulnar collateral ligament reconstruction" was entered into Google, and the top 10 FAQs, answers, and their sources were recorded. ChatGPT-4 was prompted to perform a Google search of FAQs with the same query and to record the sources of answers for comparison. This process was again replicated to obtain 10 new questions requiring numeric instead of open-ended responses. Finally, responses were graded independently for clinical accuracy (grade 0 = inaccurate, grade 1 = somewhat accurate, grade 2 = accurate) by 2 fellowship-trained sports medicine surgeons (D.W.A., J.S.D.) blinded to the search engine and answer source. Results ChatGPT-4 used a greater proportion of academic sources than Google to provide answers to the top 10 FAQs, although this was not statistically significant (90% vs 50%; P = .14). In terms of question overlap, 40% of the most common questions on Google and ChatGPT-4 were the same. When comparing FAQs with numeric responses, 20% of answers were completely overlapping, 30% demonstrated partial overlap, and the remaining 50% did not demonstrate any overlap. All sources used by ChatGPT-4 to answer these FAQs were academic, while only 20% of sources used by Google were academic (P = .0007). The remaining Google sources included social media (40%), medical practices (20%), single-surgeon websites (10%), and commercial websites (10%). The mean (± standard deviation) accuracy for answers given by ChatGPT-4 was significantly greater compared with Google for the top 10 FAQs (1.9 ± 0.2 vs 1.2 ± 0.6; P = .001) and top 10 questions with numeric answers (1.8 ± 0.4 vs 1 ± 0.8; P = .013). Conclusion ChatGPT-4 is capable of providing responses with clinically relevant content concerning UCL injuries and reconstruction. ChatGPT-4 utilized a greater proportion of academic websites to provide responses to FAQs representative of patient inquiries compared with Google Web Search and provided significantly more accurate answers. Moving forward, ChatGPT has the potential to be used as a clinical adjunct when answering queries about UCL injuries and reconstruction, but further validation is warranted before integrated or autonomous use in clinical settings.
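The abstract does not name the statistical tests behind the reported p values, so the sketch below pairs two plausible choices, Fisher's exact test for the source-type proportions and Welch's t test for the mean accuracy grades, with invented data.

```python
# Illustrative comparisons only; the tests and data are assumptions, not the
# study's reported methodology.
from scipy.stats import fisher_exact, ttest_ind

# Academic vs non-academic sources among 10 answers per search tool
table = [[10, 0],   # ChatGPT-4
         [2, 8]]    # Google Web Search
print("Fisher exact p =", fisher_exact(table)[1])

# Accuracy grades (0 = inaccurate, 1 = somewhat accurate, 2 = accurate)
chatgpt_grades = [2, 2, 2, 2, 2, 2, 2, 2, 2, 1]
google_grades  = [2, 1, 1, 0, 1, 2, 1, 1, 1, 2]
print("Welch t-test p =", ttest_ind(chatgpt_grades, google_grades, equal_var=False)[1])
```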
Collapse
Affiliation(s)
- Nathan H. Varady
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
| | - Amy Z. Lu
- Weill Cornell Medical College, New York, New York, USA
| | | | - Joshua S. Dines
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
| | | | - Riley J. Williams
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
| | - Kyle N. Kunze
- Department of Orthopaedic Surgery, Hospital for Special Surgery, New York, New York, USA
| |
Collapse
|
27
|
Amacher SA, Arpagaus A, Sahmer C, Becker C, Gross S, Urben T, Tisljar K, Sutter R, Marsch S, Hunziker S. Prediction of outcomes after cardiac arrest by a generative artificial intelligence model. Resusc Plus 2024; 18:100587. [PMID: 38433764 PMCID: PMC10906512 DOI: 10.1016/j.resplu.2024.100587] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 02/01/2024] [Accepted: 02/11/2024] [Indexed: 03/05/2024] Open
Abstract
Aims To investigate the prognostic accuracy of a non-medical generative artificial intelligence model (Chat Generative Pre-Trained Transformer 4 - ChatGPT-4) in predicting death and poor neurological outcome at hospital discharge based on real-life data from cardiac arrest patients. Methods This prospective cohort study investigates the prognostic performance of ChatGPT-4 in predicting outcomes at hospital discharge of adult cardiac arrest patients admitted to intensive care at a large Swiss tertiary academic medical center (COMMUNICATE/PROPHETIC cohort study). We prompted ChatGPT-4 with sixteen prognostic parameters derived from established post-cardiac arrest scores for each patient. We compared the prognostic performance of ChatGPT-4 with that of three cardiac arrest scores (Out-of-Hospital Cardiac Arrest [OHCA], Cardiac Arrest Hospital Prognosis [CAHP], and PROgnostication using LOGistic regression model for Unselected adult cardiac arrest patients in the Early stages [PROLOGUE score]) in terms of the area under the curve (AUC), sensitivity, specificity, positive and negative predictive values, and likelihood ratios for in-hospital mortality and poor neurological outcome. Results Mortality at hospital discharge was 43% (n = 309/713), and 54% of patients (n = 387/713) had a poor neurological outcome. ChatGPT-4 showed good discrimination regarding in-hospital mortality with an AUC of 0.85, similar to the OHCA, CAHP, and PROLOGUE scores (AUCs of 0.82, 0.83, and 0.84, respectively). For poor neurological outcome, ChatGPT-4 showed a prediction similar to the post-cardiac arrest scores (AUC 0.83). Conclusions ChatGPT-4 showed performance similar to that of validated post-cardiac arrest scores in predicting mortality and poor neurological outcome. However, more research on illogical answers is needed before an LLM can be incorporated into multimodal outcome prognostication after cardiac arrest.
Collapse
Affiliation(s)
- Simon A. Amacher
- Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Emergency Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
| | - Armon Arpagaus
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
| | - Christian Sahmer
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
| | - Christoph Becker
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Emergency Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
| | - Sebastian Gross
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
| | - Tabita Urben
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
| | - Kai Tisljar
- Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
| | - Raoul Sutter
- Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Medical Faculty, University of Basel, Basel, Switzerland
- Division of Neurophysiology, Department of Neurology, University Hospital Basel, Basel, Switzerland
| | - Stephan Marsch
- Intensive Care Medicine, Department of Acute Medical Care, University Hospital Basel, Basel, Switzerland
- Medical Faculty, University of Basel, Basel, Switzerland
| | - Sabina Hunziker
- Medical Communication and Psychosomatic Medicine, University Hospital Basel, Basel, Switzerland
- Medical Faculty, University of Basel, Basel, Switzerland
- Post-Intensive Care Clinic, University Hospital Basel, Basel, Switzerland
| |
Collapse
|
28
|
Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and Surgical Applications of Large Language Models: A Systematic Review. J Clin Med 2024; 13:3041. [PMID: 38892752 PMCID: PMC11172607 DOI: 10.3390/jcm13113041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2024] [Revised: 05/15/2024] [Accepted: 05/19/2024] [Indexed: 06/21/2024] Open
Abstract
Background: Large language models (LLMs) represent a recent advancement in artificial intelligence with medical applications across various healthcare domains. The objective of this review is to highlight how LLMs can be utilized by clinicians and surgeons in their everyday practice. Methods: A systematic review was conducted following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. Six databases were searched to identify relevant articles. Eligibility criteria emphasized articles focused primarily on clinical and surgical applications of LLMs. Results: The literature search yielded 333 results, with 34 meeting eligibility criteria. All articles were from 2023. There were 14 original research articles, four letters, one interview, and 15 review articles. These articles covered a wide variety of medical specialties, including various surgical subspecialties. Conclusions: LLMs have the potential to enhance healthcare delivery. In clinical settings, LLMs can assist in diagnosis, treatment guidance, patient triage, physician knowledge augmentation, and administrative tasks. In surgical settings, LLMs can assist surgeons with documentation, surgical planning, and intraoperative guidance. However, addressing their limitations and concerns, particularly those related to accuracy and biases, is crucial. LLMs should be viewed as tools to complement, not replace, the expertise of healthcare professionals.
Collapse
Affiliation(s)
| | - Sahar Borna
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | | | - Syed Ali Haider
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
| | - Clifton R. Haider
- Department of Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN 55905, USA
| | - Antonio Jorge Forte
- Division of Plastic Surgery, Mayo Clinic, Jacksonville, FL 32224, USA
- Center for Digital Health, Mayo Clinic, Rochester, MN 55905, USA
| |
Collapse
|
29
|
Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, Ribeira R, Rose C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med Inform 2024; 12:e53787. [PMID: 38728687 PMCID: PMC11127144 DOI: 10.2196/53787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2023] [Revised: 12/20/2023] [Accepted: 04/05/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. OBJECTIVE Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. METHODS Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. RESULTS A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. CONCLUSIONS LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
Collapse
Affiliation(s)
- Carl Preiksaitis
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Nicholas Ashenburg
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Gabrielle Bunney
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Andrew Chu
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Rana Kabeer
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Fran Riley
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Ryan Ribeira
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| | - Christian Rose
- Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
| |
Collapse
|
30
|
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436 DOI: 10.1111/imj.16393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/18/2024]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLMs), brought to wide public prominence in the form of ChatGPT, are text-based foundational models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are devoid of proper scrutiny. This narrative review describes the different types of LLMs, their emerging applications, potential limitations and biases, and their likely future translation into clinical practice.
Collapse
Affiliation(s)
- Ian A Scott
- Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
| | - Guido Zuccon
- School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
| |
Collapse
|
31
|
Frosolini A, Catarzi L, Benedetti S, Latini L, Chisci G, Franz L, Gennaro P, Gabriele G. The Role of Large Language Models (LLMs) in Providing Triage for Maxillofacial Trauma Cases: A Preliminary Study. Diagnostics (Basel) 2024; 14:839. [PMID: 38667484 PMCID: PMC11048758 DOI: 10.3390/diagnostics14080839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/29/2024] [Revised: 04/10/2024] [Accepted: 04/16/2024] [Indexed: 04/28/2024] Open
Abstract
BACKGROUND In the evolving field of maxillofacial surgery, integrating advanced technologies like Large Language Models (LLMs) into medical practices, especially for trauma triage, presents a promising yet largely unexplored potential. This study aimed to evaluate the feasibility of using LLMs for triaging complex maxillofacial trauma cases by comparing their performance against the expertise of a tertiary referral center. METHODS Utilizing a comprehensive review of patient records in a tertiary referral center over a year-long period, standardized prompts detailing patient demographics, injury characteristics, and medical histories were created. These prompts were used to assess the triage suggestions of ChatGPT 4.0 and Google GEMINI against the center's recommendations, supplemented by evaluating the AI's performance using the QAMAI and AIPI questionnaires. RESULTS The results in 10 cases of major maxillofacial trauma indicated moderate agreement rates between LLM recommendations and the referral center, with some variances in the suggestion of appropriate examinations (70% ChatGPT and 50% GEMINI) and treatment plans (60% ChatGPT and 45% GEMINI). Notably, the study found no statistically significant differences in several areas of the questionnaires, except in the diagnosis accuracy (GEMINI: 3.30, ChatGPT: 2.30; p = 0.032) and relevance of the recommendations (GEMINI: 2.90, ChatGPT: 3.50; p = 0.021). A Spearman correlation analysis highlighted significant correlations within the two questionnaires, specifically between the QAMAI total score and AIPI treatment scores (rho = 0.767, p = 0.010). CONCLUSIONS This exploratory investigation underscores the potential of LLMs in enhancing clinical decision making for maxillofacial trauma cases, indicating a need for further research to refine their application in healthcare settings.
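A minimal sketch of the Spearman correlation reported between the QAMAI total score and the AIPI treatment score (rho = 0.767) follows; the paired scores are invented placeholders.

```python
# Spearman rank correlation between two questionnaire scores (toy data).
from scipy.stats import spearmanr

qamai_total    = [18, 22, 15, 25, 20, 17, 23, 19, 21, 24]
aipi_treatment = [3, 4, 2, 5, 4, 3, 4, 3, 4, 5]
rho, p = spearmanr(qamai_total, aipi_treatment)
print(round(rho, 3), round(p, 4))
```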
Collapse
Affiliation(s)
- Andrea Frosolini
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Lisa Catarzi
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Simone Benedetti
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Linda Latini
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Glauco Chisci
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Leonardo Franz
- Phoniatrics and Audiology Unit, Department of Neuroscience DNS, University of Padova, 35122 Treviso, Italy;
- Artificial Intelligence in Medicine and Innovation in Clinical Research and Methodology (PhD Program), Department of Clinical and Experimental Sciences, University of Brescia, 25121 Brescia, Italy
| | - Paolo Gennaro
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| | - Guido Gabriele
- Maxillofacial Surgery Unit, Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy; (L.C.); (S.B.); (L.L.); (G.C.); (P.G.); (G.G.)
| |
Collapse
|
32
|
Paslı S, Şahin AS, Beşer MF, Topçuoğlu H, Yadigaroğlu M, İmamoğlu M. Assessing the precision of artificial intelligence in ED triage decisions: Insights from a study with ChatGPT. Am J Emerg Med 2024; 78:170-175. [PMID: 38295466 DOI: 10.1016/j.ajem.2024.01.037] [Citation(s) in RCA: 15] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Revised: 12/25/2023] [Accepted: 01/21/2024] [Indexed: 02/02/2024] Open
Abstract
BACKGROUND The rise in emergency department presentations globally poses challenges for efficient patient management, and various strategies aim to expedite patient flow. Artificial intelligence's (AI) consistent performance and rapid data interpretation extend its healthcare applications, especially in emergencies. The introduction of a robust AI tool like ChatGPT, based on GPT-4 developed by OpenAI, can benefit patients and healthcare professionals by improving the speed and accuracy of resource allocation. This study examines ChatGPT's capability to predict triage outcomes based on local emergency department rules. METHODS This single-center prospective observational study included all patients who presented to the emergency department with any symptoms and agreed to participate. The study was conducted on three non-consecutive days for a total of 72 h. Patients' chief complaints, vital parameters, medical history and the area to which they were directed by the triage team were recorded. Concurrently, an emergency medicine physician entered the same data into GPT-4, which had been primed with the local triage rules, and the triage decisions made by GPT-4 were recorded. In parallel, an emergency medicine specialist determined where each patient should be directed based on the collected data, and this decision was considered the gold standard. The reliability of the triage team's and GPT-4's assignments of patients to specific areas was evaluated against the gold standard using Cohen's kappa test, and the accuracy of the triage process performed by the triage team and GPT-4 was assessed by receiver operating characteristic (ROC) analysis. A value of p < 0.05 was considered statistically significant. RESULTS The study included 758 patients, of whom 416 (54.9%) were male and 342 (45.1%) were female. For the primary endpoints - the agreement between the triage team's decisions, GPT-4's decisions, and the gold standard - we observed almost perfect agreement both between the triage team and the gold standard and between GPT-4 and the gold standard (Cohen's kappa 0.893 and 0.899, respectively; p < 0.001 for each). CONCLUSION Our findings suggest GPT-4 possesses outstanding predictive skills in triaging patients in an emergency setting and can serve as an effective tool to support the triage process.
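As a concrete illustration of the agreement statistic reported above, here is a minimal sketch (not taken from the study); the triage areas and assignments are hypothetical, and scikit-learn's cohen_kappa_score is used for the computation.

```python
# Minimal sketch, not from the study: Cohen's kappa between GPT-4 triage
# assignments and a gold-standard emergency physician. Labels are hypothetical.
from sklearn.metrics import cohen_kappa_score

gold_standard = ["resus", "urgent", "minor", "urgent", "minor", "resus"]  # hypothetical
gpt4_triage   = ["resus", "urgent", "minor", "minor",  "minor", "resus"]  # hypothetical

kappa = cohen_kappa_score(gold_standard, gpt4_triage)
print(f"Cohen's kappa (GPT-4 vs gold standard): {kappa:.3f}")
# Values above roughly 0.81 are conventionally read as "almost perfect"
# agreement, the range the study reports (0.893 and 0.899).
```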
Collapse
Affiliation(s)
- Sinan Paslı
- Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey.
| | | | | | - Hazal Topçuoğlu
- Siirt Education & Research Hospital, Department of Emergency Medicine, Siirt, Turkey
| | - Metin Yadigaroğlu
- Samsun University, Faculty of Medicine, Department of Emergency Medicine, Samsun, Turkey
| | - Melih İmamoğlu
- Karadeniz Technical University, Faculty of Medicine, Department of Emergency Medicine, Trabzon, Turkey
| |
Collapse
|
33
|
Kerr W, Acosta S, Kwan P, Worrell G, Mikati MA. Artificial Intelligence: Fundamentals and Breakthrough Applications in Epilepsy. Epilepsy Curr 2024:15357597241238526. [PMID: 39554271 PMCID: PMC11562289 DOI: 10.1177/15357597241238526] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2024] Open
Abstract
Artificial intelligence, machine learning, and deep learning are increasingly being used across medical fields, including epilepsy research and clinical care, and have already yielded cutting-edge applications in both the clinical and research arenas of epileptology. Because there is a need to disseminate knowledge about these approaches - how to use them, their advantages, and their potential limitations - the goal of the 2023 Merritt-Putnam Symposium, and of this synopsis review of that symposium, is to present the background and state of the art and to draw conclusions on current and future applications of these approaches, as follows: (1) an explanation of the fundamental principles of artificial intelligence, machine learning, and deep learning, presented in the first section of this review by Dr Wesley Kerr; (2) insights into their cutting-edge applications in screening for medications in neural organoids in general, and for epilepsy in particular, presented by Dr Sandra Acosta; (3) insights into how artificial intelligence approaches can predict clinical response to medication treatments, presented by Dr Patrick Kwan; and (4) insights into the expanding applications to the detection and analysis of EEG signals in intensive care, epilepsy monitoring unit, and intracranial monitoring settings, presented by Dr Gregory Worrell. The expectation is that, in the coming decade and beyond, the increasing use of these approaches will transform epilepsy research and care and supplement, but not replace, the diligent work of epilepsy clinicians and researchers.
Collapse
Affiliation(s)
- Wesley Kerr
- Department of Neurology, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
- Department of Biomedical Engineering, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
| | - Sandra Acosta
- Department of Pathology and Experimental Therapeutics, Institute of Neurosciences, University of Barcelona, Barcelona, Catalonia, Spain
- Program of Neuroscience, Institute of Biomedical Research of Bellvitge (IDIBELL), L’Hospitalet de Llobregat, Spain
| | - Patrick Kwan
- Department of Neuroscience, Monash Institute of Medical Engineering at Monash University, and Epilepsy Unit of Alfred Hospital, Melbourne, Victoria, Australia
| | - Gregory Worrell
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
- Department Physiology and Biomedical Engineering, Mayo Clinic, Rochester, MN, USA
| | - Mohamad A. Mikati
- Department of Pediatrics, Duke University, Durham, NC, USA
- Department of Neurobiology, Duke University, Durham, NC, USA
| |
Collapse
|
34
|
Mu Y, He D. The Potential Applications and Challenges of ChatGPT in the Medical Field. Int J Gen Med 2024; 17:817-826. [PMID: 38476626 PMCID: PMC10929156 DOI: 10.2147/ijgm.s456659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Accepted: 02/26/2024] [Indexed: 03/14/2024] Open
Abstract
ChatGPT, an AI-driven conversational large language model (LLM), has garnered significant scholarly attention since its inception, owing to its manifold applications in the realm of medical science. This study primarily examines the merits, limitations, anticipated developments, and practical applications of ChatGPT in clinical practice, healthcare, medical education, and medical research. It underscores the necessity for further research and development to enhance its performance and deployment. Moreover, future research avenues encompass ongoing enhancements and standardization of ChatGPT, mitigating its limitations, and exploring its integration and applicability in translational and personalized medicine. Reflecting the narrative nature of this review, a focused literature search was performed to identify relevant publications on ChatGPT's use in medicine. This process was aimed at gathering a broad spectrum of insights to provide a comprehensive overview of the current state and future prospects of ChatGPT in the medical domain. The objective is to aid healthcare professionals in understanding the groundbreaking advancements associated with the latest artificial intelligence tools, while also acknowledging the opportunities and challenges presented by ChatGPT.
Collapse
Affiliation(s)
- Yonglin Mu
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| | - Dawei He
- Department of Urology, Children’s Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
| |
Collapse
|
35
|
Wang Z, Zhang Z, Traverso A, Dekker A, Qian L, Sun P. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14:1602-1615. [PMID: 38415150 PMCID: PMC10895085 DOI: 10.21037/qims-23-1180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2023] [Accepted: 11/30/2023] [Indexed: 02/29/2024]
Abstract
Background As artificial intelligence (AI) becomes increasingly prevalent in the medical field, the effectiveness of AI-generated medical reports in disease diagnosis remains to be evaluated. ChatGPT is a large language model developed by OpenAI with a notable capacity for text abstraction and comprehension. This study aimed to explore the capabilities, limitations, and potential of Generative Pre-trained Transformer (GPT)-4 in analyzing thyroid cancer ultrasound reports, providing diagnoses, and recommending treatment plans. Methods Using 109 diverse thyroid cancer cases, we evaluated GPT-4's performance by comparing its generated reports to those from doctors with various levels of experience. We also conducted a Turing Test and a consistency analysis. To enhance the interpretability of the model, we applied the Chain of Thought (CoT) method to deconstruct the decision-making chain of the GPT model. Results GPT-4 demonstrated proficiency in report structuring, professional terminology, and clarity of expression, but showed limitations in diagnostic accuracy. In addition, our consistency analysis highlighted certain discrepancies in the AI's performance. The CoT method effectively enhanced the interpretability of the AI's decision-making process. Conclusions GPT-4 exhibits potential as a supplementary tool in healthcare, especially for generating thyroid gland diagnostic reports. Our proposed online platform, "ThyroAIGuide", alongside the CoT method, underscores the potential of AI to augment diagnostic processes, elevate healthcare accessibility, and advance patient education. However, the journey towards fully integrating AI into healthcare is ongoing, requiring continuous research, development, and careful monitoring by medical professionals to ensure patient safety and quality of care.
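To make the chain-of-thought idea mentioned above concrete, the following is a minimal sketch of a CoT-style prompt for a thyroid ultrasound report; the report text, prompt wording, and use of the OpenAI chat completions API are assumptions for illustration and are not taken from the study or its "ThyroAIGuide" platform.

```python
# Minimal sketch, not from the study: a chain-of-thought style prompt for a
# thyroid ultrasound report. Report text and prompt wording are hypothetical.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

report = ("Hypothetical report: 1.2 cm hypoechoic nodule in the right lobe, "
          "irregular margins, scattered microcalcifications.")
prompt = (
    "You are assisting with thyroid ultrasound interpretation.\n"
    f"Report: {report}\n"
    "Reason step by step: (1) list the suspicious sonographic features, "
    "(2) map them to an overall risk level, and "
    "(3) state a working diagnosis and a suggested next step."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```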
Collapse
Affiliation(s)
- Zhixiang Wang
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Zhen Zhang
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Alberto Traverso
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Andre Dekker
- Department of Radiation Oncology (Maastro), GROW-School for Oncology, Maastricht University Medical Centre+, Maastricht, The Netherlands
| | - Linxue Qian
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| | - Pengfei Sun
- Department of Ultrasound, Beijing Friendship Hospital, Capital Medical University, Beijing, China
| |
Collapse
|
36
|
Ray PP. ChatGPT's competence in addressing urolithiasis: myth or reality? Int Urol Nephrol 2024; 56:149-150. [PMID: 37726510 DOI: 10.1007/s11255-023-03802-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2023] [Accepted: 09/09/2023] [Indexed: 09/21/2023]
|
37
|
dos Santos ML, Victória VNG. Critical evaluation of applications of artificial intelligence based linguistic models in Occupational Health. Rev Bras Med Trab 2024; 22:e20231241. [PMID: 39165532 PMCID: PMC11333049 DOI: 10.47626/1679-4435-2023-1241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2023] [Accepted: 10/30/2023] [Indexed: 08/22/2024] Open
Abstract
This article explores the impact and potential applications of large language models in Occupational Medicine. Large language models can support medical decision-making, patient screening, the summarization and creation of technical, scientific, and legal documents, training and education for doctors and occupational health teams, and patient education, potentially leading to lower costs, reduced time expenditure, and a lower incidence of human error. Despite promising results and a wide range of applications, large language models also have significant limitations in terms of their accuracy, the risk of generating false information, and incorrect recommendations. Various ethical aspects that have not been well elucidated by the medical and academic communities should also be considered, and the lack of regulation by government entities can create areas of legal uncertainty regarding their use in Occupational Medicine and in the legal environment. Significant improvements in these models can be expected in the coming years, and further studies on the applications of large language models in Occupational Medicine should be encouraged.
Collapse
Affiliation(s)
- Mateus Lins dos Santos
- 6ª Vara, Justiça Federal em Sergipe, Itabaiana, SE, Brazil
- 9ª Vara, Justiça Federal em Sergipe, Propriá, SE, Brazil
| | | |
Collapse
|
38
|
Shojaei P, Khosravi M, Jafari Y, Mahmoudi AH, Hassanipourmahani H. ChatGPT utilization within the building blocks of the healthcare services: A mixed-methods study. Digit Health 2024; 10:20552076241297059. [PMID: 39559384 PMCID: PMC11571260 DOI: 10.1177/20552076241297059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024] Open
Abstract
Introduction ChatGPT, as an AI tool, has been introduced into healthcare for various purposes. The objective of this study was to investigate the principal benefits of ChatGPT utilization in healthcare services and to identify potential domains for its expansion within the building blocks of the healthcare industry. Methods A comprehensive three-phase study was conducted employing mixed methods. The initial phase comprised a systematic review and thematic analysis of the data. In the subsequent phases, a questionnaire developed from the findings of the first phase was distributed to a sample of eight experts in order to prioritize the benefits of ChatGPT and its potential expansion domains across the healthcare building blocks, using gray SWARA (Stepwise Weight Assessment Ratio Analysis) and gray MABAC (Multi-Attributive Border Approximation Area Comparison), respectively. Results The systematic review yielded 74 studies, and a thematic analysis of their data identified 11 unique themes. In the second phase, employing the gray SWARA method, clinical decision-making (weight: 0.135), medical diagnosis (weight: 0.098), medical procedures (weight: 0.070), and patient-centered care (weight: 0.053) emerged as the most significant benefits of ChatGPT in the healthcare sector. Subsequently, ChatGPT was found to be most useful in the information and infrastructure block and the information and communication technologies block. Conclusion The study concluded that, despite the significant benefits of ChatGPT in the clinical domains of healthcare, it shows more pronounced potential for growth within the informational building blocks of the healthcare industry than within the domains of intervention and clinical services.
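As background to the weighting figures quoted above, the following is a minimal sketch of the classic (crisp) SWARA procedure that underlies the gray SWARA variant named in the abstract; the criteria ranking and comparative-importance values are hypothetical and not taken from the study.

```python
# Minimal sketch, not from the study: classic (crisp) SWARA weighting.
# Criteria are ranked from most to least important; s[j] is the expert's
# comparative importance of criterion j relative to the one ranked above it.
def swara_weights(criteria, s):
    q = []
    for j, s_j in enumerate(s):
        if j == 0:
            q.append(1.0)                  # most important criterion: q_1 = 1
        else:
            q.append(q[-1] / (s_j + 1.0))  # k_j = s_j + 1; q_j = q_{j-1} / k_j
    total = sum(q)
    return {c: q_j / total for c, q_j in zip(criteria, q)}

# Hypothetical ranking and comparative-importance values
weights = swara_weights(
    ["clinical decision-making", "medical diagnosis",
     "medical procedures", "patient-centered care"],
    [0.0, 0.30, 0.25, 0.20],
)
for criterion, w in weights.items():
    print(f"{criterion}: {w:.3f}")
```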
Collapse
Affiliation(s)
- Payam Shojaei
- Department of Management, Shiraz University, Shiraz, Iran
| | - Mohsen Khosravi
- Department of Healthcare Management, School of Management and Medical Informatics, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Yalda Jafari
- Department of Management, Shiraz University, Shiraz, Iran
| | - Amir Hossein Mahmoudi
- Department of Operations Management & Decision Sciences, Faculty of Management, University of Tehran, Tehran, Iran
| | - Hadis Hassanipourmahani
- Department of Information Technology Management, Faculty of Management, University of Tehran, Tehran, Iran
| |
Collapse
|
39
|
Alotaibi SS, Rehman A, Hasnain M. Revolutionizing ocular cancer management: a narrative review on exploring the potential role of ChatGPT. Front Public Health 2023; 11:1338215. [PMID: 38192545 PMCID: PMC10773849 DOI: 10.3389/fpubh.2023.1338215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 12/04/2023] [Indexed: 01/10/2024] Open
Abstract
This paper pioneers the exploration of ocular cancer and its management with the help of artificial intelligence (AI) technology. Existing literature reports a significant increase in new eye cancer cases in 2023, reflecting a rising incidence rate. Extensive research was conducted using online databases such as PubMed, ACM Digital Library, ScienceDirect, and Springer, and the review follows the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. Of the 62 studies collected, only 20 documents met the inclusion criteria. The review identifies seven ocular cancer types and highlights important challenges associated with ocular cancer, including limited awareness about eye cancer, restricted healthcare access, financial barriers, and insufficient infrastructure support. Financial barriers are among the most widely examined ocular cancer challenges in the literature. The potential role and limitations of ChatGPT are discussed, emphasizing its usefulness in providing general information to physicians while noting its inability to deliver up-to-date information. The paper concludes by presenting potential future applications of ChatGPT to advance research on ocular cancer globally.
Collapse
Affiliation(s)
- Saud S. Alotaibi
- Information Systems Department, Umm Al-Qura University, Makkah, Saudi Arabia
| | - Amna Rehman
- Department of Computer Science, Lahore Leads University, Lahore, Pakistan
| | - Muhammad Hasnain
- Department of Computer Science, Lahore Leads University, Lahore, Pakistan
| |
Collapse
|
40
|
Wang X, Liu XQ. Potential and limitations of ChatGPT and generative artificial intelligence in medical safety education. World J Clin Cases 2023; 11:7935-7939. [DOI: 10.12998/wjcc.v11.i32.7935] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 08/07/2023] [Revised: 09/21/2023] [Accepted: 11/02/2023] [Indexed: 11/16/2023] Open
Abstract
The primary objectives of medical safety education are to provide the public with essential knowledge about medications and to foster a scientific approach to drug usage. The era of using artificial intelligence to revolutionize medical safety education has already dawned, and ChatGPT and other generative artificial intelligence models have immense potential in this domain. Notably, they offer a wealth of knowledge, anonymity, continuous availability, and personalized services. However, the practical implementation of generative artificial intelligence models such as ChatGPT in medical safety education still faces several challenges, including concerns about the accuracy of information, legal responsibilities, and ethical obligations. Moving forward, it is crucial to intelligently upgrade ChatGPT by leveraging the strengths of existing medical practices. This task involves further integrating the model with real-life scenarios and proactively addressing ethical and security issues with the ultimate goal of providing the public with comprehensive, convenient, efficient, and personalized medical services.
Collapse
Affiliation(s)
- Xin Wang
- School of Education, Tianjin University, Tianjin 300350, China
| | - Xin-Qiao Liu
- School of Education, Tianjin University, Tianjin 300350, China
| |
Collapse
|