1
Rokhshad R, Khoury ZH, Mohammad-Rahimi H, Motie P, Price JB, Tavares T, Jessri M, Bavarian R, Sciubba JJ, Sultan AS. Efficacy and empathy of AI chatbots in answering frequently asked questions on oral oncology. Oral Surg Oral Med Oral Pathol Oral Radiol 2025; 139:719-728. [PMID: 39843286 DOI: 10.1016/j.oooo.2024.12.028]
Abstract
OBJECTIVES Artificial intelligence chatbots have demonstrated feasibility and efficacy in improving health outcomes. In this study, responses from 5 publicly available AI chatbots (Bing, GPT-3.5, GPT-4, Google Bard, and Claude) to frequently asked questions related to oral cancer were evaluated. STUDY DESIGN Relevant patient-related frequently asked questions about oral cancer were obtained from two main sources: public health websites and social media platforms. From these sources, 20 oral cancer-related questions were selected. Four board-certified specialists in oral medicine/oral and maxillofacial pathology assessed the answers using a modified version of the global quality score on a 5-point Likert scale. Additionally, readability was measured using the Flesch-Kincaid Grade Level and Flesch Reading Ease scores. Responses were also assessed for empathy using a validated 5-point scale. RESULTS Specialists ranked GPT-4 highest, with a total score of 17.3 ± 1.5, while Bing received the lowest at 14.9 ± 2.2. Bard had the highest Flesch Reading Ease score (62 ± 7), while ChatGPT-3.5 and Claude received the lowest scores (more challenging readability). GPT-4 and Bard emerged as the strongest chatbots in terms of empathy and accurate citations on patient-related frequently asked questions pertaining to oral cancer. GPT-4 had the highest overall quality, whereas Bing showed the lowest level of quality, empathy, and citation accuracy. CONCLUSION GPT-4 demonstrated the highest quality responses to frequently asked questions pertaining to oral cancer. Although impressive in their ability to guide patients on common oral cancer topics, most chatbots did not perform well when assessed for empathy or citation accuracy.
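For reference, the two readability metrics used above can be computed with the standard Flesch formulas. The sketch below is illustrative only: it uses a rough vowel-group syllable counter and a made-up sample sentence rather than the study's data.

import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels; real tools use
    # pronunciation dictionaries for better estimates.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)      # words per sentence
    spw = syllables / len(words)           # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

ease, grade = flesch_scores("Oral cancer can affect the tongue. See a specialist if a sore does not heal.")
print(f"Flesch Reading Ease: {ease:.1f}, Flesch-Kincaid Grade Level: {grade:.1f}")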
Affiliation(s)
- Rata Rokhshad
- Department of Pediatric Dentistry, Loma Linda School of Dentistry, CA, USA
- Zaid H Khoury
- Department of Oral Diagnostic Sciences & Research, School of Dentistry, Meharry Medical College, TN, USA
- Parisa Motie
- Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
- Jeffery B Price
- Division of Artificial Intelligence Research, Department of Oncology and Diagnostic Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA
- Tiffany Tavares
- Department of Comprehensive Dentistry, UT Health San Antonio, School of Dentistry, San Antonio, TX, USA
- Maryam Jessri
- Oral Medicine and Pathology Department, School of Dentistry, University of Queensland, Herston, QLD, Australia; Oral Medicine Department, MetroNorth Hospital and Health Services, Queensland Health, QLD, Australia
- Roxanne Bavarian
- Department of Oral and Maxillofacial Surgery, Massachusetts General Hospital, Boston, MA, USA; Department of Oral and Maxillofacial Surgery, Harvard School of Dental Medicine, Boston, MA, USA
- James J Sciubba
- Department of Otolaryngology, Head & Neck Surgery, The Johns Hopkins University, Baltimore, MD, USA
- Ahmed S Sultan
- Division of Artificial Intelligence Research, Department of Oncology and Diagnostic Sciences, University of Maryland School of Dentistry, Baltimore, MD, USA; University of Maryland Marlene and Stewart Greenebaum Comprehensive Cancer Center, Baltimore, MD, USA.
2
Rao AS, Kim J, Mu A, Young CC, Kalmowitz E, Senter-Zapata M, Whitehead DC, Garibyan L, Landman AB, Succi MD. Synthetic medical education in dermatology leveraging generative artificial intelligence. NPJ Digit Med 2025; 8:247. [PMID: 40320492 PMCID: PMC12050279 DOI: 10.1038/s41746-025-01650-x]
Abstract
The advent of large language models (LLMs) represents an enormous opportunity to revolutionize medical education. Via "synthetic education," LLMs can be harnessed to generate novel content for medical education purposes, offering potentially unlimited resources for physicians in training. Utilizing OpenAI's GPT-4, we generated clinical vignettes and accompanying explanations for 20 skin and soft tissue diseases tested on the United States Medical Licensing Examination. Physician experts gave the vignettes high average scores on a Likert scale in scientific accuracy (4.45/5), comprehensiveness (4.3/5), and overall quality (4.28/5) and low scores for potential clinical harm (1.6/5) and demographic bias (1.52/5). A strong correlation (r = 0.83) was observed between comprehensiveness and overall quality. Vignettes did not incorporate significant demographic diversity. This study underscores the potential of LLMs in enhancing the scalability, accessibility, and customizability of dermatology education materials. Efforts to increase vignettes' demographic diversity should be incorporated to increase applicability to diverse populations.
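As a worked illustration of the reported comprehensiveness-quality correlation (r = 0.83), the snippet below computes a Pearson correlation with scipy; the rating vectors are hypothetical placeholders, not the study's data.

from scipy.stats import pearsonr

# Hypothetical per-vignette mean ratings on a 1-5 Likert scale.
comprehensiveness = [4.5, 4.0, 4.8, 3.9, 4.2, 4.6, 4.1, 4.7]
overall_quality   = [4.4, 3.8, 4.7, 4.0, 4.1, 4.5, 4.0, 4.6]

r, p = pearsonr(comprehensiveness, overall_quality)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")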
Affiliation(s)
- Arya S Rao
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA
- John Kim
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA
- Andrew Mu
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA
- Cameron C Young
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA
- Ezra Kalmowitz
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA
- Michael Senter-Zapata
- Harvard Medical School, Boston, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
- David C Whitehead
- Harvard Medical School, Boston, MA, USA
- Massachusetts General Hospital, Boston, MA, USA
- Lilit Garibyan
- Harvard Medical School, Boston, MA, USA
- Massachusetts General Hospital, Boston, MA, USA
- Adam B Landman
- Harvard Medical School, Boston, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
- Marc D Succi
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, USA.
- Massachusetts General Hospital, Boston, MA, USA.
3
Rao A, Mu A, Enichen E, Gupta D, Hall N, Koranteng E, Marks W, Senter-Zapata MJ, Whitehead DC, White BA, Saini S, Landman AB, Succi MD. A Future of Self-Directed Patient Internet Research: Large Language Model-Based Tools Versus Standard Search Engines. Ann Biomed Eng 2025; 53:1199-1208. [PMID: 40025252 DOI: 10.1007/s10439-025-03701-6]
Abstract
PURPOSE As generalist large language models (LLMs) become more commonplace, patients will inevitably increasingly turn to these tools instead of traditional search engines. Here, we evaluate publicly available LLM-based chatbots as tools for patient education through physician review of responses provided by Google, Bard, GPT-3.5 and GPT-4 to commonly searched queries about prevalent chronic health conditions in the United States. METHODS Five distinct commonly Google-searched queries were selected for (i) hypertension, (ii) hyperlipidemia, (iii) diabetes, (iv) anxiety, and (v) mood disorders and prompted into each model of interest. Responses were assessed by board-certified physicians for accuracy, comprehensiveness, and overall quality on a five-point Likert scale. The Flesch-Kincaid Grade Levels were calculated to assess readability. RESULTS GPT-3.5 (4.40 ± 0.48, 4.29 ± 0.43) and GPT-4 (4.35 ± 0.30, 4.24 ± 0.28) received higher ratings in comprehensiveness and quality than Bard (3.79 ± 0.36, 3.87 ± 0.32) and Google (1.87 ± 0.42, 2.11 ± 0.47), all p < 0.05. However, Bard (9.45 ± 1.35) and Google responses (9.92 ± 5.31) had a lower average Flesch-Kincaid Grade Level compared to GPT-3.5 (14.69 ± 1.57) and GPT-4 (12.88 ± 2.02), indicating greater readability. CONCLUSION This study suggests that publicly available LLM-based tools may provide patients with more accurate responses to queries on chronic health conditions than answers provided by Google search. These results provide support for the use of these tools in place of traditional search engines for health-related queries.
Affiliation(s)
- Arya Rao
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Andrew Mu
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Elizabeth Enichen
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Dhruva Gupta
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Nathan Hall
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Erica Koranteng
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- William Marks
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Harvard Business School, Boston, MA, USA
- Michael J Senter-Zapata
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Mass General Brigham, Boston, MA, USA
- David C Whitehead
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Benjamin A White
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Sanjay Saini
- Harvard Medical School, Boston, MA, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Adam B Landman
- Harvard Medical School, Boston, MA, USA
- Mass General Brigham, Boston, MA, USA
- Marc D Succi
- Harvard Medical School, Boston, MA, USA.
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA.
- Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA.
- Mass General Brigham, Boston, MA, USA.
4
Harkos C, Hadjigeorgiou AG, Voutouri C, Kumar AS, Stylianopoulos T, Jain RK. Using mathematical modelling and AI to improve delivery and efficacy of therapies in cancer. Nat Rev Cancer 2025; 25:324-340. [PMID: 39972158 DOI: 10.1038/s41568-025-00796-w]
Abstract
Mathematical modelling has proven to be a valuable tool in predicting the delivery and efficacy of molecular, antibody-based, nano and cellular therapy in solid tumours. Mathematical models based on our understanding of the biological processes at subcellular, cellular and tissue level are known as mechanistic models that, in turn, are divided into continuous and discrete models. Continuous models are further divided into lumped parameter models - for describing the temporal distribution of medicine in tumours and normal organs - and distributed parameter models - for studying the spatiotemporal distribution of therapy in tumours. Discrete models capture interactions at the cellular and subcellular levels. Collectively, these models are useful for optimizing the delivery and efficacy of molecular, nanoscale and cellular therapy in tumours by incorporating the biological characteristics of tumours, the physicochemical properties of drugs, the interactions among drugs, cancer cells and various components of the tumour microenvironment, and for enabling patient-specific predictions when combined with medical imaging. Artificial intelligence-based methods, such as machine learning, have ushered in a new era in oncology. These data-driven approaches complement mechanistic models and have immense potential for improving cancer detection, treatment and drug discovery. Here we review these diverse approaches and suggest ways to combine mechanistic and artificial intelligence-based models to further improve patient treatment outcomes.
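To make the lumped-parameter idea concrete, here is a minimal two-compartment sketch of drug exchange between plasma and tumour solved as ordinary differential equations; the rate constants, dose, and time span are assumptions for illustration and are not taken from the review.

import numpy as np
from scipy.integrate import solve_ivp

# Illustrative two-compartment lumped-parameter model: plasma concentration C_p
# exchanges with tumour concentration C_t and is cleared from plasma.
k_pt, k_tp, k_el = 0.8, 0.3, 0.5   # assumed rate constants, 1/h

def rhs(t, y):
    C_p, C_t = y
    dC_p = -k_el * C_p - k_pt * C_p + k_tp * C_t
    dC_t = k_pt * C_p - k_tp * C_t
    return [dC_p, dC_t]

# Unit bolus in plasma at t = 0, simulated over 24 hours.
sol = solve_ivp(rhs, (0.0, 24.0), [1.0, 0.0], t_eval=np.linspace(0, 24, 49))
print("Tumour concentration at 24 h:", round(sol.y[1, -1], 4))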
Affiliation(s)
- Constantinos Harkos
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Andreas G Hadjigeorgiou
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Chrysovalantis Voutouri
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus
- Ashwin S Kumar
- Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Triantafyllos Stylianopoulos
- Cancer Biophysics Laboratory, Department of Mechanical and Manufacturing Engineering, University of Cyprus, Nicosia, Cyprus.
- Rakesh K Jain
- Edwin L. Steele Laboratories, Department of Radiation Oncology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA.
5
Suárez A, Arena S, Herranz Calzada A, Castillo Varón AI, Diaz-Flores García V, Freire Y. Decoding wisdom: Evaluating ChatGPT's accuracy and reproducibility in analyzing orthopantomographic images for third molar assessment. Comput Struct Biotechnol J 2025; 28:141-147. [PMID: 40271108 PMCID: PMC12017887 DOI: 10.1016/j.csbj.2025.04.010]
Abstract
The integration of Artificial Intelligence (AI) into healthcare has opened new avenues for clinical decision support, particularly in radiology. The aim of this study was to evaluate the accuracy and reproducibility of ChatGPT-4o in the radiographic image interpretation of orthopantomograms (OPGs) for assessment of lower third molars, simulating real patient requests for tooth extraction. Thirty OPGs were analyzed, each paired with a standardized prompt submitted to ChatGPT-4o, generating 900 responses (30 per radiograph). Two oral surgery experts independently evaluated the responses using a three-point Likert scale (correct, partially correct/incomplete, incorrect), with disagreements resolved by a third expert. ChatGPT-4o achieved an accuracy rate of 38.44 % (95 % CI: 35.27 %-41.62 %). The percentage agreement among repeated responses was 82.7 %, indicating high consistency, though Gwet's coefficient of agreement (60.4 %) suggested only moderate repeatability. While the model correctly identified general features in some cases, it frequently provided incomplete or fabricated information, particularly in complex radiographs involving overlapping structures or underdeveloped roots. These findings highlight ChatGPT-4o's current limitations in dental radiographic interpretation. Although it demonstrated some capability in analyzing OPGs, its accuracy and reliability remain insufficient for unsupervised clinical use. Professional oversight is essential to prevent diagnostic errors. Further refinement and specialized training of AI models are needed to enhance their performance and ensure safe integration into dental practice, especially in patient-facing applications.
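As an illustration of the summary statistics reported above, the sketch below computes accuracy with a normal-approximation 95% CI and a simple pairwise percent agreement across repeated responses. The correct-response count is back-calculated from the reported 38.44% of 900 responses, and the agreement labels are hypothetical; Gwet's coefficient, which the study also reports, is not reproduced here.

import math
from itertools import combinations

def accuracy_ci(correct: int, total: int, z: float = 1.96):
    p = correct / total
    se = math.sqrt(p * (1 - p) / total)   # normal-approximation (Wald) interval
    return p, (p - z * se, p + z * se)

def pairwise_agreement(labels):
    # Fraction of response pairs for one radiograph that received the same rating.
    pairs = list(combinations(labels, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

p, (lo, hi) = accuracy_ci(correct=346, total=900)
print(f"Accuracy {p:.2%} (95% CI {lo:.2%}-{hi:.2%})")
print("Agreement:", round(pairwise_agreement(["correct"] * 25 + ["incorrect"] * 5), 3))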
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Stefania Arena
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Alberto Herranz Calzada
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Ana Isabel Castillo Varón
- Department of Medicine, Faculty of Medicine, Health and Sports, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Victor Diaz-Flores García
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
6
Pan J, Lee S, Cheligeer C, Martin EA, Riazi K, Quan H, Li N. Integrating large language models with human expertise for disease detection in electronic health records. Comput Biol Med 2025; 191:110161. [PMID: 40198990 DOI: 10.1016/j.compbiomed.2025.110161]
Abstract
OBJECTIVE Electronic health records (EHR) are widely available to complement administrative data-based disease surveillance and healthcare performance evaluation. Defining conditions from EHR is labour-intensive and requires extensive manual labelling of disease outcomes. This study developed an efficient strategy based on advanced large language models to identify multiple conditions from EHR clinical notes. METHODS We linked a cardiac registry cohort in 2015 with an EHR system in Alberta, Canada. We developed a pipeline that leveraged a generative large language model (LLM) to analyze, understand, and interpret EHR notes using prompts based on specific diagnoses, treatment management, and clinical guidelines. The pipeline was applied to detect acute myocardial infarction (AMI), diabetes, and hypertension. The performance was compared against clinician-validated diagnoses as the reference standard and against widely adopted International Classification of Diseases (ICD) codes-based methods. RESULTS The study cohort comprised 3088 patients and 551,095 clinical notes. The prevalence was 55.4%, 27.7%, and 65.9% for AMI, diabetes, and hypertension, respectively. The performance of the LLM-based pipeline for detecting conditions varied: AMI had 88% sensitivity, 63% specificity, and 77% positive predictive value (PPV); diabetes had 91% sensitivity, 86% specificity, and 71% PPV; and hypertension had 94% sensitivity, 32% specificity, and 72% PPV. Compared with ICD codes, the LLM-based method demonstrated improved sensitivity and negative predictive value across all conditions. The monthly percentage trends of cases detected by the LLM and by the reference standard showed consistent patterns. CONCLUSION The proposed LLM-based pipeline demonstrated reasonable accuracy and high efficiency in disease detection for multiple conditions. Human expert knowledge can be integrated into the pipeline to guide EHR note analysis without manually curated labels. The method could enable comprehensive real-time disease surveillance using EHRs.
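For clarity, the evaluation metrics quoted above follow directly from a 2 x 2 confusion matrix against the clinician-validated reference standard; the counts in this sketch are illustrative only, chosen to roughly match the reported AMI figures.

def diagnostic_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),   # positive predictive value
        "npv": tn / (tn + fn),   # negative predictive value
    }

# Illustrative counts only; the study reports 88% sensitivity, 63% specificity,
# and 77% PPV for AMI detection.
print(diagnostic_metrics(tp=880, fp=260, tn=440, fn=120))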
Affiliation(s)
- Jie Pan
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Libin Cardiovascular Institute, University of Calgary, Calgary, AB, Canada.
- Seungwon Lee
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Provincial Research Data Services, Alberta Health Services, Calgary, AB, Canada
- Cheligeer Cheligeer
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Provincial Research Data Services, Alberta Health Services, Calgary, AB, Canada
- Elliot A Martin
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Provincial Research Data Services, Alberta Health Services, Calgary, AB, Canada
- Kiarash Riazi
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Libin Cardiovascular Institute, University of Calgary, Calgary, AB, Canada
- Hude Quan
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Libin Cardiovascular Institute, University of Calgary, Calgary, AB, Canada
- Na Li
- Centre for Health Informatics, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Department of Community Health Sciences, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada; Libin Cardiovascular Institute, University of Calgary, Calgary, AB, Canada
7
Ghorbian M, Ghobaei-Arani M, Ghorbian S. Transforming breast cancer diagnosis and treatment with large language Models: A comprehensive survey. Methods 2025; 239:S1046-2023(25)00088-X. [PMID: 40199412 DOI: 10.1016/j.ymeth.2025.04.001]
Abstract
Breast cancer (BrCa), being one of the most prevalent forms of cancer in women, poses many challenges in the field of treatment and diagnosis due to its complex biological mechanisms. Early and accurate diagnosis plays a fundamental role in improving survival rates, but the limitations of existing imaging methods and clinical data interpretation often prevent optimal results. Large Language Models (LLMs), which are developed based on advanced architectures such as transformers, have brought about a significant revolution in data processing and medical decision-making. By analyzing a large volume of medical and clinical data, these models enable early diagnosis by identifying patterns in images and medical records and provide personalized treatment strategies by integrating genetic markers and clinical guidelines. Despite the transformative potential of these models, their use in BrCa management faces challenges such as data sensitivity, algorithm transparency, ethical considerations, and model compatibility with the details of medical applications that need to be addressed to achieve reliable results. This review systematically reviews the impact of LLMs on BrCa treatment and diagnosis. This study's objectives include analyzing the role of LLM technology in diagnosing and treating this disease. The findings indicate that the application of LLMs has resulted in significant improvements in various aspects of BrCa management, such as a 35% increase in the Efficiency of Diagnosis and BrCa Treatment (EDBC), a 30% enhancement in the System's Clinical Trust and Reliability (SCTR), and a 20% improvement in the quality of patient education and information (IPEI). Ultimately, this study demonstrates the importance of LLMs in advancing precision medicine for BrCa and paves the way for effective patient-centered care solutions.
Affiliation(s)
- Mohsen Ghorbian
- Department of Computer Engineering, Qo.C., Islamic Azad University, Qom, Iran.
- Saied Ghorbian
- Department of Molecular Genetics, An.C., Islamic Azad University, Ahar, Iran
8
Chen D, Avison K, Alnassar S, Huang RS, Raman S. Medical accuracy of artificial intelligence chatbots in oncology: a scoping review. Oncologist 2025; 30:oyaf038. [PMID: 40285677 PMCID: PMC12032582 DOI: 10.1093/oncolo/oyaf038]
Abstract
BACKGROUND Recent advances in large language models (LLM) have enabled human-like qualities of natural language competency. Applied to oncology, LLMs have been proposed to serve as an information resource and interpret vast amounts of data as a clinical decision-support tool to improve clinical outcomes. OBJECTIVE This review aims to describe the current status of medical accuracy of oncology-related LLM applications and research trends for further areas of investigation. METHODS A scoping literature search was conducted on Ovid Medline for peer-reviewed studies published since 2000. We included primary research studies that evaluated the medical accuracy of a large language model applied in oncology settings. Study characteristics and primary outcomes of included studies were extracted to describe the landscape of oncology-related LLMs. RESULTS Sixty studies were included based on the inclusion and exclusion criteria. The majority of studies evaluated LLMs in oncology as a health information resource in question-answer style examinations (48%), followed by diagnosis (20%) and management (17%). The number of studies that evaluated the utility of fine-tuning and prompt-engineering LLMs increased over time from 2022 to 2024. Studies reported the advantages of LLMs as an accurate information resource, reduction of clinician workload, and improved accessibility and readability of clinical information, while noting disadvantages such as poor reliability, hallucinations, and need for clinician oversight. DISCUSSION There exists significant interest in the application of LLMs in clinical oncology, with a particular focus as a medical information resource and clinical decision support tool. However, further research is needed to validate these tools in external hold-out datasets for generalizability and to improve medical accuracy across diverse clinical scenarios, underscoring the need for clinician supervision of these tools.
Affiliation(s)
- David Chen
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Kate Avison
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Saif Alnassar
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Ryan S Huang
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Srinivas Raman
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON M5T 1P5, Canada
- Department of Radiation Oncology, BC Cancer, Vancouver, BC V5Z 1G1, Canada
- Division of Radiation Oncology, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
9
Mitsuyama Y, Tatekawa H, Takita H, Sasaki F, Tashiro A, Oue S, Walston SL, Nonomiya Y, Shintani A, Miki Y, Ueda D. Comparative analysis of GPT-4-based ChatGPT's diagnostic performance with radiologists using real-world radiology reports of brain tumors. Eur Radiol 2025; 35:1938-1947. [PMID: 39198333 PMCID: PMC11913992 DOI: 10.1007/s00330-024-11032-8]
Abstract
OBJECTIVES Large language models like GPT-4 have demonstrated potential for diagnosis in radiology. Previous studies investigating this potential primarily utilized quizzes from academic journals. This study aimed to assess the diagnostic capabilities of GPT-4-based Chat Generative Pre-trained Transformer (ChatGPT) using actual clinical radiology reports of brain tumors and compare its performance with that of neuroradiologists and general radiologists. METHODS We collected brain MRI reports written in Japanese from preoperative brain tumor patients at two institutions from January 2017 to December 2021. The MRI reports were translated into English by radiologists. GPT-4 and five radiologists were presented with the same textual findings from the reports and asked to suggest differential and final diagnoses. The pathological diagnosis of the excised tumor served as the ground truth. McNemar's test and Fisher's exact test were used for statistical analysis. RESULTS In a study analyzing 150 radiological reports, GPT-4 achieved a final diagnostic accuracy of 73%, while radiologists' accuracy ranged from 65 to 79%. GPT-4's final diagnostic accuracy using reports from neuroradiologists was higher at 80%, compared to 60% using those from general radiologists. In the realm of differential diagnoses, GPT-4's accuracy was 94%, while radiologists' fell between 73 and 89%. Notably, for these differential diagnoses, GPT-4's accuracy remained consistent whether reports were from neuroradiologists or general radiologists. CONCLUSION GPT-4 exhibited good diagnostic capability, comparable to neuroradiologists in differentiating brain tumors from MRI reports. GPT-4 can be a second opinion for neuroradiologists on final diagnoses and a guidance tool for general radiologists and residents. CLINICAL RELEVANCE STATEMENT This study evaluated GPT-4-based ChatGPT's diagnostic capabilities using real-world clinical MRI reports from brain tumor cases, revealing that its accuracy in interpreting brain tumors from MRI findings is competitive with radiologists. KEY POINTS We investigated the diagnostic accuracy of GPT-4 using real-world clinical MRI reports of brain tumors. GPT-4 achieved final and differential diagnostic accuracy that is comparable with neuroradiologists. GPT-4 has the potential to improve the diagnostic process in clinical radiology.
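As a sketch of the paired comparison described above, McNemar's test can be applied to a 2 x 2 table of agreement between GPT-4 and a radiologist on the same cases; the counts below are hypothetical (summing to 150 reports), and the statsmodels implementation is assumed to be available.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Paired outcomes on the same cases (hypothetical counts):
# rows = GPT-4 correct/incorrect, columns = radiologist correct/incorrect.
table = np.array([[95, 15],
                  [12, 28]])

result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(f"statistic = {result.statistic}, p-value = {result.pvalue:.3f}")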
Affiliation(s)
- Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Fumi Sasaki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Akane Tashiro
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Satoshi Oue
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yuta Nonomiya
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Ayumi Shintani
- Department of Medical Statistics, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, 1-4-3 Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.
- Center for Health Science Innovation, Osaka Metropolitan University, 1-4-3, Asahi-machi, Abeno-ku, Osaka, 545-8585, Japan.
10
Kim TT, Makutonin M, Sirous R, Javan R. Optimizing Large Language Models in Radiology and Mitigating Pitfalls: Prompt Engineering and Fine-tuning. Radiographics 2025; 45:e240073. [PMID: 40048389 DOI: 10.1148/rg.240073]
Abstract
Large language models (LLMs) such as generative pretrained transformers (GPTs) have had a major impact on society, and there is increasing interest in using these models for applications in medicine and radiology. This article presents techniques to optimize these models and describes their known challenges and limitations. Specifically, the authors explore how to best craft natural language prompts, a process known as prompt engineering, for these models to elicit more accurate and desirable responses. The authors also explain how fine-tuning is conducted, in which a more general model, such as GPT-4, is further trained on a more specific use case, such as summarizing clinical notes, to further improve reliability and relevance. Despite the enormous potential of these models, substantial challenges limit their widespread implementation. These tools differ substantially from traditional health technology in their complexity and their probabilistic and nondeterministic nature, and these differences lead to issues such as "hallucinations," biases, lack of reliability, and security risks. Therefore, the authors provide radiologists with baseline knowledge of the technology underpinning these models and an understanding of how to use them, in addition to exploring best practices in prompt engineering and fine-tuning. Also discussed are current proof-of-concept use cases of LLMs in the radiology literature, such as in clinical decision support and report generation, and the limitations preventing their current adoption in medicine and radiology. ©RSNA, 2025 See invited commentary by Chung and Mongan in this issue.
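Below is a minimal sketch of the prompt-engineering pattern discussed above (an explicit role, output constraints, and an instruction not to guess), using the OpenAI Python client; the model name, report text, and parameter values are placeholders rather than recommendations from the article, and an API key is assumed to be configured.

from openai import OpenAI  # assumes the openai package and an API key are set up

client = OpenAI()

system_prompt = (
    "You are a radiology assistant. Summarize the impression of the report in "
    "plain language for a patient, in at most three sentences. If information "
    "is missing, say so rather than guessing."
)
report = "CT chest: 8 mm nodule in the right upper lobe, unchanged from prior exam."

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": report},
    ],
    temperature=0.2,  # lower temperature for more consistent output
)
print(response.choices[0].message.content)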
Affiliation(s)
- Theodore Taehoon Kim
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
- Michael Makutonin
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
- Reza Sirous
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
- Ramin Javan
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
11
Brin D, Sorin V, Barash Y, Konen E, Glicksberg BS, Nadkarni GN, Klang E. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol 2025; 35:1959-1965. [PMID: 39214893 PMCID: PMC11914349 DOI: 10.1007/s00330-024-11035-5]
Abstract
OBJECTIVES This study aims to assess the performance of a multimodal artificial intelligence (AI) model capable of analyzing both images and textual data (GPT-4V), in interpreting radiological images. It focuses on a range of modalities, anatomical regions, and pathologies to explore the potential of zero-shot generative AI in enhancing diagnostic processes in radiology. METHODS We analyzed 230 anonymized emergency room diagnostic images, consecutively collected over 1 week, using GPT-4V. Modalities included ultrasound (US), computerized tomography (CT), and X-ray images. The interpretations provided by GPT-4V were then compared with those of senior radiologists. This comparison aimed to evaluate the accuracy of GPT-4V in recognizing the imaging modality, anatomical region, and pathology present in the images. RESULTS GPT-4V identified the imaging modality correctly in 100% of cases (221/221), the anatomical region in 87.1% (189/217), and the pathology in 35.2% (76/216). However, the model's performance varied significantly across different modalities, with anatomical region identification accuracy ranging from 60.9% (39/64) in US images to 97% (98/101) and 100% (52/52) in CT and X-ray images (p < 0.001). Similarly, pathology identification ranged from 9.1% (6/66) in US images to 36.4% (36/99) in CT and 66.7% (34/51) in X-ray images (p < 0.001). These variations indicate inconsistencies in GPT-4V's ability to interpret radiological images accurately. CONCLUSION While the integration of AI in radiology, exemplified by multimodal GPT-4, offers promising avenues for diagnostic enhancement, the current capabilities of GPT-4V are not yet reliable for interpreting radiological images. This study underscores the necessity for ongoing development to achieve dependable performance in radiology diagnostics. CLINICAL RELEVANCE STATEMENT Although GPT-4V shows promise in radiological image interpretation, its high diagnostic hallucination rate (> 40%) indicates it cannot be trusted for clinical use as a standalone tool. Improvements are necessary to enhance its reliability and ensure patient safety. KEY POINTS GPT-4V's capability in analyzing images offers new clinical possibilities in radiology. GPT-4V excels in identifying imaging modalities but demonstrates inconsistent anatomy and pathology detection. Ongoing AI advancements are necessary to enhance diagnostic reliability in radiological applications.
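The modality-level differences reported above can be checked with a chi-squared test on the correct/incorrect counts given in the abstract (pathology identification: US 6/66, CT 36/99, X-ray 34/51); a minimal sketch:

import numpy as np
from scipy.stats import chi2_contingency

# Rows: US, CT, X-ray; columns: correct, incorrect (from the reported counts).
table = np.array([[6, 60],
                  [36, 63],
                  [34, 17]])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")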
Affiliation(s)
- Dana Brin
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel.
- Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel.
- Vera Sorin
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Yiftach Barash
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Eli Konen
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- Benjamin S Glicksberg
- Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Girish N Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Eyal Klang
- Department of Diagnostic Imaging, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Faculty of Medicine, Tel-Aviv University, Tel Aviv-Yafo, Israel
- DeepVision Lab, Chaim Sheba Medical Center, Tel Hashomer, Israel
- Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, New York, USA
- The Charles Bronfman Institute of Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, New York, USA
12
Alanzi TM, Arif W, Alotaibi A, Alnafisi A, Alhwaimal R, Altowairqi N, Alnifaie A, Aldossari K, Althumali K, Alanzi N. Impact of ChatGPT on Diabetes Mellitus Self-Management Among Patients in Saudi Arabia. Cureus 2025; 17:e81855. [PMID: 40342454 PMCID: PMC12059614 DOI: 10.7759/cureus.81855]
Abstract
BACKGROUND Diabetes mellitus (DM) is a chronic condition requiring continuous self-management to prevent complications. Artificial intelligence (AI)-driven tools like ChatGPT (OpenAI, Inc., San Francisco, CA, USA) offer potential support in education, monitoring, and decision-making. However, research on its effectiveness in DM self-management remains limited, particularly in Saudi Arabia, necessitating further investigation into its role and impact. PURPOSE This study aims to analyze the impact of ChatGPT on DM self-management. METHODS A qualitative experimental design was adopted in this study. After interacting with ChatGPT for a week, DM patients participated in interviews in which their perceptions of its impact were recorded. A total of 25 DM patients participated in the study, and their responses were analyzed using thematic analysis. RESULTS The analysis of interview data revealed 11 themes related to the impact of ChatGPT on DM self-management: informational support, personalized recommendations, motivation and support, assistance in decision-making, offering self-care reminders, facilitating communication with healthcare providers, facilitating peer support, providing mental health support, tracking and monitoring, conducting health assessments, and education and awareness. CONCLUSION ChatGPT has a positive impact on DM self-management. However, given ChatGPT's novelty, further research is needed to generalize these results and extend its applicability to other areas of healthcare.
Affiliation(s)
- Turki M Alanzi
- Health Information Management and Technology, Imam Abdulrahman Bin Faisal University, Dammam, SAU
- Wejdan Arif
- Radiological Sciences, King Saud University, Riyadh, SAU
- Aldanah Alotaibi
- Pharmacology, College of Pharmacy, Shaqra University, Riyadh, SAU
- Aasal Alnafisi
- Medicine, King Saud Bin Abdulaziz University for Health Sciences, Jeddah, SAU
- Raghad Alhwaimal
- Pharmacology, College of Pharmacy, Shaqra University, Riyadh, SAU
- Nouf Altowairqi
- Pharmacology and Toxicology, College of Pharmacy, Jazan University, Jazan, SAU
- Amal Alnifaie
- Pharmacology, College of Pharmacy, Shaqra University, Dawadmi, SAU
- Nouf Alanzi
- Clinical Laboratory Sciences, Jouf University, Sakaka, SAU
13
Succi MD, Chang BS, Rao AS. Building the AI-Enabled Medical School of the Future. JAMA 2025:2832147. [PMID: 40163081 DOI: 10.1001/jama.2025.2789]
Abstract
This Viewpoint discusses preparing medical students to succeed in AI-integrated medical schools.
Affiliation(s)
- Marc D Succi
- Harvard Medical School, Boston, Massachusetts
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston
- Bernard S Chang
- Harvard Medical School, Boston, Massachusetts
- Beth Israel Deaconess Medical Center, Boston, Massachusetts
- Arya S Rao
- Harvard Medical School, Boston, Massachusetts
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston
14
Güneş YC, Cesur T, Çamur E, Karabekmez LG. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition. Diagn Interv Radiol 2025; 31:111-129. [PMID: 39248152 PMCID: PMC11880873 DOI: 10.4274/dir.2024.242876]
Abstract
PURPOSE This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. METHODS This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. Correct answers and accuracy by question type were compared using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests. RESULTS Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different question categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than the other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories for all models except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). CONCLUSION Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. CLINICAL SIGNIFICANCE This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
Affiliation(s)
- Yasin Celal Güneş
- Kırıkkale Yüksek İhtisas Hospital Clinic of Radiology, Kırıkkale, Türkiye
- Turay Cesur
- Mamak State Hospital Clinic of Radiology, Ankara, Türkiye
- Eren Çamur
- Ankara 29 Mayıs State Hospital Clinic of Radiology, Ankara, Türkiye
- Leman Günbey Karabekmez
- Ankara Yıldırım Beyazıt University Faculty of Medicine Department of Radiology, Ankara, Türkiye
15
Zhao Q, Wang H, Wang R, Cao H. Deriving insights from enhanced accuracy: Leveraging prompt engineering in custom GPT for assessing Chinese Nursing Licensing Exam. Nurse Educ Pract 2025; 84:104284. [PMID: 39954324 DOI: 10.1016/j.nepr.2025.104284]
Abstract
AIM This study aims to build a Custom GPT specifically designed to answer questions from the Chinese Nursing Licensing Exam and to examine its accuracy and response quality. BACKGROUND Custom GPT could be an efficient tool in nursing education, but it has not yet been implemented in this field. METHODS A quantitative, descriptive, cross-sectional approach was used to evaluate the performance of a Custom GPT. In this study, we developed a Custom GPT by integrating customized knowledge and using prompt engineering, retrieval-augmented generation, and semantic search technology. Our Custom GPT's performance was compared with that of standard ChatGPT-4 by analyzing 720 questions from three mock exams for the 2024 Chinese Nursing Licensing Exam. RESULTS Custom GPT provided superior results, with its accuracy consistently exceeding 90% across all six parts of the exams, whereas the accuracy of ChatGPT-4 ranged from 73% to 89%. Furthermore, the performance of Custom GPT across different question types (accuracy >85%) was superior to that of ChatGPT-4 (accuracy 66-83%). The odds ratios consistently favored Custom GPT, indicating a significantly higher likelihood of correct responses (P < 0.05 for most comparisons). In generating explanations, Custom GPT tended to provide more concise and confident responses, whereas ChatGPT-4 provided longer, speculative responses with higher chances of inaccuracies and hallucinations. CONCLUSIONS This study demonstrated significant advantages of Custom GPT over ChatGPT in the Chinese Nursing Licensing Exam, indicating its immense potential in specific application scenarios and its potential for expansion to other areas of nursing.
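As a rough sketch of the retrieval-augmented generation step described above: embed the question and the knowledge chunks, pick the most similar chunk by cosine similarity, and prepend it to the prompt. The embedding function below is a random-vector placeholder and the knowledge snippets are hypothetical; a real pipeline would use an actual embedding model.

import numpy as np

def embed(texts):
    # Placeholder embedding: random vectors with a fixed seed. A real pipeline
    # would call a sentence-embedding model or an embeddings API instead.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

knowledge_chunks = [
    "Post-operative wound care guidance ...",      # hypothetical knowledge snippets
    "Insulin administration checklist ...",
    "Infection control protocol for wards ...",
]
question = "How should a surgical wound be cleaned after discharge?"

chunk_vecs = embed(knowledge_chunks)
q_vec = embed([question])[0]

# Cosine similarity between the question and each chunk; take the best match.
sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
best = knowledge_chunks[int(np.argmax(sims))]

prompt = f"Answer using only this reference material:\n{best}\n\nQuestion: {question}"
print(prompt)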
Affiliation(s)
- Quantong Zhao
- Department of Nursing, Lequn Branch, The First Hospital of Jilin University, Changchun, Jilin, China; School of Nursing, Jilin University, Changchun, Jilin, China.
- Haiyan Wang
- Department of Nursing, Lequn Branch, The First Hospital of Jilin University, Changchun, Jilin, China
- Ran Wang
- Department of Nursing, Lequn Branch, The First Hospital of Jilin University, Changchun, Jilin, China; School of Nursing, Jilin University, Changchun, Jilin, China
- Hongshi Cao
- Department of Nursing, Lequn Branch, The First Hospital of Jilin University, Changchun, Jilin, China.
16
Young CC, Enichen E, Rao A, Succi MD. Racial, ethnic, and sex bias in large language model opioid recommendations for pain management. Pain 2025; 166:511-517. [PMID: 39283333 PMCID: PMC12042288 DOI: 10.1097/j.pain.0000000000003388]
Abstract
ABSTRACT Understanding how large language model (LLM) recommendations vary with patient race/ethnicity provides insight into how LLMs may counter or compound bias in opioid prescription. Forty real-world patient cases were sourced from the MIMIC-IV Note dataset with chief complaints of abdominal pain, back pain, headache, or musculoskeletal pain and amended to include all combinations of race/ethnicity and sex. Large language models were instructed to provide a subjective pain rating and comprehensive pain management recommendation. Univariate analyses were performed to evaluate the association between racial/ethnic group or sex and the specified outcome measures-subjective pain rating, opioid name, order, and dosage recommendations-suggested by 2 LLMs (GPT-4 and Gemini). Four hundred eighty real-world patient cases were provided to each LLM, and responses included pharmacologic and nonpharmacologic interventions. Tramadol was the most recommended weak opioid in 55.4% of cases, while oxycodone was the most frequently recommended strong opioid in 33.2% of cases. Relative to GPT-4, Gemini was more likely to rate a patient's pain as "severe" (OR: 0.57 95% CI: [0.54, 0.60]; P < 0.001), recommend strong opioids (OR: 2.05 95% CI: [1.59, 2.66]; P < 0.001), and recommend opioids later (OR: 1.41 95% CI: [1.22, 1.62]; P < 0.001). Race/ethnicity and sex did not influence LLM recommendations. This study suggests that LLMs do not preferentially recommend opioid treatment for one group over another. Given that prior research shows race-based disparities in pain perception and treatment by healthcare providers, LLMs may offer physicians a helpful tool to guide their pain management and ensure equitable treatment across patient groups.
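For reference, odds ratios of the kind reported above can be computed from a 2 x 2 table with a Wald confidence interval on the log scale; the counts below are hypothetical and chosen only to land near an OR of about 2.

import math

def odds_ratio_ci(a: int, b: int, c: int, d: int, z: float = 1.96):
    # 2x2 table: a/b = outcome present/absent in group 1, c/d = in group 2.
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of log(OR), Wald method
    lo = math.exp(math.log(or_) - z * se_log)
    hi = math.exp(math.log(or_) + z * se_log)
    return or_, lo, hi

# Hypothetical counts: strong-opioid recommendations by one model vs another.
or_, lo, hi = odds_ratio_ci(a=160, b=320, c=95, d=385)
print(f"OR = {or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")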
Collapse
Affiliation(s)
- Cameron C. Young
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| | - Elizabeth Enichen
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| | - Arya Rao
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| | - Marc D. Succi
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
- Enterprise Radiology, Mass General Brigham, Boston, MA, United States
| |
Collapse
|
17
|
Guo S, Li R, Li G, Chen W, Huang J, He L, Ma Y, Wang L, Zheng H, Tian C, Zhao Y, Pan X, Wan H, Liu D, Li Z, Lei J. Comparing ChatGPT's and Surgeon's Responses to Thyroid-related Questions From Patients. J Clin Endocrinol Metab 2025; 110:e841-e850. [PMID: 38597169 DOI: 10.1210/clinem/dgae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/12/2024] [Revised: 04/03/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
CONTEXT For some common thyroid-related conditions with high prevalence and long follow-up times, ChatGPT can be used to respond to common thyroid-related questions. OBJECTIVE In this cross-sectional study, we assessed the ability of ChatGPT (version GPT-4.0) to provide accurate, comprehensive, compassionate, and satisfactory responses to common thyroid-related questions. METHODS First, we obtained 28 thyroid-related questions from the Huayitong app, which, together with 2 interfering questions, eventually formed 30 questions. These questions were then answered by ChatGPT (on July 19, 2023), a junior specialist, and a senior specialist (on July 20, 2023) separately. Finally, 26 patients and 11 thyroid surgeons evaluated those responses on 4 dimensions: accuracy, comprehensiveness, compassion, and satisfaction. RESULTS Among the 30 questions and responses, ChatGPT's speed of response was faster than that of the junior specialist (8.69 [7.53-9.48] vs 4.33 [4.05-4.60]; P < .001) and the senior specialist (8.69 [7.53-9.48] vs 4.22 [3.36-4.76]; P < .001). The word count of ChatGPT's responses was greater than that of both the junior specialist (341.50 [301.00-384.25] vs 74.50 [51.75-84.75]; P < .001) and the senior specialist (341.50 [301.00-384.25] vs 104.00 [63.75-177.75]; P < .001). ChatGPT received higher scores than the junior specialist and senior specialist in terms of accuracy, comprehensiveness, compassion, and satisfaction in responding to common thyroid-related questions. CONCLUSION ChatGPT performed better than a junior specialist and a senior specialist in answering common thyroid-related questions, but further research is needed to validate the logical ability of ChatGPT for complex thyroid questions.
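The median [interquartile range] summaries reported above follow a standard nonparametric pattern; a minimal sketch with hypothetical word counts is shown below. The Mann-Whitney U test used here is a common choice for this kind of comparison and is an assumption, not necessarily the study's exact method.

```python
# Median [IQR] and a Mann-Whitney U test for response word counts (hypothetical data).
import numpy as np
from scipy.stats import mannwhitneyu

chatbot_words   = np.array([341, 310, 402, 298, 365, 330, 385, 301])
clinician_words = np.array([ 74,  90,  52,  61,  88,  79,  45,  83])

def med_iqr(x):
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return f"{med:.1f} [{q1:.1f}-{q3:.1f}]"

stat, p = mannwhitneyu(chatbot_words, clinician_words, alternative="two-sided")
print(f"chatbot {med_iqr(chatbot_words)} vs clinician {med_iqr(clinician_words)}; U = {stat:.0f}, P = {p:.4f}")
```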
Collapse
Affiliation(s)
- Siyin Guo
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Ruicen Li
- Health Management Center, General Practice Medical Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Genpeng Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Wenjie Chen
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jing Huang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Linye He
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Yu Ma
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Liying Wang
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Hongping Zheng
- Department of Thyroid Surgery, General Surgery Ward 7, The First Hospital of Lanzhou University, Lanzhou, Gansu 730000, China
| | - Chunxiang Tian
- Chengdu Women's and Children's Central Hospital, School of Medicine, University of Electronic Science and Technology of China, Chengdu, Sichuan 610031, China
| | - Yatong Zhao
- Thyroid Surgery, Zhengzhou Central Hospital Affiliated of Zhengzhou University, Zhengzhou, Henan 450007, China
| | - Xinmin Pan
- Department of Thyroid Surgery, General Surgery III, Gansu Provincial Hospital, Lanzhou, Gansu 730000, China
| | - Hongxing Wan
- Department of Oncology, Sanya People's Hospital, Sanya, Hainan 572000, China
| | - Dasheng Liu
- Department of Vascular Thyroid Surgery, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, Guangdong 510120, China
| | - Zhihui Li
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Jianyong Lei
- Division of Thyroid Surgery, Department of General Surgery, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| |
Collapse
|
18
|
Potočnik J, Thomas E, Kearney D, Killeen RP, Heffernan EJ, Foley SJ. Can ChatGPT and Gemini justify brain CT referrals? A comparative study with human experts and a custom prediction model. Eur Radiol Exp 2025; 9:24. [PMID: 39966263 PMCID: PMC11836243 DOI: 10.1186/s41747-025-00569-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Accepted: 02/05/2025] [Indexed: 02/20/2025] Open
Abstract
BACKGROUND The poor uptake of imaging referral guidelines in Europe results in a substantial amount of inappropriate computed tomography (CT) scans. Publicly available chatbots, ChatGPT and Gemini, offer an alternative for justifying real-world referrals. Recent research reports high ChatGPT accuracy when analysing American College of Radiology Appropriateness Criteria variants. We compared the chatbots' performance in interpreting, justifying, and suggesting alternative imaging for unstructured adult brain CT referrals in accordance with the European Society of Radiology iGuide. Our prediction model for automated iGuide categorisation of referrals was also compared against the chatbots. METHODS The iGuide justification of 143 real-world CT brain referrals, used to evaluate a prediction model, was analysed by two radiographers and radiologists. ChatGPT-4's and Gemini's imaging recommendations and pathology suspicions were compared with those of humans, with respect to referral completeness. Inter-rater reliability with κ statistics determined the agreement between entities. RESULTS Chatbots' performance was limited (κ = 0.3) but improved for more complete referrals. The prediction model outperformed the chatbots in justification analysis (κ = 0.853). The chatbots' interpretations of complete referrals were highly consistent (49/52, 94.2%). The agreement regarding alternative imaging was high for both complete and ambiguous referrals, with ChatGPT and Gemini correctly identifying imaging modality and anatomical region in 83/96 (86.5%) and 81/96 (84.4%) cases, respectively. CONCLUSION The chatbots' ability to analyse the justification of adult brain CT referrals is limited to complete referrals, unlike our prediction model. Further research is needed to confirm these findings for other types of CT scans and modalities. RELEVANCE STATEMENT ChatGPT and Gemini exhibit potential in justifying free text brain CT referrals; however, further improvements are required to handle real-world referrals of varying quality. KEY POINTS Custom prediction model's justification analysis strongly aligns with iGuide and surpasses chatbots. Chatbots incorrectly justified almost one-half of all CT brain referrals. Chatbots have limited performance in justifying ambiguous CT brain referrals. Chatbot performance improved when referrals were detailed and included suspected pathology.
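Agreement between chatbot and human justification decisions, as quantified by κ above, reduces to a few lines of code; the labels below are hypothetical, and the binary justified/not-justified scheme is an illustrative simplification of the iGuide categories.

```python
# Chatbot vs human agreement on referral justification via Cohen's kappa (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

human   = ["justified", "not_justified", "justified", "justified",     "not_justified", "justified"]
chatbot = ["justified", "justified",     "justified", "not_justified", "not_justified", "justified"]

kappa = cohen_kappa_score(human, chatbot)
print(f"Cohen's kappa = {kappa:.3f}")
```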
Collapse
Affiliation(s)
- Jaka Potočnik
- University College Dublin School of Medicine, Dublin, Ireland.
| | - Edel Thomas
- University College Dublin School of Medicine, Dublin, Ireland
| | | | - Ronan P Killeen
- University College Dublin School of Medicine, Dublin, Ireland
- St. Vincent's University Hospital, Dublin, Ireland
- Royal Victoria Eye and Ear Hospital, Dublin, Ireland
| | | | - Shane J Foley
- University College Dublin School of Medicine, Dublin, Ireland
| |
Collapse
|
19
|
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | | | | | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
20
|
Balta KY, Javidan AP, Walser E, Arntfield R, Prager R. Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations. J Intensive Care Med 2025; 40:184-190. [PMID: 39118320 PMCID: PMC11639400 DOI: 10.1177/08850666241267871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2024] [Revised: 06/12/2024] [Accepted: 07/18/2024] [Indexed: 08/10/2024]
Abstract
Background: We assessed 2 versions of the large language model (LLM) ChatGPT, versions 3.5 and 4.0, in generating appropriate, consistent, and readable recommendations on core critical care topics. Research Question: How do successive large language models compare in terms of generating appropriate, consistent, and readable recommendations on core critical care topics? Design and Methods: A set of 50 LLM-generated responses to clinical questions was evaluated by 2 independent intensivists based on a 5-point Likert scale for appropriateness, consistency, and readability. Results: ChatGPT 4.0 showed significantly higher median appropriateness scores compared to ChatGPT 3.5 (4.0 vs 3.0, P < .001). However, there was no significant difference in consistency between the 2 versions (40% vs 28%, P = 0.291). Readability, assessed by the Flesch-Kincaid Grade Level, was also not significantly different between the 2 models (14.3 vs 14.4, P = 0.93). Interpretation: Both models produced "hallucinations" (misinformation delivered with high confidence), which highlights the risk of relying on these tools without domain expertise. Despite their potential for clinical application, both models lacked consistency, producing different results when asked the same question multiple times. The study underscores the need for clinicians to understand the strengths and limitations of LLMs for safe and effective implementation in critical care settings. Registration: https://osf.io/8chj7/.
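The Flesch-Kincaid Grade Level used above is straightforward to compute programmatically; a minimal sketch using the textstat package is shown below, with an illustrative response text rather than one generated in the study.

```python
# Readability scoring of an LLM response with textstat (pip install textstat); text is illustrative.
import textstat

response = ("Sepsis management begins with early recognition, prompt cultures, broad-spectrum "
            "antibiotics within one hour, and balanced crystalloid resuscitation guided by perfusion targets.")

print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(response))
print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
```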
Collapse
Affiliation(s)
- Kaan Y. Balta
- Schulich School of Medicine & Dentistry, Western University, London, Ontario, Canada
| | - Arshia P. Javidan
- Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, Ontario, Canada
| | - Eric Walser
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
- Department of Surgery, Trauma Program, London Health Sciences Centre, London, Ontario, Canada
| | - Robert Arntfield
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
| | - Ross Prager
- Division of Critical Care, London Health Sciences Centre, Western University, London, Ontario, Canada
| |
Collapse
|
21
|
Young CC, Enichen E, Rivera C, Auger CA, Grant N, Rao A, Succi MD. Diagnostic Accuracy of a Custom Large Language Model on Rare Pediatric Disease Case Reports. Am J Med Genet A 2025; 197:e63878. [PMID: 39268988 DOI: 10.1002/ajmg.a.63878] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2024] [Revised: 08/10/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024]
Abstract
Accurately diagnosing rare pediatric diseases frequently represents a clinical challenge due to their complex and unusual clinical presentations. Here, we explore the capabilities of three large language models (LLMs), GPT-4, Gemini Pro, and a custom-built LLM (GPT-4 integrated with the Human Phenotype Ontology [GPT-4 HPO]), by evaluating their diagnostic performance on 61 rare pediatric disease case reports. The performance of the LLMs was assessed for accuracy in identifying the specific diagnosis, listing the correct diagnosis within a differential list, and identifying the broad disease category. In addition, GPT-4 HPO was tested on 100 general pediatrics case reports previously assessed with other LLMs to further validate its performance. The results indicated that GPT-4 was able to predict the correct diagnosis with a diagnostic accuracy of 13.1%, whereas both GPT-4 HPO and Gemini Pro had diagnostic accuracies of 8.2%. Further, GPT-4 HPO showed improved performance compared with the other two LLMs in identifying the correct diagnosis among its differential list and the broad disease category. Although these findings underscore the potential of LLMs for diagnostic support, particularly when enhanced with domain-specific ontologies, they also stress the need for further improvement prior to integration into clinical practice.
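The accuracy notions used above (exact top diagnosis versus correct diagnosis anywhere in the differential list) reduce to simple counting; the sketch below illustrates both with hypothetical cases and diagnoses, not data from the study.

```python
# Strict (top-1) accuracy and accuracy within the differential list (hypothetical diagnoses).
cases = [
    {"truth": "Kawasaki disease",  "differential": ["scarlet fever", "Kawasaki disease", "measles"]},
    {"truth": "Wilson disease",    "differential": ["autoimmune hepatitis", "viral hepatitis"]},
    {"truth": "Alagille syndrome", "differential": ["Alagille syndrome", "biliary atresia"]},
]

top1    = sum(c["differential"][0] == c["truth"] for c in cases) / len(cases)  # first-listed diagnosis correct
in_list = sum(c["truth"] in c["differential"]    for c in cases) / len(cases)  # correct diagnosis anywhere in list
print(f"top-1 accuracy = {top1:.1%}, within-differential accuracy = {in_list:.1%}")
```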
Collapse
Affiliation(s)
- Cameron C Young
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Ellie Enichen
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Christian Rivera
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Corinne A Auger
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Nathan Grant
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Arya Rao
- Harvard Medical School, Boston, Massachusetts, USA
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
| | - Marc D Succi
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, Massachusetts, USA
- Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts, USA
| |
Collapse
|
22
|
Badia JM, Casanova-Portoles D, Membrilla E, Rubiés C, Pujol M, Sancho J. Evaluation of ChatGPT-4 for the detection of surgical site infections from electronic health records after colorectal surgery: A pilot diagnostic accuracy study. J Infect Public Health 2025; 18:102627. [PMID: 39740340 DOI: 10.1016/j.jiph.2024.102627] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/21/2024] [Revised: 11/29/2024] [Accepted: 12/16/2024] [Indexed: 01/02/2025] Open
Abstract
BACKGROUND Surveillance of surgical site infection (SSI) relies on manual methods that are time-consuming and prone to subjectivity. This study evaluates the diagnostic accuracy of ChatGPT for detecting SSI from electronic health records after colorectal surgery via comparison with the results of a nationwide surveillance programme. METHODS This pilot, retrospective, multicentre analysis included 122 patients who underwent colorectal surgery. Patient records were reviewed by both manual surveillance and ChatGPT, which was tasked with identifying SSIs and categorizing them as superficial, deep, or organ-space infections. Sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated. Receiver operating characteristic (ROC) curve analysis determined the model's diagnostic performance. RESULTS ChatGPT achieved a sensitivity of 100 %, correctly identifying all SSIs detected by manual methods. The specificity was 54 %, indicating the presence of false positives. The PPV was 67 %, and the NPV was 100 %. The area under the ROC curve was 0.77, indicating good overall accuracy for distinguishing between SSI and non-SSI cases. Minor differences in outcomes were observed between colon and rectal surgeries, as well as between the hospitals participating in the study. CONCLUSIONS ChatGPT shows high sensitivity and good overall accuracy for detecting SSI. It appears to be a useful tool for initial screenings and for reducing manual review workload. The moderate specificity suggests a need for further refinement to reduce the rate of false positives. Integrating ChatGPT with electronic medical records, antibiotic consumption data, and imaging results for real-time analysis may further improve the surveillance of SSI. CLINICALTRIALS gov Identifier: NCT06556017.
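The sensitivity, specificity, PPV, NPV, and area under the ROC curve reported above all derive from a 2x2 confusion matrix; the sketch below shows the computation on hypothetical labels, not the study's data.

```python
# Diagnostic accuracy metrics from binary SSI labels (hypothetical predictions vs manual surveillance).
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # manual surveillance (1 = SSI)
y_pred = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0]   # chatbot classification

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sens = tp / (tp + fn)
spec = tn / (tn + fp)
ppv  = tp / (tp + fp)
npv  = tn / (tn + fn)
auc  = roc_auc_score(y_true, y_pred)   # with hard binary predictions this equals (sens + spec) / 2
print(f"sensitivity={sens:.2f} specificity={spec:.2f} PPV={ppv:.2f} NPV={npv:.2f} AUC={auc:.2f}")
```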
Collapse
Affiliation(s)
- Josep M Badia
- Department of Surgery, Hospital General de Granollers, Granollers, Spain; Universitat Internacional de Catalunya. Sant Cugat del Vallès, Barcelona, Spain.
| | - Daniel Casanova-Portoles
- Department of Surgery, Hospital General de Granollers, Granollers, Spain; Universitat Internacional de Catalunya. Sant Cugat del Vallès, Barcelona, Spain.
| | | | - Carles Rubiés
- Department of Digital Transformation, Hospital General de Granollers, Granollers, Spain.
| | - Miquel Pujol
- VINCat Program, Servei Català de la Salut, Barcelona, Catalonia, Spain; Centro de Investigación Biomédica en Red de Enfermedades Infecciosas (CIBERINFEC), Instituto de Salud Carlos III, Madrid, Spain; Department of Infectious Diseases, Hospital Universitari de Bellvitge - IDIBELL, L'Hospitalet de Llobregat, Spain.
| | - Joan Sancho
- Department of Surgery, Hospital del Mar, Barcelona, Spain.
| |
Collapse
|
23
|
Altalla' B, Abdalla S, Altamimi A, Bitar L, Al Omari A, Kardan R, Sultan I. Evaluating GPT models for clinical note de-identification. Sci Rep 2025; 15:3852. [PMID: 39890969 PMCID: PMC11785955 DOI: 10.1038/s41598-025-86890-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2024] [Accepted: 01/14/2025] [Indexed: 02/03/2025] Open
Abstract
The rapid digitalization of healthcare has created a pressing need for solutions that manage clinical data securely while ensuring patient privacy. This study evaluates the capabilities of GPT-3.5 and GPT-4 models in de-identifying clinical notes and generating synthetic data, using API access and zero-shot prompt engineering to optimize computational efficiency. Results show that GPT-4 significantly outperformed GPT-3.5, achieving a precision of 0.9925, a recall of 0.8318, an F1 score of 0.8973, and an accuracy of 0.9911. These results demonstrate GPT-4's potential as a powerful tool for safeguarding patient privacy while increasing the availability of clinical data for research. This work sets a benchmark for balancing data utility and privacy in healthcare data management.
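Precision, recall, F1, and accuracy for de-identification are typically computed over detected protected health information (PHI) entities; the sketch below uses hypothetical gold and predicted spans purely for illustration.

```python
# Entity-level precision, recall, and F1 for PHI detection (hypothetical gold vs predicted spans).
gold      = {"John Smith", "03/14/2021", "MRN 884321", "Amman"}
predicted = {"John Smith", "03/14/2021", "Amman", "Dr. Lee"}

tp = len(gold & predicted)                 # correctly detected PHI entities
precision = tp / len(predicted)            # detected entities that are truly PHI
recall    = tp / len(gold)                 # true PHI entities that were detected
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```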
Collapse
Affiliation(s)
- Bayan Altalla'
- King Hussein Cancer Center, Queen Rania Street, Amman, Jordan.
- Princess Sumaya University for Technology, Khalil Al-Saket St, Amman, Jordan.
| | - Sameera Abdalla
- Princess Sumaya University for Technology, Khalil Al-Saket St, Amman, Jordan
| | - Ahmad Altamimi
- Princess Sumaya University for Technology, Khalil Al-Saket St, Amman, Jordan
| | - Layla Bitar
- King Hussein Cancer Center, Queen Rania Street, Amman, Jordan
| | - Amal Al Omari
- King Hussein Cancer Center, Queen Rania Street, Amman, Jordan
| | - Ramiz Kardan
- University of Jordan, Queen Rania Street, Amman, Jordan
| | - Iyad Sultan
- King Hussein Cancer Center, Queen Rania Street, Amman, Jordan.
| |
Collapse
|
24
|
Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025; 333:319-328. [PMID: 39405325 PMCID: PMC11480901 DOI: 10.1001/jama.2024.21700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Accepted: 09/30/2024] [Indexed: 10/19/2024]
Abstract
Importance Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas. Objective To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty. Data Sources A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024. Study Selection Studies evaluating 1 or more LLMs in health care. Data Extraction and Synthesis Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty. Results Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented. Conclusions and Relevance Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.
Collapse
Affiliation(s)
- Suhana Bedi
- Department of Biomedical Data Science, Stanford School of Medicine, Stanford, California
| | - Yutong Liu
- Clinical Excellence Research Center, Stanford University, Stanford, California
| | - Lucy Orr-Ewing
- Clinical Excellence Research Center, Stanford University, Stanford, California
| | - Dev Dash
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | - Sanmi Koyejo
- Department of Computer Science, Stanford University, Stanford, California
| | - Alison Callahan
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | - Jason A. Fries
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | - Michael Wornow
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | - Akshay Swaminathan
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | | | - Hyo Jung Hong
- Department of Anesthesiology, Stanford University, Stanford, California
| | - Mehr Kashyap
- Stanford University School of Medicine, Stanford, California
| | - Akash R. Chaurasia
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| | - Nirav R. Shah
- Clinical Excellence Research Center, Stanford University, Stanford, California
| | - Karandeep Singh
- Digital Health Innovation, University of California San Diego Health, San Diego
| | - Troy Tazbaz
- Digital Health Center of Excellence, US Food and Drug Administration, Washington, DC
| | - Arnold Milstein
- Clinical Excellence Research Center, Stanford University, Stanford, California
| | - Michael A. Pfeffer
- Department of Medicine, Stanford University School of Medicine, Stanford, California
| | - Nigam H. Shah
- Clinical Excellence Research Center, Stanford University, Stanford, California
- Center for Biomedical Informatics Research, Stanford University, Stanford, California
| |
Collapse
|
25
|
Kelly BS, Duignan S, Mathur P, Dillon H, Lee EH, Yeom KW, Keane PA, Lawlor A, Killeen RP. Can ChatGPT4-vision identify radiologic progression of multiple sclerosis on brain MRI? Eur Radiol Exp 2025; 9:9. [PMID: 39812885 PMCID: PMC11735712 DOI: 10.1186/s41747-024-00547-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2024] [Accepted: 12/16/2024] [Indexed: 01/16/2025] Open
Abstract
BACKGROUND The large language model ChatGPT can now accept image input with the GPT4-vision (GPT4V) version. We aimed to compare the performance of GPT4V to pretrained U-Net and vision transformer (ViT) models for the identification of the progression of multiple sclerosis (MS) on magnetic resonance imaging (MRI). METHODS Paired coregistered MR images with and without progression were provided as input to ChatGPT4V in a zero-shot experiment to identify radiologic progression. Its performance was compared to pretrained U-Net and ViT models. Accuracy was the primary evaluation metric, and 95% confidence intervals (CIs) were calculated by bootstrapping. We included 170 patients with MS (50 males, 120 females), aged 21-74 years (mean 42.3), imaged at a single institution from 2019 to 2021, each with 2-5 MRI studies (496 in total). RESULTS One hundred seventy patients were included: 110 for training, 30 for tuning, and 30 for testing; 100 unseen paired images were randomly selected from the test set for evaluation. Both U-Net and ViT had 94% (95% CI: 89-98%) accuracy, while GPT4V had 85% (77-91%). GPT4V gave cautious nonanswers in six cases. GPT4V had precision (specificity), recall (sensitivity), and F1 score of 89% (75-93%), 92% (82-98%), and 91% (82-97%), compared to 100% (100-100%), 88% (78-96%), and 94% (88-98%) for U-Net and 94% (87-100%), 94% (88-100%), and 94% (89-98%) for ViT. CONCLUSION The performance of GPT4V, combined with its accessibility, suggests it has the potential to impact AI radiology research. However, misclassified cases and overly cautious non-answers confirm that it is not yet ready for clinical use. RELEVANCE STATEMENT GPT4V can identify the radiologic progression of MS in a simplified experimental setting. However, GPT4V is not a medical device, and its widespread availability highlights the need for caution and education for lay users, especially those with limited access to expert healthcare. KEY POINTS Without fine-tuning or the need for prior coding experience, GPT4V can perform a zero-shot radiologic change detection task with reasonable accuracy. However, in absolute terms, in a simplified "spot the difference" medical imaging task, GPT4V was inferior to state-of-the-art computer vision methods. GPT4V's performance metrics were more similar to those of the ViT than the U-Net. This is an exploratory experimental study and GPT4V is not intended for use as a medical device.
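Bootstrapped 95% confidence intervals for accuracy, as reported above, are obtained by resampling cases with replacement; the following sketch uses synthetic labels rather than the study's MRI data.

```python
# Bootstrapped 95% confidence interval for classification accuracy (synthetic labels).
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 100)
y_pred = np.where(rng.random(100) < 0.85, y_true, 1 - y_true)  # simulate a ~85%-accurate classifier

accs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))            # resample cases with replacement
    accs.append((y_true[idx] == y_pred[idx]).mean())

lo, hi = np.percentile(accs, [2.5, 97.5])
print(f"accuracy = {(y_true == y_pred).mean():.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```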
Collapse
Affiliation(s)
- Brendan S Kelly
- St Vincent's University Hospital, Dublin, Ireland.
- Insight Centre for Data Analytics, UCD, Dublin, Ireland.
- Wellcome Trust-HRB, Irish Clinical Academic Training, Dublin, Ireland.
- School of Medicine, University College Dublin, Dublin, Ireland.
| | | | | | - Henry Dillon
- St Vincent's University Hospital, Dublin, Ireland
| | - Edward H Lee
- Lucille Packard Children's Hospital at Stanford, Stanford, CA, USA
| | - Kristen W Yeom
- Lucille Packard Children's Hospital at Stanford, Stanford, CA, USA
| | | | | | - Ronan P Killeen
- St Vincent's University Hospital, Dublin, Ireland
- School of Medicine, University College Dublin, Dublin, Ireland
| |
Collapse
|
26
|
Chia JLL, He GS, Ngiam KY, Hartman M, Ng QX, Goh SSN. Harnessing Artificial Intelligence to Enhance Global Breast Cancer Care: A Scoping Review of Applications, Outcomes, and Challenges. Cancers (Basel) 2025; 17:197. [PMID: 39857979 PMCID: PMC11764353 DOI: 10.3390/cancers17020197] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2024] [Revised: 01/02/2025] [Accepted: 01/07/2025] [Indexed: 01/27/2025] Open
Abstract
BACKGROUND In recent years, Artificial Intelligence (AI) has shown transformative potential in advancing breast cancer care globally. This scoping review seeks to provide a comprehensive overview of AI applications in breast cancer care, examining how they could reshape diagnosis, treatment, and management on a worldwide scale and discussing both the benefits and challenges associated with their adoption. METHODS In accordance with PRISMA-ScR and ensuing guidelines on scoping reviews, PubMed, Web of Science, Cochrane Library, and Embase were systematically searched from inception to end of May 2024. Keywords included "Artificial Intelligence" and "Breast Cancer". Original studies were included based on their focus on AI applications in breast cancer care, and narrative synthesis was employed for data extraction and interpretation, with the findings organized into coherent themes. RESULTS In total, 84 articles were included. The majority were conducted in developed countries (n = 54). Most publications were from the last 10 years (n = 83). The six main themes for AI applications were AI for breast cancer screening (n = 32), AI for image detection of nodal status (n = 7), AI-assisted histopathology (n = 8), AI in assessing post-neoadjuvant chemotherapy (NACT) response (n = 23), AI in breast cancer margin assessment (n = 5), and AI as a clinical decision support tool (n = 9). AI has been used as a clinical decision support tool to augment treatment decisions for breast cancer and in multidisciplinary tumor board settings. Overall, AI applications demonstrated improved accuracy and efficiency; however, most articles did not report patient-centric clinical outcomes. CONCLUSIONS AI applications in breast cancer care show promise in enhancing diagnostic accuracy and treatment planning. However, persistent challenges in AI adoption, such as data quality, algorithm transparency, and resource disparities, must be addressed to advance the field.
Collapse
Affiliation(s)
- Jolene Li Ling Chia
- NUS Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Dr. S117597, Singapore 119077, Singapore (G.S.H.)
| | - George Shiyao He
- NUS Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Dr. S117597, Singapore 119077, Singapore (G.S.H.)
| | - Kee Yuen Ngiam
- Department of Surgery, National University Hospital, Singapore 119074, Singapore; (K.Y.N.); (M.H.)
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, 12 Science Drive 2, #10-01, Singapore 117549, Singapore
| | - Mikael Hartman
- Department of Surgery, National University Hospital, Singapore 119074, Singapore; (K.Y.N.); (M.H.)
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, 12 Science Drive 2, #10-01, Singapore 117549, Singapore
| | - Qin Xiang Ng
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, 12 Science Drive 2, #10-01, Singapore 117549, Singapore
- SingHealth Duke-NUS Global Health Institute, Singapore 169857, Singapore
| | - Serene Si Ning Goh
- Department of Surgery, National University Hospital, Singapore 119074, Singapore; (K.Y.N.); (M.H.)
- Saw Swee Hock School of Public Health, National University of Singapore and National University Health System, 12 Science Drive 2, #10-01, Singapore 117549, Singapore
| |
Collapse
|
27
|
Farhadi Nia M, Ahmadi M, Irankhah E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. FRONTIERS IN DENTAL MEDICINE 2025; 5:1456208. [PMID: 39917691 PMCID: PMC11797834 DOI: 10.3389/fdmed.2024.1456208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Accepted: 10/16/2024] [Indexed: 02/09/2025] Open
Abstract
Artificial intelligence has dramatically reshaped our interaction with digital technologies, ushering in an era where advancements in AI algorithms and Large Language Models (LLMs) have given rise to natural language processing (NLP) systems like ChatGPT. This study delves into the impact of cutting-edge LLMs, notably OpenAI's ChatGPT, on medical diagnostics, with a keen focus on the dental sector. Leveraging publicly accessible datasets, these models augment the diagnostic capabilities of medical professionals, streamline communication between patients and healthcare providers, and enhance the efficiency of clinical procedures. The advent of ChatGPT-4 is poised to make substantial inroads into dental practices, especially in the realm of oral surgery. This paper sheds light on the current landscape and explores potential future research directions in the burgeoning field of LLMs, offering valuable insights for both practitioners and developers. Furthermore, it critically assesses the broad implications and challenges within various sectors, including academia and healthcare, thus mapping out an overview of AI's role in transforming dental diagnostics for enhanced patient care.
Collapse
Affiliation(s)
- Masoumeh Farhadi Nia
- Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA, United States
| | - Mohsen Ahmadi
- Department of Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, United States
- Department of Industrial Engineering, Urmia University of Technology, Urmia, Iran
| | - Elyas Irankhah
- Department of Mechanical Engineering, University of Massachusetts Lowell, Lowell, MA, United States
| |
Collapse
|
28
|
Altalla' B, Ahmad A, Bitar L, Al-Bssol M, Al Omari A, Sultan I. Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis. Int J Biomed Imaging 2025; 2025:5019035. [PMID: 39968311 PMCID: PMC11835477 DOI: 10.1155/ijbi/5019035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2024] [Revised: 09/20/2024] [Accepted: 12/03/2024] [Indexed: 02/20/2025] Open
Abstract
Recent advancements in large language models (LLMs), particularly GPT-3.5 and GPT-4, have sparked significant interest in their application within the medical field. This research offers a detailed comparative analysis of the abilities of GPT-3.5 and GPT-4 in the context of annotating radiology reports and generating impressions from chest computed tomography (CT) scans. The primary objective is to use these models to assist healthcare professionals in handling routine documentation tasks. Employing methods such as in-context learning (ICL) and retrieval-augmented generation (RAG), the study focused on generating impression sections from radiological findings. Comprehensive evaluation was applied using a variety of metrics, including recall-oriented understudy for gisting evaluation (ROUGE) for n-gram analysis, Instructor Similarity for contextual similarity, and BERTScore for semantic similarity, to assess the performance of these models. The study shows distinct performance differences between GPT-3.5 and GPT-4 across both zero-shot and few-shot learning scenarios. It was observed that certain prompts significantly influenced the performance outcomes, with specific prompts leading to more accurate impressions. The RAG method achieved a superior BERTScore of 0.92, showcasing its ability to generate semantically rich and contextually accurate impressions. In contrast, GPT-3.5 and GPT-4 excel in preserving language tone, with Instructor Similarity scores of approximately 0.92 across scenarios, underscoring the importance of prompt design in effective summarization tasks. The findings of this research emphasize the critical role of prompt design in optimizing model efficacy and point to the significant potential for further exploration in prompt engineering. Moreover, the study advocates for the standardized integration of such advanced LLMs in healthcare practices, highlighting their potential to enhance the efficiency and accuracy of medical documentation.
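The ROUGE and BERTScore metrics used above can be computed with open-source packages; the sketch below uses an invented reference/generated impression pair, and the package choices (rouge-score, bert-score) are assumptions rather than the authors' exact tooling.

```python
# Overlap- and embedding-based similarity between a generated and a reference impression
# (pip install rouge-score bert-score); the impression text is illustrative.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

reference = "No acute intrathoracic abnormality. Stable 4 mm right upper lobe nodule."
generated = "Stable 4 mm nodule in the right upper lobe; no acute abnormality in the chest."

rouge = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True).score(reference, generated)
P, R, F1 = bertscore([generated], [reference], lang="en")

print({name: round(s.fmeasure, 3) for name, s in rouge.items()})  # n-gram overlap
print("BERTScore F1:", round(float(F1[0]), 3))                    # semantic similarity
```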
Collapse
Affiliation(s)
- Bayan Altalla'
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
- School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
| | - Ashraf Ahmad
- School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
| | - Layla Bitar
- Artificial Intelligence and Data Innovation Office, King Hussein Cancer Center, Amman, Jordan
| | - Mohammed Al-Bssol
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
| | - Amal Al Omari
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
| | - Iyad Sultan
- Artificial Intelligence and Data Innovation Office, King Hussein Cancer Center, Amman, Jordan
| |
Collapse
|
29
|
Zitu MM, Le TD, Duong T, Haddadan S, Garcia M, Amorrortu R, Zhao Y, Rollison DE, Thieu T. Large language models in cancer: potentials, risks, and safeguards. BJR ARTIFICIAL INTELLIGENCE 2025; 2:ubae019. [PMID: 39777117 PMCID: PMC11703354 DOI: 10.1093/bjrai/ubae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 10/26/2024] [Accepted: 12/09/2024] [Indexed: 01/11/2025]
Abstract
This review examines the use of large language models (LLMs) in cancer, analysing articles sourced from PubMed, Embase, and Ovid Medline, published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. 59 articles were included in the review, categorized into 3 segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions on LLMs on cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs in cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
Collapse
Affiliation(s)
- Md Muntasir Zitu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Tuan Dung Le
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Duong
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Shohreh Haddadan
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Melany Garcia
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Rossybelle Amorrortu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Yayi Zhao
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Dana E Rollison
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Thieu
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| |
Collapse
|
30
|
Levin G, Gotlieb W, Ramirez P, Meyer R, Brezinov Y. ChatGPT in a gynaecologic oncology multidisciplinary team tumour board: A feasibility study. BJOG 2025; 132:99-101. [PMID: 39140200 PMCID: PMC11612612 DOI: 10.1111/1471-0528.17929] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Revised: 07/29/2024] [Accepted: 07/31/2024] [Indexed: 08/15/2024]
Affiliation(s)
- Gabriel Levin
- Department of Gynecologic Oncology, Jewish General Hospital, McGill University, Montreal, Quebec, Canada
| | - Walter Gotlieb
- Department of Gynecologic Oncology, Jewish General Hospital, McGill University, Montreal, Quebec, Canada
| | - Pedro Ramirez
- Department of Obstetrics and Gynecology, Houston Methodist Hospital, Houston, Texas, USA
| | - Raanan Meyer
- Division of Minimally Invasive Gynecologic Surgery, Department of Obstetrics and Gynecology, Cedars Sinai Medical Center, Los Angeles, California, USA
| | - Yoav Brezinov
- Lady Davis Institute for Cancer Research, Jewish General Hospital, McGill University, Montreal, Quebec, Canada
| |
Collapse
|
31
|
Nguyen D, Rao A, Mazumder A, Succi MD. Exploring the accuracy of embedded ChatGPT-4 and ChatGPT-4o in generating BI-RADS scores: a pilot study in radiologic clinical support. Clin Imaging 2025; 117:110335. [PMID: 39549561 DOI: 10.1016/j.clinimag.2024.110335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Revised: 10/17/2024] [Accepted: 10/24/2024] [Indexed: 11/18/2024]
Abstract
This study evaluates the accuracy of ChatGPT-4 and ChatGPT-4o in generating Breast Imaging Reporting and Data System (BI-RADS) scores from radiographic images. We tested both models using 77 breast cancer images from radiopaedia.org, including mammograms and ultrasounds. Images were analyzed in separate sessions to avoid bias. ChatGPT-4 and ChatGPT-4o achieved a 66.2 % accuracy across all BI-RADS cases. Performance was highest in BI-RADS 5 cases, with ChatGPT-4 and ChatGPT-4o scoring 84.4 % and 88.9 %, respectively. However, both models struggled with BI-RADS 1-3 cases, often assigning higher severity ratings. This study highlights the limitations of current LLMs in accurately grading these images and emphasizes the need for further research into these technologies before clinical integration.
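The study above used embedded ChatGPT sessions; programmatic image prompting via the OpenAI chat API follows a similar pattern. The sketch below is a hedged illustration in which the model name, prompt wording, and image URL are placeholders and not the study's protocol.

```python
# Sketch of sending an image to a vision-capable chat model for a BI-RADS estimate
# (OpenAI Python SDK >= 1.0; model, prompt, and image URL are illustrative placeholders).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Assign a BI-RADS category (0-6) to this mammogram and justify it in one sentence. "
                     "This is for research evaluation only, not clinical use."},
            {"type": "image_url",
             "image_url": {"url": "https://example.org/case123_mammogram.png"}},
        ],
    }],
)
print(response.choices[0].message.content)
```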
Collapse
Affiliation(s)
- Dan Nguyen
- University of Massachusetts Chan Medical School, Worcester, MA, United States
| | - Arya Rao
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States; Harvard Medical School, Boston, MA, United States; Department of Radiology, Mass General Brigham, Boston, MA, United States
| | - Aneesh Mazumder
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| | - Marc D Succi
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States; Harvard Medical School, Boston, MA, United States; Department of Radiology, Mass General Brigham, Boston, MA, United States; Mass General Brigham Innovation, Mass General Brigham, Boston, MA, United States.
| |
Collapse
|
32
|
Sarangi PK, Panda BB, P. S, Pattanayak D, Panda S, Mondal H. Exploring Radiology Postgraduate Students' Engagement with Large Language Models for Educational Purposes: A Study of Knowledge, Attitudes, and Practices. Indian J Radiol Imaging 2025; 35:35-42. [PMID: 39697505 PMCID: PMC11651873 DOI: 10.1055/s-0044-1788605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/29/2024] Open
Abstract
Background The integration of large language models (LLMs) into medical education has received increasing attention as a potential tool to enhance learning experiences. However, there remains a need to explore radiology postgraduate students' engagement with LLMs and their perceptions of their utility in medical education. Hence, we conducted this study to investigate radiology postgraduate students' knowledge, attitudes, and practices regarding LLMs in medical education. Materials and Methods A cross-sectional quantitative survey was conducted online using Google Forms. Participants from all over India were recruited via social media platforms and snowball sampling techniques. A previously validated questionnaire was used to assess knowledge, attitudes, and practices regarding LLMs. Descriptive statistical analysis was employed to summarize participants' responses. Results A total of 252 (139 [55.16%] males and 113 [44.84%] females) radiology postgraduate students with a mean age of 28.33 ± 3.32 years participated in the study. Nearly half of the participants (47.62%) were familiar with LLMs, and most supported their potential incorporation alongside traditional teaching-learning tools (71.82%). Participants were open to including LLMs as a learning tool (71.03%) and thought they would provide comprehensive medical information (62.7%). Residents turned to LLMs when they could not find the desired information in books (46.43%) or through Internet search engines (59.13%). The overall scores for knowledge (3.52 ± 0.58), attitude (3.75 ± 0.51), and practice (3.15 ± 0.57) differed significantly (analysis of variance [ANOVA], p < 0.0001), with the highest score in attitude and the lowest in practice. However, no significant differences were found in the scores for knowledge (p = 0.64), attitude (p = 0.99), and practice (p = 0.25) depending on the year of training. Conclusion Radiology postgraduate students are familiar with LLMs and recognize their potential benefits in postgraduate radiology education. Although they have a positive attitude toward the use of LLMs, they are concerned about their limitations and use them only in limited situations for educational purposes.
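The one-way ANOVA comparing knowledge, attitude, and practice scores can be reproduced as follows; the Likert scores below are simulated from the reported means and standard deviations rather than taken from the survey data.

```python
# One-way ANOVA comparing mean knowledge, attitude, and practice scores (simulated Likert data).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)
knowledge = rng.normal(3.52, 0.58, 252)   # simulated from reported mean ± SD
attitude  = rng.normal(3.75, 0.51, 252)
practice  = rng.normal(3.15, 0.57, 252)

f_stat, p = f_oneway(knowledge, attitude, practice)
print(f"F = {f_stat:.2f}, p = {p:.2e}")
```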
Collapse
Affiliation(s)
- Pradosh Kumar Sarangi
- Department of Radiodiagnosis, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Braja Behari Panda
- Department of Radiodiagnosis, Veer Surendra Sai Institute of Medical Sciences and Research, Burla, Odisha, India
| | - Sanjay P.
- Department of Radiodiagnosis, Mysore Medical College and Research Institute, Mysore, India
| | - Debabrata Pattanayak
- Department of Radiodiagnosis, Veer Surendra Sai Institute of Medical Sciences and Research, Burla, Odisha, India
| | - Swaha Panda
- Department of Otorhinolaryngology and Head and Neck Surgery, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| |
Collapse
|
33
|
Kim J, Kincaid JW, Rao AS, Lie W, Fuh L, Landman AB, Succi MD. Risk stratification of potential drug interactions involving common over-the-counter medications and herbal supplements by a large language model. J Am Pharm Assoc (2003) 2025; 65:102304. [PMID: 39613295 PMCID: PMC12042285 DOI: 10.1016/j.japh.2024.102304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Revised: 11/19/2024] [Accepted: 11/19/2024] [Indexed: 12/01/2024]
Abstract
BACKGROUND As polypharmacy, the use of over-the-counter (OTC) drugs, and herbal supplements becomes increasingly prevalent, the potential for adverse drug-drug interactions (DDIs) poses significant challenges to patient safety and health care outcomes. OBJECTIVE This study evaluates the capacity of Generative Pre-trained Transformer (GPT) models to accurately assess DDIs involving prescription drugs (Rx) with OTC medications and herbal supplements. METHODS Leveraging a popular subscription-based tool (Lexicomp), we compared the risk ratings assigned by these models to 43 Rx-OTC and 30 Rx-herbal supplement pairs. RESULTS Our findings reveal that all models generally underperform, with accuracies below 50% and poor agreement with Lexicomp standards as measured by Cohen's kappa. Notably, GPT-4 and GPT-4o demonstrated a modest improvement in identifying higher-risk interactions compared to GPT-3.5. CONCLUSION These results highlight the challenges and limitations of using off-the-shelf large language models for guidance in DDI assessment.
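Agreement with a reference standard over ordinal risk categories is often summarized with accuracy and a (weighted) Cohen's kappa; the sketch below uses hypothetical ratings, and the A-X category labels are only an assumed stand-in for Lexicomp's risk-rating scheme.

```python
# Agreement between model and reference DDI risk categories using accuracy and weighted kappa
# (hypothetical ordinal ratings; categories ordered A < B < C < D < X).
from sklearn.metrics import cohen_kappa_score

levels = {"A": 0, "B": 1, "C": 2, "D": 3, "X": 4}
reference = ["C", "D", "B", "C", "X", "D", "C", "B"]
model     = ["B", "D", "B", "B", "D", "C", "C", "A"]

y_ref  = [levels[r] for r in reference]
y_pred = [levels[r] for r in model]

accuracy = sum(r == p for r, p in zip(y_ref, y_pred)) / len(y_ref)
kappa = cohen_kappa_score(y_ref, y_pred, weights="quadratic")  # penalizes larger ordinal disagreements more
print(f"accuracy = {accuracy:.2f}, quadratic-weighted kappa = {kappa:.2f}")
```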
Collapse
Affiliation(s)
- John Kim
- Harvard Medical School, Boston, MA, USA; and Medically Engineered Solutions in Healthcare, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA
| | - John W.R. Kincaid
- Harvard Medical School, Boston, MA, USA; and Medically Engineered Solutions in Healthcare, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA
| | - Arya S. Rao
- Harvard Medical School, Boston, MA, USA; and Medically Engineered Solutions in Healthcare, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA
| | - Winston Lie
- Harvard Medical School, Boston, MA, USA; and Medically Engineered Solutions in Healthcare, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA
| | - Lanting Fuh
- Department of Pharmacy, Massachusetts General Hospital, Boston, MA, USA; and Mass General Brigham, Boston, MA
| | - Adam B. Landman
- Harvard Medical School, Boston, MA, USA; and Mass General Brigham, Boston, MA
| | - Marc D. Succi
- Department of Radiology, Harvard Medical School, Boston, MA, USA; Medically Engineered Solutions in Healthcare, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA; and Massachusetts General Hospital, Boston, MA
| |
Collapse
|
34
|
Kahalian S, Rajabzadeh M, Öçbe M, Medisoglu MS. ChatGPT-4.0 in oral and maxillofacial radiology: prediction of anatomical and pathological conditions from radiographic images. Folia Med (Plovdiv) 2024; 66:863-868. [PMID: 39774357 DOI: 10.3897/folmed.66.e135584] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2024] [Accepted: 11/26/2024] [Indexed: 01/11/2025] Open
Abstract
INTRODUCTION ChatGPT has the ability to generate human-like text and to analyze and interpret medical images using natural language processing (NLP) algorithms. It can generate real-time diagnoses, recognize patterns, and learn from previous cases to improve accuracy by combining patient history, symptoms, and image characteristics. It has recently been used for learning about maxillofacial diseases, writing and translating radiology reports, and identifying anatomical landmarks, among other applications.
Collapse
Affiliation(s)
- Shila Kahalian
- Kocaeli Health and Technology University, Kocaeli, Turkiye
| | | | - Melisa Öçbe
- Kocaeli Health and Technology University, Kocaeli, Turkiye
| | | |
Collapse
|
35
|
Succi MD, Rao AS. Beyond the AJR: Towards Large Language Models for Radiology Decision-Making in the Emergency Department. AJR Am J Roentgenol 2024. [PMID: 39660831 DOI: 10.2214/ajr.24.32465] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2024]
Affiliation(s)
- Marc D Succi
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
- Innovation Office, Mass General Brigham, Boston, MA, United States
- Enterprise Radiology, Mass General Brigham, Boston, MA, United States
| | - Arya S Rao
- Harvard Medical School, Boston, MA, United States
- Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Mass General Brigham, Boston, MA, United States
| |
Collapse
|
36
|
Ye Y, Zeng W, Chen J, Liu L. Application value of generative artificial intelligence in the field of stomatology. HUA XI KOU QIANG YI XUE ZA ZHI = HUAXI KOUQIANG YIXUE ZAZHI = WEST CHINA JOURNAL OF STOMATOLOGY 2024; 42:810-815. [PMID: 39610079 PMCID: PMC11669926 DOI: 10.7518/hxkq.2024.2024144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Subscribe] [Scholar Register] [Received: 04/12/2024] [Revised: 10/28/2024] [Indexed: 11/30/2024]
Abstract
OBJECTIVES This study aims to compare and analyze three types of generative artificial intelligence (GAI) and explore their application value and existing problems in the field of stomatology in the Chinese context. METHODS A total of 36 questions were designed, covering all professional areas of stomatology. The questions encompassed various aspects, including medical records, professional knowledge, and translation and editing. These questions were submitted to ChatGPT-4 Turbo, Gemini (2024.2), and ERNIE Bot 4.0. After the answers were obtained, a blinded evaluation was conducted by three experienced oral medicine physicians using a four-point Likert scale, and the value of GAI in various application scenarios was evaluated. RESULTS For clinical documentation and image production, Gemini scored 45, ERNIE Bot 38, and ChatGPT 33. For research assistance, Gemini scored 45, ERNIE Bot 39, and ChatGPT 35. Teaching assistance capabilities were rated 54 for ERNIE Bot, 50 for Gemini, and 48 for ChatGPT. In patient consultation and guidance, Gemini scored 78, ERNIE Bot 59, and ChatGPT 48. Overall, the total scores were 218, 190, and 164 for Gemini, ERNIE Bot, and ChatGPT, respectively. Among GAI applications, the top-scoring categories were article translation and polishing (26), patient-doctor communication documentation (23), and popular science content creation (23); the lowest-scoring categories were literature search and reporting (13) and image generation (12). CONCLUSIONS In the Chinese context, Gemini showed the highest application value, followed by ERNIE Bot and ChatGPT. GAI shows significant value in translation, patient-doctor communication, and popular science writing; however, its value in literature search, reporting, and image generation remains limited.
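As a small illustration (hypothetical ratings, not the study's data): the per-model totals above can be reproduced by summing blinded Likert ratings from the three raters across questions, as in this minimal sketch.

```python
# Hypothetical 4-point Likert ratings: one inner list per question, one score per rater.
ratings = {
    "Gemini":    [[4, 3, 4], [3, 4, 4]],
    "ERNIE Bot": [[3, 3, 4], [3, 3, 3]],
    "ChatGPT":   [[3, 2, 3], [3, 3, 2]],
}
totals = {model: sum(sum(q) for q in scores) for model, scores in ratings.items()}
print(totals)  # {'Gemini': 22, 'ERNIE Bot': 19, 'ChatGPT': 16}
```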
Collapse
Affiliation(s)
- Yuanlong Ye
- State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Dept. of Traumatic and Plastic Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China
| | - Wei Zeng
- State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Dept. of Traumatic and Plastic Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China
| | - Jinlong Chen
- State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Dept. of Traumatic and Plastic Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China
| | - Lei Liu
- State Key Laboratory of Oral Diseases & National Center for Stomatology & National Clinical Research Center for Oral Diseases & Dept. of Traumatic and Plastic Surgery, West China Hospital of Stomatology, Sichuan University, Chengdu 610041, China
| |
Collapse
|
37
|
Horiuchi D, Tatekawa H, Oura T, Oue S, Walston SL, Takita H, Matsushita S, Mitsuyama Y, Shimono T, Miki Y, Ueda D. Comparing the Diagnostic Performance of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and Radiologists in Challenging Neuroradiology Cases. Clin Neuroradiol 2024; 34:779-787. [PMID: 38806794 DOI: 10.1007/s00062-024-01426-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2024] [Accepted: 05/06/2024] [Indexed: 05/30/2024]
Abstract
PURPOSE To compare the diagnostic performance of Generative Pre-trained Transformer (GPT)-4-based ChatGPT, GPT-4 with vision (GPT-4V)-based ChatGPT, and radiologists in challenging neuroradiology cases. METHODS We collected 32 consecutive "Freiburg Neuropathology Case Conference" cases published in the journal Clinical Neuroradiology between March 2016 and December 2023. We input the medical history and imaging findings into GPT-4-based ChatGPT and the medical history and images into GPT-4V-based ChatGPT, and each generated a diagnosis for every case. Six radiologists (three radiology residents and three board-certified radiologists) independently reviewed all cases and provided diagnoses. The diagnostic accuracy of ChatGPT and the radiologists was evaluated against the published ground truth. Chi-square tests were performed to compare the diagnostic accuracy of GPT-4-based ChatGPT, GPT-4V-based ChatGPT, and the radiologists. RESULTS GPT-4-based and GPT-4V-based ChatGPT achieved accuracy rates of 22% (7/32) and 16% (5/32), respectively. The three radiology residents achieved accuracy rates of 28% (9/32), 31% (10/32), and 28% (9/32), and the three board-certified radiologists achieved 38% (12/32), 47% (15/32), and 44% (14/32). The diagnostic accuracy of GPT-4-based ChatGPT was lower than that of each radiologist, although not significantly (all p > 0.07). The diagnostic accuracy of GPT-4V-based ChatGPT was also lower than that of each radiologist and was significantly lower than that of two of the board-certified radiologists (p = 0.02 and 0.03); the differences from the radiology residents and the remaining board-certified radiologist were not significant (all p > 0.09). CONCLUSION Although GPT-4-based ChatGPT demonstrated relatively higher diagnostic performance than GPT-4V-based ChatGPT, neither reached the performance level of radiology residents or board-certified radiologists in challenging neuroradiology cases.
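For illustration only (not the study's code): a minimal sketch of the kind of 2x2 chi-square comparison described above, here contrasting GPT-4V-based ChatGPT (5/32 correct) with one of the board-certified radiologists (15/32 correct).

```python
# 2x2 contingency table of correct vs. incorrect diagnoses for two readers.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[5, 27],    # GPT-4V-based ChatGPT: correct, incorrect
                  [15, 17]])  # board-certified radiologist: correct, incorrect
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```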
Collapse
Affiliation(s)
- Daisuke Horiuchi
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Tatsushi Oura
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Satoshi Oue
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Shu Matsushita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yasuhito Mitsuyama
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Taro Shimono
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
| | - Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan.
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan.
| |
Collapse
|
38
|
Capiro N, Fischer C, Sadigh G. Reply to "Enhancing breast imaging strategies: The role of ChatGPT in optimizing screening pathways". Clin Imaging 2024; 116:110313. [PMID: 39461250 DOI: 10.1016/j.clinimag.2024.110313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2024] [Accepted: 10/07/2024] [Indexed: 10/29/2024]
Affiliation(s)
- Nina Capiro
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States of America.
| | - Cheryce Fischer
- Department of Radiological Sciences, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, United States of America
| | - Gelareh Sadigh
- Department of Radiological Sciences, University of California, Irvine, Orange, CA, United States of America
| |
Collapse
|
39
|
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of large language models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluation in the medical field and analyzes the research methods they used, aiming to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of evaluation methods, the number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluations primarily took the form of test examinations (n = 53, 37.3%) or assessment by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). In the examination-based evaluations, the most common format used 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. In the evaluations by medical professionals, most used 50 or fewer queries (n = 54, 64.3%), two evaluators was the most common design (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required on the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving it. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
| |
Collapse
|
40
|
Ho CN, Tian T, Ayers AT, Aaron RE, Phillips V, Wolf RM, Mathioudakis N, Dai T, Klonoff DC. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Med Inform Decis Mak 2024; 24:357. [PMID: 39593074 PMCID: PMC11590327 DOI: 10.1186/s12911-024-02757-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2024] [Accepted: 11/08/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND The large language models (LLMs) released since November 30, 2022, most notably ChatGPT, have prompted a shift in attention toward their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMA/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". CONCLUSIONS The criteria most frequently used to define high-quality LLM outputs have been selected consistently by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of the qualitative evaluation metrics used to assess the quality of LLM outputs could facilitate future research on LLMs in healthcare.
Collapse
Affiliation(s)
- Cindy N Ho
- Diabetes Technology Society, Burlingame, CA, USA
| | - Tiffany Tian
- Diabetes Technology Society, Burlingame, CA, USA
| | | | | | - Vidith Phillips
- School of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Risa M Wolf
- Division of Pediatric Endocrinology, The Johns Hopkins Hospital, Baltimore, MD, USA
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
| | | | - Tinglong Dai
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
- Carey Business School, Johns Hopkins University, Baltimore, MD, USA
- School of Nursing, Johns Hopkins University, Baltimore, MD, USA
| | - David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, 100 South San Mateo Drive, Room 1165, San Mateo, CA, 94401, USA.
| |
Collapse
|
41
|
Park HJ, Huh JY, Chae G, Choi MG. Extraction of clinical data on major pulmonary diseases from unstructured radiologic reports using a large language model. PLoS One 2024; 19:e0314136. [PMID: 39585830 PMCID: PMC11588275 DOI: 10.1371/journal.pone.0314136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2024] [Accepted: 11/06/2024] [Indexed: 11/27/2024] Open
Abstract
Despite significant strides in big data technology, extracting information from unstructured clinical data remains a formidable challenge. This study investigated the utility of large language models (LLMs) for extracting clinical data from unstructured radiologic reports without additional training. In this retrospective study, 1800 radiologic reports (600 from each of three university hospitals) were collected, and seven pulmonary outcomes were defined. Three pulmonology-trained specialists determined the presence or absence of each disease, and the gold standard was defined as agreement between at least two pulmonologists. Data extraction from the reports was performed using Google Gemini Pro 1.0, OpenAI's GPT-3.5, and GPT-4. The study evaluated the performance of the three LLMs in identifying seven pulmonary diseases (active tuberculosis, emphysema, interstitial lung disease, lung cancer, pleural effusion, pneumonia, and pulmonary edema) from chest radiography and computed tomography reports. All models exhibited high accuracy (0.85-1.00) for most conditions. GPT-4 consistently outperformed its counterparts, with a sensitivity of 0.71-1.00, a specificity of 0.89-1.00, and accuracies of 0.89 and 0.99 across the two modalities, underscoring its superior capability in interpreting radiologic reports. Notably, accuracy for pleural effusion and emphysema on chest radiographs and for pulmonary edema on chest computed tomography scans reached 0.99. The proficiency of LLMs, particularly GPT-4, in accurately classifying unstructured radiologic data suggests their potential as alternatives to the traditional manual chart reviews conducted by clinicians.
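For illustration only (hypothetical labels, not the study's data): a minimal sketch of how sensitivity, specificity, and accuracy of an LLM's binary extraction (disease documented / not documented) can be computed against an adjudicated gold standard.

```python
# Hypothetical binary labels: 1 = disease documented in the report, 0 = not documented.
from sklearn.metrics import confusion_matrix

gold = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # pulmonologist-adjudicated gold standard
llm  = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # LLM-extracted label

tn, fp, fn, tp = confusion_matrix(gold, llm).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, accuracy={accuracy:.2f}")
```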
Collapse
Affiliation(s)
- Hyung Jun Park
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Shihwa Medical Center, Siheung, Korea
| | - Jin-Young Huh
- Department of Internal Medicine, Division of Pulmonary, Allergy and Critical Care Medicine, Chung-Ang University Gwangmyeong Hospital, Gwangmyeong, Korea
| | - Ganghee Chae
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Ulsan University Hospital, University of Ulsan College of Medicine, Ulsan, Korea
| | - Myeong Geun Choi
- Department of Internal Medicine, Division of Pulmonary and Critical Care Medicine, Mokdong Hospital, College of Medicine, Ewha Womans University, Seoul, Korea
| |
Collapse
|
42
|
Sacoransky E, Kwan BYM, Soboleski D. ChatGPT and assistive AI in structured radiology reporting: A systematic review. Curr Probl Diagn Radiol 2024; 53:728-737. [PMID: 39004580 DOI: 10.1067/j.cpradiol.2024.07.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2024] [Revised: 06/08/2024] [Accepted: 07/08/2024] [Indexed: 07/16/2024]
Abstract
INTRODUCTION The rise of transformer-based large language models (LLMs), such as ChatGPT, has captured global attention amid recent advances in artificial intelligence (AI). ChatGPT shows growing potential in structured radiology reporting, a field in which AI has traditionally focused on image analysis. METHODS A comprehensive search of MEDLINE and Embase was conducted from inception through May 2024, and primary studies discussing ChatGPT's role in structured radiology reporting were selected based on their content. RESULTS Of the 268 articles screened, eight were ultimately included in this review. These articles explored various applications of ChatGPT, such as generating structured reports from unstructured reports, extracting data from free text, generating impressions from radiology findings, and creating structured reports from imaging data. All studies expressed optimism regarding ChatGPT's potential to aid radiologists, though common critiques included concerns about data privacy, reliability, medical errors, and the lack of medicine-specific training. CONCLUSION ChatGPT and assistive AI have significant potential to transform radiology reporting, enhancing accuracy and standardization while optimizing healthcare resources. Future developments may involve integrating dynamic few-shot prompting, ChatGPT, and retrieval-augmented generation (RAG) into diagnostic workflows. Continued research, development, and ethical oversight are crucial to fully realize AI's potential in radiology.
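For illustration only (not drawn from the reviewed studies): a minimal sketch of prompting a GPT model through the OpenAI Python client to restructure a free-text report into a fixed template. The model name ("gpt-4o") and the section template are assumptions, and an API key is required.

```python
# Hypothetical report-structuring call; requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
free_text_report = "CT chest: small right pleural effusion, no pneumothorax ..."

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name for this sketch
    messages=[
        {"role": "system",
         "content": "Rewrite the radiology report into the sections: "
                    "Indication, Technique, Findings, Impression."},
        {"role": "user", "content": free_text_report},
    ],
)
print(response.choices[0].message.content)
```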
Collapse
Affiliation(s)
- Ethan Sacoransky
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada.
| | - Benjamin Y M Kwan
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| | - Donald Soboleski
- Queen's University School of Medicine, 15 Arch St, Kingston, ON K7L 3L4, Canada; Department of Diagnostic Radiology, Kingston Health Sciences Centre, Kingston, ON, Canada
| |
Collapse
|
43
|
Nguyen D, MacKenzie A, Kim YH. Encouragement vs. liability: How prompt engineering influences ChatGPT-4's radiology exam performance. Clin Imaging 2024; 115:110276. [PMID: 39288636 DOI: 10.1016/j.clinimag.2024.110276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 08/21/2024] [Accepted: 08/30/2024] [Indexed: 09/19/2024]
Abstract
Large language models (LLMs) like ChatGPT-4 hold significant promise for medical applications, especially in the field of radiology. While previous studies have shown the promise of ChatGPT-4 in text-based scenarios, its performance on image-based questions remains suboptimal. This study investigates the impact of prompt engineering on ChatGPT-4's accuracy on the 2022 American College of Radiology In-Training Test Questions for Diagnostic Radiology Residents, which include text-based and image-based questions. Four personas were created, each with a unique prompt, and evaluated using ChatGPT-4. Results indicate that encouraging prompts and prompts disclaiming responsibility led to higher overall accuracy (number of questions answered correctly) than the other personas. Personas that threatened the LLM with legal action or mounting clinical responsibility not only scored lower but also declined to answer questions at a higher rate. These findings highlight the importance of prompt context in optimizing LLM responses and the need for further research to integrate AI responsibly into medical practice.
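For illustration only (hypothetical personas and a toy stand-in for the LLM call, not the study's code): a minimal harness showing how different system-prompt "personas" can be swept over the same question set and compared by accuracy.

```python
# Hypothetical persona prompts; ask_model is a placeholder so the harness runs offline.
personas = {
    "encouraging": "You are a confident radiology educator. Always give your best answer.",
    "legal_threat": "You may face legal action if your answer is wrong.",
}

def ask_model(system_prompt: str, question: str) -> str:
    # Placeholder for a real LLM API call; returns a fixed letter for demonstration.
    return "A"

def accuracy(system_prompt: str, questions: list[str], answer_key: list[str]) -> float:
    correct = sum(ask_model(system_prompt, q) == a
                  for q, a in zip(questions, answer_key))
    return correct / len(questions)

questions = ["Q1 ...", "Q2 ...", "Q3 ..."]
answer_key = ["A", "C", "B"]
for name, prompt in personas.items():
    print(name, accuracy(prompt, questions, answer_key))
```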
Collapse
Affiliation(s)
- Daniel Nguyen
- University of Massachusetts Chan Medical School, Worcester, MA, United States of America
| | - Allison MacKenzie
- University of Massachusetts Chan Medical School, Worcester, MA, United States of America
| | - Young H Kim
- Department of Radiology, University of Massachusetts Chan Medical School, Worcester, MA, United States of America.
| |
Collapse
|
44
|
Spuur K, Currie G, Al-Mousa D, Pape R. Suitability of ChatGPT as a Source of Patient Information for Screening Mammography. Health Promot Pract 2024:15248399241285060. [PMID: 39392690 DOI: 10.1177/15248399241285060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/13/2024]
Abstract
ChatGPT-3.5 and ChatGPT-4 were released publicly in late November 2022 and March 2023, respectively, and have emerged as convenient sources of patient health education and information, including for screening mammography. ChatGPT-4 offers enhanced capabilities; however, it is only available by paid subscription. The purported benefits of ChatGPT for health education need to be objectively evaluated. To assess performance differences, ChatGPT-3.5 and ChatGPT-4 were used between 13 April and 29 May 2023 to generate breast screening patient information sheets, which were evaluated using the Patient Education Materials Assessment Tool for printed materials (PEMAT-P) and the CDC Clear Communication Index (CDC Index) Score Sheet and benchmarked against gold-standard content in BreastScreen NSW's patient information sheet. Mean scores are reported for comparison. GPT-3.5 provided an appropriate tone and current information but lacked accuracy and omitted key insights: PEMAT-P understandability 68.0% (SD = 6.56) and actionability 36.7% (SD = 20.4); CDC Index 58.8% (SD = 15.3). GPT-4 was deemed superior to GPT-3.5 but still included several key omissions: PEMAT-P understandability 75.0% (SD = 17) and actionability 53.3% (SD = 11.54); CDC Index 66.0% (SD = 4.1). Both ChatGPT versions exhibited poor understandability and actionability and were unclear in their messaging. People with poor health literacy will not benefit from accessing current versions of ChatGPT and may be further disadvantaged if they do not have access to a paid subscription. The evidence indicates that ChatGPT is an unreliable and inaccurate source of information on breast screening, which may undermine participation and risk increased morbidity and mortality from breast cancer. ChatGPT may also increase the demand on health care educators to rectify misinformation.
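For illustration only (hypothetical item ratings): PEMAT-P domain scores are typically reported as the percentage of applicable items rated "agree", which the following minimal sketch computes.

```python
# Hypothetical PEMAT-P style scoring: 1 = agree, 0 = disagree, None = not applicable.
def pemat_percent(item_ratings):
    scored = [r for r in item_ratings if r is not None]  # exclude N/A items
    return 100 * sum(scored) / len(scored)

understandability_items = [1, 1, 0, 1, None, 1, 0, 1, 1, 1, 0, 1, 1]
print(f"Understandability: {pemat_percent(understandability_items):.0f}%")
```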
Collapse
Affiliation(s)
- Kelly Spuur
- Charles Sturt University, Wagga Wagga, New South Wales, Australia
| | - Geoff Currie
- Charles Sturt University, Wagga Wagga, New South Wales, Australia
| | - Dana Al-Mousa
- Charles Sturt University, Wagga Wagga, New South Wales, Australia
| | - Ruth Pape
- Charles Sturt University, Wagga Wagga, New South Wales, Australia
| |
Collapse
|
45
|
Nógrádi B, Polgár TF, Meszlényi V, Kádár Z, Hertelendy P, Csáti A, Szpisjak L, Halmi D, Erdélyi-Furka B, Tóth M, Molnár F, Tóth D, Bősze Z, Boda K, Klivényi P, Siklós L, Patai R. ChatGPT M.D.: Is there any room for generative AI in neurology? PLoS One 2024; 19:e0310028. [PMID: 39383119 PMCID: PMC11463752 DOI: 10.1371/journal.pone.0310028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Accepted: 08/22/2024] [Indexed: 10/11/2024] Open
Abstract
ChatGPT, a general artificial intelligence, has been recognized as a powerful tool for scientific writing and programming, but its use as a medical tool has been largely overlooked. Its general accessibility, rapid response time, and comprehensive training database might enable ChatGPT to serve as a diagnostic augmentation tool in certain clinical settings. The diagnostic process in neurology is often challenging and complex. In certain time-sensitive scenarios, rapid evaluation and diagnostic decisions are needed, while in other cases clinicians are faced with rare disorders and atypical disease manifestations. Because of these factors, diagnostic accuracy in neurology is often suboptimal. Here we evaluated whether ChatGPT can be used as a valuable and innovative diagnostic augmentation tool in various neurological settings. We used synthetic data generated by neurological experts to represent descriptive anamneses of patients with known neurology-related diseases, and then measured the probability that ChatGPT provided an appropriate diagnosis. To contextualize the accuracy of the AI-determined diagnoses, all cases were also cross-validated by other experts and by general medical doctors. We found that the diagnostic accuracy of ChatGPT (ranging from 68.5% ± 3.28% to 83.83% ± 2.73%) can reach that of other experts (81.66% ± 2.02%) and surpasses the probability of an appropriate diagnosis when the examiner is a general medical doctor (57.15% ± 2.64%). Our results showcase the efficacy of general artificial intelligence such as ChatGPT as a diagnostic augmentation tool in medicine. In the future, AI-based supporting tools might be useful adjuncts in medical practice and help improve the diagnostic process in neurology.
Collapse
Affiliation(s)
- Bernát Nógrádi
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Tamás Ferenc Polgár
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
- Theoretical Medicine Doctoral School, University of Szeged, Szeged, Hungary
| | - Valéria Meszlényi
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Zalán Kádár
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
| | - Péter Hertelendy
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Anett Csáti
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - László Szpisjak
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Dóra Halmi
- Metabolic Diseases and Cell Signaling Research Group, Department of Biochemistry, Albert Szent-Györgyi Medical School, University of Szeged, Szeged, Hungary
- Interdisciplinary Medicine Doctoral School, University of Szeged, Szeged, Hungary
| | - Barbara Erdélyi-Furka
- Metabolic Diseases and Cell Signaling Research Group, Department of Biochemistry, Albert Szent-Györgyi Medical School, University of Szeged, Szeged, Hungary
- Interdisciplinary Medicine Doctoral School, University of Szeged, Szeged, Hungary
| | - Máté Tóth
- Second Department of Internal Medicine and Cardiology Centre, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Fanny Molnár
- Department of Family Medicine, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Dávid Tóth
- Department of Oncotherapy, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Zsófia Bősze
- Department of Internal Medicine, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - Krisztina Boda
- Department of Medical Physics and Informatics, University of Szeged, Szeged, Hungary
| | - Péter Klivényi
- Department of Neurology, Albert Szent-Györgyi Health Centre, University of Szeged, Szeged, Hungary
| | - László Siklós
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
| | - Roland Patai
- Institute of Biophysics, HUN-REN Biological Research Centre, Szeged, Hungary
| |
Collapse
|
46
|
Griewing S, Lechner F, Gremke N, Lukac S, Janni W, Wallwiener M, Wagner U, Hirsch M, Kuhn S. Proof-of-concept study of a small language model chatbot for breast cancer decision support - a transparent, source-controlled, explainable and data-secure approach. J Cancer Res Clin Oncol 2024; 150:451. [PMID: 39382778 PMCID: PMC11464535 DOI: 10.1007/s00432-024-05964-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Accepted: 09/19/2024] [Indexed: 10/10/2024]
Abstract
PURPOSE Large language models (LLMs) show potential for decision support in breast cancer care. Their use in clinical care is currently precluded by a lack of control over the sources used for decision-making, limited explainability of the decision-making process, and health data security issues. The recent development of small language models (SLMs) has been discussed as a way to address these challenges. This preclinical proof-of-concept study tailors an open-source SLM to the German breast cancer guideline (BC-SLM) and evaluates its initial clinical accuracy and technical functionality in a preclinical simulation. METHODS A multidisciplinary tumor board (MTB) was used as the gold standard to assess initial clinical accuracy in terms of the concordance of the BC-SLM with the MTB, compared with two publicly available LLMs, ChatGPT-3.5 and ChatGPT-4. The study includes 20 fictional patient profiles and recommendations for 5 treatment modalities, resulting in 100 binary treatment recommendations (recommended or not recommended). Statistical evaluation includes percentage concordance with the MTB and Cohen's kappa (κ). Technical functionality was assessed qualitatively in terms of local hosting, adherence to the guideline, and information retrieval. RESULTS Overall concordance was 86% for the BC-SLM (κ = 0.721, p < 0.001), 90% for ChatGPT-4 (κ = 0.820, p < 0.001), and 83% for ChatGPT-3.5 (κ = 0.661, p < 0.001). Concordance for individual treatment modalities ranged from 65% to 100% for the BC-SLM, 85% to 100% for ChatGPT-4, and 55% to 95% for ChatGPT-3.5. The BC-SLM was locally functional, adhered to the standards of the German breast cancer guideline, and provided referenced sections for its decision-making. CONCLUSION The tailored BC-SLM shows initial clinical accuracy and technical functionality, with concordance to the MTB comparable to that of publicly available LLMs such as ChatGPT-4 and ChatGPT-3.5. This serves as a proof of concept for adapting an SLM to an oncological disease and its guideline, addressing prevailing issues with LLMs by ensuring decision transparency, explainability, source control, and data security, and represents a necessary step toward clinical validation and the safe use of language models in clinical oncology.
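For illustration only (hypothetical recommendations, not the study's data): a minimal sketch of computing overall and per-modality concordance over a grid of 20 patient profiles by 5 treatment modalities, i.e., 100 binary recommendations, as described above.

```python
# Hypothetical binary recommendation grids: 1 = recommended, 0 = not recommended.
import numpy as np

rng = np.random.default_rng(1)
mtb = rng.integers(0, 2, size=(20, 5))      # tumor-board (gold standard) decisions
model = mtb.copy()
flip = rng.random(mtb.shape) < 0.14         # simulate disagreement on ~14% of decisions
model[flip] = 1 - model[flip]

overall = (model == mtb).mean()
per_modality = (model == mtb).mean(axis=0)
print(f"overall concordance {overall:.0%}", per_modality)
```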
Collapse
Affiliation(s)
- Sebastian Griewing
- Institute for Digital Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany.
- Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, Palo Alto, CA, USA.
- Marburg Gynecological Cancer Center, Giessen and Marburg University Hospital, Philipps-University Marburg, Marburg, Germany.
- Commission Digital Medicine, German Society for Gynecology and Obstetrics (DGGG), Berlin, Germany.
| | - Fabian Lechner
- Institute for Digital Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany
| | - Niklas Gremke
- Marburg Gynecological Cancer Center, Giessen and Marburg University Hospital, Philipps-University Marburg, Marburg, Germany
| | - Stefan Lukac
- Department of Obstetrics and Gynecology, University Hospital Ulm, University of Ulm, Ulm, Germany
- Commission Digital Medicine, German Society for Gynecology and Obstetrics (DGGG), Berlin, Germany
| | - Wolfgang Janni
- Department of Obstetrics and Gynecology, University Hospital Ulm, University of Ulm, Ulm, Germany
| | - Markus Wallwiener
- Halle Gynecological Cancer Center, Halle University Hospital, Martin-Luther-University Halle-Wittenberg, Halle (Saale), Germany
- Commission Digital Medicine, German Society for Gynecology and Obstetrics (DGGG), Berlin, Germany
| | - Uwe Wagner
- Marburg Gynecological Cancer Center, Giessen and Marburg University Hospital, Philipps-University Marburg, Marburg, Germany
- Commission Digital Medicine, German Society for Gynecology and Obstetrics (DGGG), Berlin, Germany
| | - Martin Hirsch
- Institute for Artificial Intelligence in Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany
| | - Sebastian Kuhn
- Institute for Digital Medicine, University Hospital Giessen and Marburg, Philipps-University Marburg, Marburg, Germany
| |
Collapse
|
47
|
Mankowski MA, Jaffe IS, Xu J, Bae S, Oermann EK, Aphinyanaphongs Y, McAdams-DeMarco MA, Lonze BE, Orandi BJ, Stewart D, Levan M, Massie A, Gentry S, Segev DL. ChatGPT Solving Complex Kidney Transplant Cases: A Comparative Study With Human Respondents. Clin Transplant 2024; 38:e15466. [PMID: 39329220 PMCID: PMC11441623 DOI: 10.1111/ctr.15466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2024] [Revised: 08/26/2024] [Accepted: 09/06/2024] [Indexed: 09/28/2024]
Abstract
INTRODUCTION ChatGPT has shown the ability to answer clinical questions in general medicine but may be constrained by the specialized nature of kidney transplantation. Thus, it is important to explore how ChatGPT can be used in kidney transplantation and how its knowledge compares to that of human respondents. METHODS We prompted ChatGPT versions 3.5, 4, and 4 Visual (4V) with 12 multiple-choice questions related to six kidney transplant cases from the 2013 to 2015 American Society of Nephrology (ASN) fellowship program quizzes. We compared the performance of ChatGPT with that of US nephrology fellowship program directors, nephrology fellows, and the audience of the ASN's annual Kidney Week meeting. RESULTS Overall, ChatGPT 4V correctly answered 10 of 12 questions, a performance comparable to that of nephrology fellows (the group majority correctly answered 9 of 12 questions) and training program directors (11 of 12), and surpassing ChatGPT 4 (7 of 12 correct) and 3.5 (5 of 12). All three ChatGPT versions failed to answer correctly the questions on which consensus among human respondents was low. CONCLUSION Each iterative version of ChatGPT performed better than the prior version, with version 4V achieving performance on par with nephrology fellows and training program directors. While it shows promise in understanding and answering kidney transplantation questions, ChatGPT should be seen as a complementary tool to human expertise rather than a replacement.
Collapse
Affiliation(s)
- Michal A Mankowski
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Ian S Jaffe
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Jingzhi Xu
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Sunjae Bae
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| | - Eric K Oermann
- Department of Neurosurgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Yindalon Aphinyanaphongs
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
- Department of Medicine, NYU Grossman School of Medicine, New York, New York, USA
| | - Mara A McAdams-DeMarco
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| | - Bonnie E Lonze
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Babak J Orandi
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Medicine, NYU Grossman School of Medicine, New York, New York, USA
| | - Darren Stewart
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
| | - Macey Levan
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| | - Allan Massie
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| | - Sommer Gentry
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| | - Dorry L Segev
- Department of Surgery, NYU Grossman School of Medicine, New York, New York, USA
- Department of Population Health, NYU Grossman School of Medicine, New York, New York, USA
| |
Collapse
|
48
|
Reading Turchioe M, Kisselev S, Van Bulck L, Bakken S. Increasing Generative Artificial Intelligence Competency among Students Enrolled in Doctoral Nursing Research Coursework. Appl Clin Inform 2024; 15:842-851. [PMID: 39053615 PMCID: PMC11483171 DOI: 10.1055/a-2373-3151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2024] [Accepted: 07/24/2024] [Indexed: 07/27/2024] Open
Abstract
BACKGROUND Generative artificial intelligence (AI) tools may soon be integrated into health care practice and research. Nurses in leadership roles, many of whom are doctorally prepared, will need to determine whether and how to integrate them in a safe and useful way. OBJECTIVE This study aimed to develop and evaluate a brief intervention to increase PhD nursing students' knowledge of appropriate applications for using generative AI tools in health care. METHODS We created didactic lectures and laboratory-based activities to introduce generative AI to students enrolled in a nursing PhD data science and visualization course. Students were provided with a subscription to Chat Generative Pretrained Transformer (ChatGPT) 4.0, a general-purpose generative AI tool, for use in and outside the class. During the didactic portion, we described generative AI and its current and potential future applications in health care, including examples of appropriate and inappropriate applications. In the laboratory sessions, students were given three tasks representing different use cases of generative AI in health care practice and research (clinical decision support, patient decision support, and scientific communication) and asked to engage with ChatGPT on each. Students (n = 10) independently wrote a brief reflection for each task evaluating safety (accuracy, hallucinations) and usability (ease of use, usefulness, and intention to use in the future). Reflections were analyzed using directed content analysis. RESULTS Students were able to identify the strengths and limitations of ChatGPT in completing all three tasks and developed opinions on whether they would feel comfortable using ChatGPT for similar tasks in the future. All of them reported increasing their self-rated competency in generative AI by one to two points on a five-point rating scale. CONCLUSION This brief educational intervention supported doctoral nursing students in understanding the appropriate uses of ChatGPT, which may support their ability to appraise and use these tools in their future work.
Collapse
Affiliation(s)
| | - Sergey Kisselev
- Columbia University School of Nursing, New York, New York, United States
| | - Liesbet Van Bulck
- Department of Public Health and Primary Care, KU Leuven - University of Leuven, Leuven, Belgium
| | - Suzanne Bakken
- Columbia University School of Nursing, New York, New York, United States
- Department of Biomedical Informatics, Columbia University, New York, New York, United States
- Data Science Institute, Columbia University, New York, New York, United States
| |
Collapse
|
49
|
Davis NM, El-Said E, Fortune P, Shen A, Succi MD. Transforming Health Care Landscapes: The Lever of Radiology Research and Innovation on Emerging Markets Poised for Aggressive Growth. J Am Coll Radiol 2024; 21:1552-1556. [PMID: 39096946 DOI: 10.1016/j.jacr.2024.07.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Accepted: 07/30/2024] [Indexed: 08/05/2024]
Abstract
Advances in radiology are crucial not only to the future of the field but to medicine as a whole. Here, we present three emerging areas of medicine that are poised to change how health care is delivered: hospital at home, artificial intelligence, and precision medicine. We illustrate how advances in radiological tools and technologies are helping to fuel the growth of these markets in the United States and across the globe.
Collapse
Affiliation(s)
- Nicole M Davis
- Innovation Office, Mass General Brigham, Somerville, Massachusetts
| | - Ezat El-Said
- Medically Engineered Solutions in Healthcare Incubator, Innovations in Operations Research Center, Massachusetts General Hospital, Boston, Massachusetts; Harvard Medical School, Boston, Massachusetts; Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts
| | - Patrick Fortune
- Vice President, Strategic Innovation Leaders at Mass General Brigham, Innovation Office, Mass General Brigham, Somerville, Massachusetts
| | - Angela Shen
- Innovation Office, Mass General Brigham, Somerville, Massachusetts; Vice President, Strategic Innovation Leaders at Mass General Brigham
| | - Marc D Succi
- Innovation Office, Mass General Brigham, Somerville, Massachusetts; Harvard Medical School, Boston, Massachusetts; Department of Radiology, Massachusetts General Hospital, Boston, Massachusetts; Medically Engineered Solutions in Healthcare Incubator, Innovations in Operations Research Center, Massachusetts General Hospital, Boston, Massachusetts. MDS is the Associate Chair of Innovation and Commercialization at Mass General Brigham Enterprise Radiology; Strategic Innovation Leader at Mass General Brigham Innovation; Founder and Executive Director of the MESH Incubator at Mass General Brigham.
| |
Collapse
|
50
|
Demirci A. A Comparison of ChatGPT and Human Questionnaire Evaluations of the Urological Cancer Videos Most Watched on YouTube. Clin Genitourin Cancer 2024; 22:102145. [PMID: 39033711 DOI: 10.1016/j.clgc.2024.102145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2024] [Revised: 06/21/2024] [Accepted: 06/22/2024] [Indexed: 07/23/2024]
Abstract
AIM To examine the reliability of ChatGPT in evaluating the quality of the medical content of the most-watched YouTube videos on urological cancers. MATERIAL AND METHODS In March 2024, a playlist was created of the 20 most-watched YouTube videos for each type of urological cancer. The video texts were evaluated by ChatGPT and by a urology specialist using the DISCERN-5 and Global Quality Scale (GQS) questionnaires, and the results were compared using the Kruskal-Wallis test. RESULTS For the prostate, bladder, renal, and testicular cancer videos, the median (IQR) DISCERN-5 scores given by the human evaluator and ChatGPT were (Human: 4 [1], 3 [0], 3 [2], 3 [1], P = .11; ChatGPT: 3 [1.75], 3 [1], 3 [2], 3 [0], P = .4, respectively) and the GQS scores were (Human: 4 [1.75], 3 [0.75], 3.5 [2], 3.5 [1], P = .12; ChatGPT: 4 [1], 3 [0.75], 3 [1], 3.5 [1], P = .1, respectively), with no significant difference between the scores. The repeatability of the ChatGPT responses was similar across cancer types: 25% for prostate, 30% for bladder, 30% for renal, and 35% for testicular cancer (P = .92). No statistically significant difference was found between the median (IQR) DISCERN-5 and GQS scores given by the human evaluator and ChatGPT for the content of videos about prostate, bladder, renal, and testicular cancer (P > .05). CONCLUSION Although ChatGPT is successful in evaluating the medical quality of video texts, the results should be interpreted with caution because the repeatability of its responses is low.
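For illustration only (hypothetical scores, not the study's data): a minimal sketch of the Kruskal-Wallis comparison described above, applied to DISCERN-5 ratings for the four cancer-type playlists (20 videos each).

```python
# Hypothetical 5-point DISCERN-5 ratings for 20 videos per cancer type.
from scipy.stats import kruskal

prostate = [4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 3, 4, 5, 4, 4, 3, 4, 4, 5, 4]
bladder  = [3, 3, 3, 4, 3, 3, 2, 3, 3, 4, 3, 3, 3, 2, 3, 3, 4, 3, 3, 3]
renal    = [3, 4, 2, 3, 3, 5, 3, 2, 4, 3, 3, 2, 4, 3, 3, 5, 3, 3, 2, 4]
testis   = [3, 3, 4, 3, 3, 2, 4, 3, 3, 3, 4, 3, 3, 3, 2, 4, 3, 3, 3, 4]

h_stat, p_value = kruskal(prostate, bladder, renal, testis)
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
```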
Collapse
Affiliation(s)
- Aykut Demirci
- Department of Urology, Dr. Abdurrahman Yurtaslan Ankara Oncology Training and Research Hospital, University of Health Sciences, Ankara, Turkey.
| |
Collapse
|