1
Temel MH, Erden Y, Bağcıer F. Evaluating artificial intelligence performance in medical image analysis: Sensitivity, specificity, accuracy, and precision of ChatGPT-4o on Kellgren-Lawrence grading of knee X-ray radiographs. Knee 2025; 55:79-84. [PMID: 40273525 DOI: 10.1016/j.knee.2025.04.008]
Abstract
BACKGROUND Recent advancements in artificial intelligence, including ChatGPT, have enabled its application in medical image analysis. This study aimed to evaluate the sensitivity and specificity of ChatGPT in assessing knee osteoarthritis (KOA) radiographs using the Kellgren-Lawrence (KL) grading system. METHODS A retrospective study was conducted at Izzet Baysal Physical Therapy and Rehabilitation Training and Research Hospital. Anteroposterior weight-bearing knee X-rays from 226 patients (excluding 26 due to prostheses or foreign bodies) were evaluated. Two specialists assessed the radiographs using the KL grading system, with a third specialist resolving discrepancies. ChatGPT-4o evaluated the images using the prompt, "Please evaluate this knee anteroposterior radiographic image according to the Kellgren-Lawrence grading system." Diagnostic accuracy metrics, receiver operating characteristic (ROC) curves, and area under the curve (AUC) values were calculated. RESULTS ChatGPT showed low sensitivity across all grades, and the overall accuracy of the model was 0.230. ROC AUC values were low for all grades: 0.53 for KL grade 0, 0.56 for grade 1, 0.43 for grade 2, 0.54 for grade 3, and 0.49 for grade 4, with micro-, macro-, and weighted-average AUCs of 0.52, 0.51, and 0.52, respectively. CONCLUSIONS The findings of this study highlight the model's inability to reliably distinguish between KL grades, suggesting that its utility in this specific classification task is limited and requires further optimization to improve its predictive accuracy and reliability. The model's current limitations preclude its use as a reliable diagnostic tool, and further refinement is necessary to improve its clinical applicability.
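For orientation only, the per-grade AUC and overall accuracy figures of the kind reported above can be computed with scikit-learn; this is a minimal sketch, not code from the study, and the grade arrays below are invented rather than the study data.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.preprocessing import label_binarize

# Hypothetical grades (0-4): y_true = radiologist consensus, y_pred = model-assigned
y_true = np.array([0, 1, 2, 3, 4, 2, 1, 0, 3, 4, 2, 1])
y_pred = np.array([0, 2, 2, 3, 3, 1, 1, 0, 4, 4, 0, 2])
grades = [0, 1, 2, 3, 4]

print("accuracy:", accuracy_score(y_true, y_pred))

# One-vs-rest AUC per grade; with hard labels, the predicted-grade indicator serves as the score
y_true_bin = label_binarize(y_true, classes=grades)
y_pred_bin = label_binarize(y_pred, classes=grades)
for i, g in enumerate(grades):
    print(f"KL grade {g} AUC:", roc_auc_score(y_true_bin[:, i], y_pred_bin[:, i]))

# Micro-, macro-, and weighted-average AUC across grades
for avg in ("micro", "macro", "weighted"):
    print(f"{avg}-average AUC:", roc_auc_score(y_true_bin, y_pred_bin, average=avg))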
Affiliation(s)
- Mustafa Hüseyin Temel
- Physical Medicine and Rehabilitation Clinic, University of Health Sciences Sultan 2. Abdulhamid Han Training and Research Hospital, İstanbul, Turkey
- Yakup Erden
- Physical Medicine and Rehabilitation Clinic, İzzet Baysal Physical Medicine and Rehabilitation Training and Research Hospital, Bolu, Turkey
- Fatih Bağcıer
- Physical Medicine and Rehabilitation Clinic, Başakşehir Çam and Sakura City Hospital, İstanbul, Turkey
2
Gupta A, Hussain M, Nikhileshwar K, Rastogi A, Rangarajan K. Integrating large language models into radiology workflow: Impact of generating personalized report templates from summary. Eur J Radiol 2025; 189:112198. [PMID: 40435550 DOI: 10.1016/j.ejrad.2025.112198]
Abstract
OBJECTIVE To evaluate feasibility of large language models (LLMs) to convert radiologist-generated report summaries into personalized report templates, and assess its impact on scan reporting time and quality. MATERIALS AND METHODS In this retrospective study, 100 CT scans from oncology patients were randomly divided into two equal sets. Two radiologists generated conventional reports for one set and summary reports for the other, and vice versa. Three LLMs - GPT-4, Google Gemini, and Claude Opus - generated complete reports from the summaries using institution-specific generic templates. Two expert radiologists qualitatively evaluated the radiologist summaries and LLM-generated reports using the ACR RADPEER scoring system, using conventional radiologist reports as reference. Reporting time for conventional versus summary-based reports was compared, and LLM-generated reports were analyzed for errors. Quantitative similarity and linguistic metrics were computed to assess report alignment across models with the original radiologist-generated report summaries. Statistical analyses were performed using Python 3.0 to identify significant differences in reporting times, error rates and quantitative metrics. RESULTS The average reporting time was significantly shorter for summary method (6.76 min) compared to conventional method (8.95 min) (p < 0.005). Among the 100 radiologist summaries, 10 received RADPEER scores worse than 1, with three deemed to have clinically significant discrepancies. Only one LLM-generated report received a worse RADPEER score than its corresponding summary. Error frequencies among LLM-generated reports showed no significant differences across models, with template-related errors being most common (χ2 = 1.146, p = 0.564). Quantitative analysis indicated significant differences in similarity and linguistic metrics among the three LLMs (p < 0.05), reflecting unique generation patterns. CONCLUSION Summary-based scan reporting along with use of LLMs to generate complete personalized report templates can shorten reporting time while maintaining the report quality. However, there remains a need for human oversight to address errors in the generated reports. RELEVANCE STATEMENT Summary-based reporting of radiological studies along with the use of large language models to generate tailored reports using generic templates has the potential to make the workflow more efficient by shortening the reporting time while maintaining the quality of reporting.
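As a small illustration of the two kinds of comparison described (a paired test on reporting times and a chi-square test on error-type frequencies), the following SciPy sketch uses invented numbers, not the study data; the error categories in the comment are assumptions.

import numpy as np
from scipy import stats

# Hypothetical per-scan reporting times (minutes) for the two workflows
conventional = np.array([9.1, 8.4, 10.2, 7.9, 9.0])
summary_based = np.array([7.0, 6.2, 7.5, 6.1, 6.9])

# Paired comparison of reporting times
t_stat, p_val = stats.ttest_rel(conventional, summary_based)
print("paired t-test:", t_stat, p_val)

# Chi-square test on error frequencies across three LLMs
# rows = models, columns = hypothetical error categories (template, content, formatting)
error_counts = np.array([[12, 5, 3],
                         [10, 6, 4],
                         [11, 4, 5]])
chi2, p, dof, expected = stats.chi2_contingency(error_counts)
print("chi-square:", chi2, p)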
Affiliation(s)
- Amit Gupta
- Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India
- Manzoor Hussain
- Department of Radiodiagnosis, All India Institute of Medical Sciences, New Delhi, India
- Ashish Rastogi
- Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India
- Krithika Rangarajan
- Department of Radiology, Dr B.R.A.IRCH, All India Institute of Medical Sciences, New Delhi, India
3
Hoch CC, Funk PF, Guntinas-Lichius O, Volk GF, Lüers JC, Hussain T, Wirth M, Schmidl B, Wollenberg B, Alfertshofer M. Harnessing advanced large language models in otolaryngology board examinations: an investigation using python and application programming interfaces. Eur Arch Otorhinolaryngol 2025; 282:3317-3328. [PMID: 40281318 PMCID: PMC12122622 DOI: 10.1007/s00405-025-09404-x]
Abstract
PURPOSE This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which had been evaluated on the same set of questions one year earlier, to identify changes in its performance over time. METHODS We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing. RESULTS GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison of GPT-3.5 Turbo's performance revealed a significant decline in accuracy over the past year. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models. CONCLUSION While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to improve the potential of LLMs for applications in medical education and certification.
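A minimal sketch of the kind of API-based batch querying described, assuming the OpenAI Python client; the model name, prompt wording, and question format are illustrative assumptions, and the study's German question bank is not reproduced.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical question records standing in for a board-style question bank
questions = [
    {"id": 1, "text": "Which nerve is most at risk during ...?", "choices": ["A ...", "B ...", "C ...", "D ..."]},
]

def ask(question, model="gpt-4o"):
    prompt = (
        "Answer the following board-style question by returning only the letter "
        "of the single best choice.\n\n"
        f"{question['text']}\n" + "\n".join(question["choices"])
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

answers = {q["id"]: ask(q) for q in questions}
print(answers)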
Affiliation(s)
- Cosima C Hoch
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Paul F Funk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747 Jena, Germany
- Orlando Guntinas-Lichius
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747 Jena, Germany
- Gerd Fabian Volk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747 Jena, Germany
- Jan-Christoffer Lüers
- Department of Otorhinolaryngology, Head and Neck Surgery, Medical Faculty, University of Cologne, 50937 Cologne, Germany
- Timon Hussain
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Markus Wirth
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Benedikt Schmidl
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Barbara Wollenberg
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Michael Alfertshofer
- Department of Oral and Maxillofacial Surgery, Institute of Health, Charité-Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 10117 Berlin, Germany
4
Cui H, Shen Z, Zhang J, Shao H, Qin L, Ho JC, Yang C. LLMs-based Few-Shot Disease Predictions using EHR: A Novel Approach Combining Predictive Agent Reasoning and Critical Agent Instruction. AMIA Annual Symposium Proceedings 2025; 2024:319-328. [PMID: 40417470 PMCID: PMC12099430]
Abstract
Electronic health records (EHRs) contain valuable patient data for health-related prediction tasks, such as disease prediction. Traditional approaches rely on supervised learning methods that require large labeled datasets, which can be expensive and challenging to obtain. In this study, we investigate the feasibility of applying Large Language Models (LLMs) to convert structured patient visit data (e.g., diagnoses, labs, prescriptions) into natural language narratives. We evaluate the zero-shot and few-shot performance of LLMs using various EHR-prediction-oriented prompting strategies. Furthermore, we propose a novel approach that utilizes LLM agents with different roles: a predictor agent that makes predictions and generates reasoning processes and a critic agent that analyzes incorrect predictions and provides guidance for improving the reasoning of the predictor agent. Our results demonstrate that with the proposed approach, LLMs can achieve decent few-shot performance compared to traditional supervised learning methods in EHR-based disease predictions, suggesting its potential for health-oriented applications.
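A rough sketch of how a predictor/critic loop of the kind described could be organized; call_llm is a placeholder for any chat-completion API call, and the prompts and loop structure are illustrative rather than the authors' implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call; not the authors' implementation."""
    raise NotImplementedError

def predict_with_critique(patient_narrative: str, rounds: int = 2) -> str:
    """Illustrative predictor/critic loop: the predictor states a prediction with reasoning,
    and the critic reviews it and suggests how the reasoning could be improved."""
    feedback = ""
    prediction = ""
    for _ in range(rounds):
        predictor_prompt = (
            "You are a clinical prediction assistant. Given the patient record below, "
            "predict whether the target disease will occur (yes/no) and explain your reasoning.\n"
            f"Patient record:\n{patient_narrative}\n"
            f"Reviewer feedback from the previous round (may be empty):\n{feedback}"
        )
        prediction = call_llm(predictor_prompt)

        critic_prompt = (
            "You are a critical reviewer. Analyze the prediction and reasoning below, "
            "point out likely errors, and give concrete guidance for improvement.\n"
            f"{prediction}"
        )
        feedback = call_llm(critic_prompt)
    return prediction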
Affiliation(s)
- Hejie Cui
- Department of Computer Science, Emory University, Atlanta, GA, USA
- Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA
- Zhuocheng Shen
- Department of Computer Science, Emory University, Atlanta, GA, USA
- Jieyu Zhang
- School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
- Hui Shao
- Rollins School of Public Health, Emory University, Atlanta, GA, USA
- School of Medicine, Emory University, Atlanta, GA, USA
- Lianhui Qin
- Department of Computer Science & Engineering, UCSD, San Diego, CA, USA
- Joyce C Ho
- Department of Computer Science, Emory University, Atlanta, GA, USA
- Carl Yang
- Department of Computer Science, Emory University, Atlanta, GA, USA
- Rollins School of Public Health, Emory University, Atlanta, GA, USA
5
Tripathi S, Patel J, Mutter L, Dorfner FJ, Bridge CP, Daye D. Large language models as an academic resource for radiologists stepping into artificial intelligence research. Curr Probl Diagn Radiol 2025; 54:342-348. [PMID: 39672727 DOI: 10.1067/j.cpradiol.2024.12.004]
Abstract
BACKGROUND Radiologists increasingly use artificial intelligence (AI) to enhance diagnostic accuracy and optimize workflows. However, many lack the technical skills to effectively apply machine learning (ML) and deep learning (DL) algorithms, limiting the accessibility of these methods to radiology researchers who could otherwise benefit from them. Large language models (LLMs), such as GPT-4o, may serve as virtual advisors, offering tailored algorithm recommendations for specific research needs. This study evaluates GPT-4o's effectiveness as a recommender system to enhance radiologists' understanding and implementation of AI in research. INTERVENTION GPT-4o was used to recommend ML and DL algorithms based on specific details provided by researchers, including dataset characteristics, modality types, data sizes, and research objectives. The model acted as a virtual advisor, guiding researchers in selecting the most appropriate models for their studies. METHODS The study systematically evaluated GPT-4o's recommendations for clarity, task alignment, model diversity, and baseline selection. Responses were graded to assess the model's ability to meet the needs of radiology researchers. RESULTS GPT-4o effectively recommended appropriate ML and DL algorithms for various radiology tasks, including segmentation, classification, and regression in medical imaging. The model suggested a diverse range of established and innovative algorithms, such as U-Net, Random Forest, Attention U-Net, and EfficientNet, aligning well with accepted practices. CONCLUSION GPT-4o shows promise as a valuable tool for radiologists and early career researchers by providing clear and relevant AI and ML algorithm recommendations. Its ability to bridge the knowledge gap in AI implementation could democratize access to advanced technologies, fostering innovation and improving radiology research quality. Further studies should explore integrating LLMs into routine workflows and their role in ongoing professional development.
Affiliation(s)
- Satvik Tripathi
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
- Jay Patel
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
- Liam Mutter
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
- Felix J Dorfner
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA; Department of Radiology, Charité - Universitätsmedizin Berlin corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Hindenburgdamm 30, 12203 Berlin, Germany
- Christopher P Bridge
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
- Dania Daye
- Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA; Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
6
Singh R, Hamouda M, Chamberlin JH, Tóth A, Munford J, Silbergleit M, Baruah D, Burt JR, Kabakus IM. ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports. Clin Imaging 2025; 121:110455. [PMID: 40090067 DOI: 10.1016/j.clinimag.2025.110455]
Abstract
OBJECTIVE To evaluate the accuracy of large language models (LLMs) in generating Lung-RADS scores based on lung cancer screening low-dose computed tomography radiology reports. MATERIAL AND METHODS A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. LLMs evaluated included ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign Lung-RADS scores based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to radiologist-assigned scores. Efficiency was assessed by measuring response times for each LLM. RESULTS ChatGPT-4o achieved the highest accuracy (83.6 %) in assigning Lung-RADS scores compared to other models, with ChatGPT-3.5 reaching 70.1 %. Gemini and Gemini Advanced had similar accuracy (70.9 % and 65.1 %, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (κ = 0.836), although it was less than the previously reported human interobserver agreement. CONCLUSION ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.
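For reference, accuracy and agreement figures of this kind can be computed with scikit-learn; the label lists below are invented, not the study data, and the unweighted kappa here is only one way such agreement might have been quantified.

from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical radiologist-assigned and LLM-assigned Lung-RADS categories for the same reports
radiologist = ["1", "2", "3", "4A", "4B", "2", "1", "3"]
llm         = ["1", "2", "3", "4A", "4A", "2", "1", "4A"]

print("accuracy:", accuracy_score(radiologist, llm))
print("kappa:", cohen_kappa_score(radiologist, llm))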
Affiliation(s)
- Ria Singh
- Osteopathic Medical School, Kansas City University, Kansas City, MO, USA
- Mohamed Hamouda
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- Jordan H Chamberlin
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- Adrienn Tóth
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- James Munford
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- Matthew Silbergleit
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- Dhiraj Baruah
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
- Jeremy R Burt
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, University of Utah School of Medicine, Salt Lake City, UT, USA
- Ismail M Kabakus
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
7
Sun C, Teichman K, Zhou Y, Critelli B, Nauheim D, Keir G, Wang X, Zhong J, Flanders AE, Shih G, Peng Y. Generative Large Language Models Trained for Detecting Errors in Radiology Reports. Radiology 2025; 315:e242575. [PMID: 40392090 DOI: 10.1148/radiol.242575]
Abstract
Background Large language models (LLMs) offer promising solutions, yet their application in medical proofreading, particularly in detecting errors within radiology reports, remains underexplored. Purpose To develop and evaluate generative LLMs for detecting errors in radiology reports during medical proofreading. Materials and Methods In this retrospective study, a dataset was constructed with two parts. The first part included 1656 synthetic chest radiology reports generated by GPT-4 (OpenAI) using specified prompts, with 828 being error-free synthetic reports and 828 containing errors. The second part included 614 reports: 307 error-free reports between 2011 and 2016 from the MIMIC chest radiograph (MIMIC-CXR) database and 307 corresponding synthetic reports with errors generated by GPT-4 on the basis of these MIMIC-CXR reports and specified prompts. All errors were categorized into four types: negation, left/right, interval change, and transcription errors. Then, several models, including Llama-3 (Meta AI), GPT-4, and BiomedBERT, were refined using zero-shot prompting, few-shot prompting, or fine-tuning strategies. Finally, the performance of these models was evaluated using F1 scores, 95% CIs, and paired-sample t tests on the constructed dataset, with the prediction results further assessed by radiologists. Results Using zero-shot prompting, the fine-tuned Llama-3-70B-Instruct model achieved the best performance, with the following F1 scores: 0.769 (95% CI: 0.757, 0.771) for negation errors, 0.772 (95% CI: 0.762, 0.780) for left/right errors, 0.750 (95% CI: 0.736, 0.763) for interval change errors, 0.828 (95% CI: 0.822, 0.832) for transcription errors, and 0.780 overall. In the real-world evaluation phase, two radiologists reviewed 200 randomly selected reports output by the model (50 for each error type). Of these, 99 were confirmed by both radiologists to contain errors detected by the models, and 163 were confirmed by at least one radiologist to contain model-detected errors. Conclusion Generative LLMs, fine-tuned on synthetic and MIMIC-CXR radiology reports, greatly enhanced error detection in radiology reports. © RSNA, 2025 Supplemental material is available for this article. See also the editorial by Marrocchio and Sverzellati in this issue.
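A minimal sketch of computing an F1 score with a bootstrap 95% CI for one error type; the paper does not state how its CIs were derived, so the bootstrap here is an assumption, and the labels are simulated rather than taken from the dataset.

import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical binary labels for one error type (1 = report contains the error)
y_true = rng.integers(0, 2, size=200)
y_pred = np.where(rng.random(200) < 0.85, y_true, 1 - y_true)  # imperfect detector

print("F1:", f1_score(y_true, y_pred))

# Bootstrap 95% CI for the F1 score (one possible way to obtain such intervals)
scores = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    scores.append(f1_score(y_true[idx], y_pred[idx]))
print("95% CI:", np.percentile(scores, [2.5, 97.5]))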
Affiliation(s)
- Cong Sun
- Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022
- Kurt Teichman
- Department of Radiology, Weill Cornell Medicine, New York, NY
- Yiliang Zhou
- Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022
- Brian Critelli
- Department of Radiology, Weill Cornell Medicine, New York, NY
- David Nauheim
- Department of Radiology, Weill Cornell Medicine, New York, NY
- Graham Keir
- Department of Radiology, Weill Cornell Medicine, New York, NY
- Xindi Wang
- Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022
- Judy Zhong
- Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022
- Adam E Flanders
- Department of Radiology, Thomas Jefferson University, Philadelphia, Pa
- George Shih
- Department of Radiology, Weill Cornell Medicine, New York, NY
- Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, 575 Lexington Ave, New York, NY 10022
8
Gunes YC, Cesur T, Camur E, Cifci BE, Kaya T, Colakoglu MN, Koc U, Okten RS. Textual Proficiency and Visual Deficiency: A Comparative Study of Large Language Models and Radiologists in MRI Artifact Detection and Correction. Acad Radiol 2025; 32:2411-2421. [PMID: 39939230 DOI: 10.1016/j.acra.2025.01.004]
Abstract
RATIONALE AND OBJECTIVES To assess the performance of Large Language Models (LLMs) in detecting and correcting MRI artifacts compared to radiologists using text-based and visual questions. MATERIALS AND METHODS This cross-sectional observational study included three phases. Phase 1 involved six LLMs (ChatGPT o1-preview, ChatGPT-4o, ChatGPT-4V, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, Claude 3 Opus) and five radiologists (two residents, two junior radiologists, one senior radiologist) answering 42 text-based questions on MRI artifacts. In Phase 2, the same radiologists and five multimodal LLMs evaluated 100 MRI images, each containing a single artifact. Phase 3 reassessed the identical tasks 1.5 months later to evaluate temporal consistency. Responses were graded using 4-point Likert scales for "Management Score" (text-based) and "Correction Score" (visual). McNemar's test compared response accuracy, and the Wilcoxon test assessed score differences. RESULTS LLMs outperformed radiologists in text-based tasks, with ChatGPT o1-preview scoring the highest (3.71±0.60 in Round 1; 3.76±0.84 in Round 2) (p<0.05). In visual tasks, radiologists performed significantly better, with the Senior Radiologist achieving 92% and 94% accuracy in Rounds 1 and 2, respectively (p<0.05). The top-performing LLM (ChatGPT-4o) achieved only 20% and 18% accuracy. Correction Scores mirrored this difference, with radiologists consistently scoring higher than LLMs (p<0.05). CONCLUSION LLMs excel in text-based tasks but have notable limitations in visual artifact interpretation, making them unsuitable for independent diagnostics. They are promising as educational tools or adjuncts in "human-in-the-loop" systems, with multimodal AI improvements necessary to bridge these gaps.
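For orientation, the two paired tests named above can be run as follows; the Likert scores and 2x2 counts are invented, not the study data, and the table layout is only an illustration of how discordant pairs enter McNemar's test.

import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical paired Likert "Correction Scores" (1-4) for radiologist vs. LLM on the same cases
radiologist_scores = np.array([4, 4, 3, 4, 2, 4, 3, 4])
llm_scores         = np.array([2, 1, 2, 3, 1, 2, 2, 1])
print(wilcoxon(radiologist_scores, llm_scores))

# McNemar's test on paired correct/incorrect outcomes:
# rows = radiologist correct/incorrect, columns = LLM correct/incorrect
table = np.array([[20, 60],
                  [3, 17]])
print(mcnemar(table, exact=True))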
Affiliation(s)
- Yasin Celal Gunes
- Department of Radiology, Kirikkale Yuksek Ihtisas Hospital, Kirikkale, Turkey (Y.C.G.)
- Turay Cesur
- Department of Radiology, Mamak State Hospital, Ankara, Turkey (T.C.)
- Eren Camur
- Department of Radiology, Ankara 29 Mayıs State Hospital, Ankara, Turkey (E.C.)
- Bilal Egemen Cifci
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Turan Kaya
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Mehmet Numan Colakoglu
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Ural Koc
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
- Rıza Sarper Okten
- Department of Radiology, Ankara Bilkent City Hospital, Ankara, Turkey (B.E.C., T.K., M.N.C., U.K., R.S.O.)
9
Lopez-Ramirez F, Yasrab M, Tixier F, Kawamoto S, Fishman EK, Chu LC. The Role of AI in the Evaluation of Neuroendocrine Tumors: Current State of the Art. Semin Nucl Med 2025; 55:345-357. [PMID: 40023682 DOI: 10.1053/j.semnuclmed.2025.02.003]
Abstract
Advancements in Artificial Intelligence (AI) are driving a paradigm shift in the field of medical diagnostics, integrating new developments into various aspects of the clinical workflow. Neuroendocrine neoplasms are a diverse and heterogeneous group of tumors that pose significant diagnostic and management challenges due to their variable clinical presentations and biological behavior. Innovative approaches are essential to overcome these challenges and improve the current standard of care. AI-driven applications, particularly in imaging workflows, hold promise for enhancing tumor detection, classification, and grading by leveraging advanced radiomics and deep learning techniques. This article reviews the current and emerging applications of AI computer vision in the care of neuroendocrine neoplasms, focusing on its integration into imaging workflows, diagnostics, prognostic modeling, and therapeutic planning.
Affiliation(s)
- Felipe Lopez-Ramirez
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Mohammad Yasrab
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Florent Tixier
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Satomi Kawamoto
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Elliot K Fishman
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
- Linda C Chu
- The Russell H. Morgan Department of Radiology and Radiological Science, Johns Hopkins University School of Medicine, Baltimore, Maryland
10
Ballard DH, Antigua-Made A, Barre E, Edney E, Gordon EB, Kelahan L, Lodhi T, Martin JG, Ozkan M, Serdynski K, Spieler B, Zhu D, Adams SJ. Impact of ChatGPT and Large Language Models on Radiology Education: Association of Academic Radiology-Radiology Research Alliance Task Force White Paper. Acad Radiol 2025; 32:3039-3049. [PMID: 39616097 DOI: 10.1016/j.acra.2024.10.023]
Abstract
Generative artificial intelligence, including large language models (LLMs), holds immense potential to enhance healthcare, medical education, and health research. Recognizing the transformative opportunities and potential risks afforded by LLMs, the Association of Academic Radiology-Radiology Research Alliance convened a task force to explore the promise and pitfalls of using LLMs such as ChatGPT in radiology. This white paper explores the impact of LLMs on radiology education, highlighting their potential to enrich curriculum development, teaching and learning, and learner assessment. Despite these advantages, the implementation of LLMs presents challenges, including limits on accuracy and transparency, the risk of misinformation, data privacy issues, and potential biases, which must be carefully considered. We provide recommendations for the successful integration of LLMs and LLM-based educational tools into radiology education programs, emphasizing assessment of the technological readiness of LLMs for specific use cases, structured planning, regular evaluation, faculty development, increased training opportunities, academic-industry collaboration, and research on best practices for employing LLMs in education.
Affiliation(s)
- David H Ballard
- Mallinckrodt Institute of Radiology, Washington University School of Medicine, St. Louis, Missouri, USA
- Emily Barre
- Duke University School of Medicine, Durham, North Carolina, USA
- Elizabeth Edney
- Department of Radiology, University of Nebraska Medical Center, Omaha, Nebraska, USA
- Emile B Gordon
- Department of Radiology, University of California San Diego, San Diego, California, USA
- Linda Kelahan
- Department of Radiology, Northwestern University Feinberg School of Medicine, Chicago, Illinois, USA
- Taha Lodhi
- Brody School of Medicine at East Carolina University, Greenville, North Carolina, USA
- Melis Ozkan
- University of Michigan Medical School, Ann Arbor, Michigan, USA
- Bradley Spieler
- Department of Radiology, Louisiana State University School of Medicine, University Medical Center, New Orleans, Louisiana, USA
- Daphne Zhu
- Duke University School of Medicine, Durham, North Carolina, USA
- Scott J Adams
- Department of Medical Imaging, Royal University Hospital, College of Medicine, University of Saskatchewan, Saskatoon, Saskatchewan, Canada
11
Elek A, Ekizalioğlu DD, Güler E. Evaluating Microsoft Bing with ChatGPT-4 for the assessment of abdominal computed tomography and magnetic resonance images. Diagn Interv Radiol 2025; 31:196-205. [PMID: 39155793 PMCID: PMC12057540 DOI: 10.4274/dir.2024.232680]
Abstract
PURPOSE To evaluate the performance of Microsoft Bing with ChatGPT-4 technology in analyzing abdominal computed tomography (CT) and magnetic resonance images (MRI). METHODS A comparative and descriptive analysis was conducted using the institutional picture archiving and communication systems. A total of 80 abdominal images (44 CT, 36 MRI) that showed various entities affecting the abdominal structures were included. Microsoft Bing's interpretations were compared with the impressions of radiologists in terms of recognition of the imaging modality, identification of the imaging planes (axial, coronal, and sagittal), sequences (in the case of MRI), contrast media administration, correct identification of the anatomical region depicted in the image, and detection of abnormalities. RESULTS Microsoft Bing detected that the images were CT scans with 95.4% accuracy (42/44) and that the images were MRI scans with 86.1% accuracy (31/36). However, it failed to detect one CT image (2.3%) and misidentified another CT image as an MRI (2.3%). On the other hand, it also misidentified four MRI as CT images (11.1%) and one as an X-ray (2.7%). Bing achieved an 83.75% success rate in correctly identifying abdominal regions, with 90% accuracy for CT scans (40/44) and 77.7% for MRI scans (28/36). Concerning the identification of imaging planes, Bing achieved a success rate of 95.4% for CT images and 83.3% for MRI. Regarding the identification of MRI sequences (T1-weighted and T2-weighted), the success rate was 68.75%. In the identification of the use of contrast media for CT scans, the success rate was 64.2%. Bing detected abnormalities in 35% of the images but achieved a correct interpretation rate of 10.7% for the definite diagnosis. CONCLUSION While Microsoft Bing, leveraging ChatGPT-4 technology, demonstrates proficiency in basic task identification on abdominal CT and MRI, its inability to reliably interpret abnormalities highlights the need for continued refinement to enhance its clinical applicability. CLINICAL SIGNIFICANCE The contribution of large language models (LLMs) to the diagnostic process in radiology is still being explored. However, with a comprehensive understanding of their capabilities and limitations, LLMs can significantly support radiologists during diagnosis and improve the overall efficiency of abdominal radiology practices. Acknowledging the limitations of current studies related to ChatGPT in this field, our work provides a foundation for future clinical research, paving the way for more integrated and effective diagnostic tools.
Affiliation(s)
- Alperen Elek
- Ege University Faculty of Medicine, İzmir, Türkiye
- Ezgi Güler
- Ege University Faculty of Medicine, Department of Radiology, İzmir, Türkiye
12
Kuzan BN, Meşe İ, Yaşar S, Kuzan TY. A retrospective evaluation of the potential of ChatGPT in the accurate diagnosis of acute stroke. Diagn Interv Radiol 2025; 31:187-195. [PMID: 39221691 PMCID: PMC12057523 DOI: 10.4274/dir.2024.242892]
Abstract
PURPOSE Stroke is a neurological emergency requiring rapid, accurate diagnosis to prevent severe consequences. Early diagnosis is crucial for reducing morbidity and mortality. Artificial intelligence (AI) diagnosis support tools, such as Chat Generative Pre-trained Transformer (ChatGPT), offer rapid diagnostic advantages. This study assesses ChatGPT's accuracy in interpreting diffusion-weighted imaging (DWI) for acute stroke diagnosis. METHODS A retrospective analysis was conducted to identify the presence of stroke using DWI and apparent diffusion coefficient (ADC) map images. Patients aged >18 years who exhibited diffusion restriction and had a clinically explainable condition were included in the study. Patients with artifacts that affected image homogeneity, accuracy, and clarity, as well as those who had undergone previous surgery or had a history of stroke, were excluded from the study. ChatGPT was asked four consecutive questions regarding the identification of the magnetic resonance imaging (MRI) sequence, the demonstration of diffusion restriction on the ADC map after sequence recognition, and the identification of hemispheres and specific lobes. Each question was repeated 10 times to ensure consistency. Senior radiologists subsequently verified the accuracy of ChatGPT's responses, classifying them as either correct or incorrect. We assumed a response to be incorrect if it was partially correct or suggested multiple answers. These responses were systematically recorded. We also recorded non-responses from ChatGPT-4V when it failed to provide an answer to a query. We assessed ChatGPT-4V's performance by calculating the number and percentage of correct responses, incorrect responses, and non-responses across all images and questions, a metric known as "accuracy." ChatGPT-4V was considered successful if it answered ≥80% of the examples correctly. RESULTS A total of 530 diffusion MRI, of which 266 were stroke images and 264 were normal, were evaluated in the study. For the initial query identifying MRI sequence type, ChatGPT-4V's accuracy was 88.3% for stroke and 90.1% for normal images. For detecting diffusion restriction, ChatGPT-4V had an accuracy of 79.5% for stroke images, with a 15% false positive rate for normal images. Regarding identifying the brain or cerebellar hemisphere involved, ChatGPT-4V correctly identified the hemisphere in 26.2% of stroke images. For identifying the specific brain lobe or cerebellar area affected, ChatGPT-4V had a 20.4% accuracy for stroke images. The diagnostic sensitivity of ChatGPT-4V in acute stroke was found to be 79.57%, with a specificity of 84.87%, a positive predictive value of 83.86%, a negative predictive value of 80.80%, and a diagnostic odds ratio of 21.86. CONCLUSION Despite limitations, ChatGPT shows potential as a supportive tool for healthcare professionals in interpreting diffusion examinations in stroke cases, where timely diagnosis is critical. CLINICAL SIGNIFICANCE ChatGPT can play an important role in various aspects of stroke cases, such as risk assessment, early diagnosis, and treatment planning.
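The reported diagnostic indices follow directly from a 2x2 table. The sketch below uses invented counts chosen only to roughly match the reported percentages; they are not the study's raw data.

# Illustrative 2x2 computation of sensitivity, specificity, PPV, NPV, and diagnostic odds ratio
tp, fn = 212, 54   # stroke images: detected vs. missed (hypothetical counts)
tn, fp = 224, 40   # normal images: correctly negative vs. false positive (hypothetical counts)

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
diagnostic_odds_ratio = (tp * tn) / (fp * fn)

print(sensitivity, specificity, ppv, npv, diagnostic_odds_ratio)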
Affiliation(s)
- Beyza Nur Kuzan
- Kartal Dr. Lütfi Kırdar City Hospital Clinic of Radiology, İstanbul, Türkiye
- İsmail Meşe
- Üsküdar State Hospital Clinic of Radiology, İstanbul, Türkiye
- Servan Yaşar
- Sancaktepe Şehit Prof. Dr. İlhan Varank Training and Research Hospital Clinic of Radiology, İstanbul, Türkiye
- Taha Yusuf Kuzan
- Sancaktepe Şehit Prof. Dr. İlhan Varank Training and Research Hospital Clinic of Radiology, İstanbul, Türkiye
13
Suárez A, Arena S, Herranz Calzada A, Castillo Varón AI, Diaz-Flores García V, Freire Y. Decoding wisdom: Evaluating ChatGPT's accuracy and reproducibility in analyzing orthopantomographic images for third molar assessment. Comput Struct Biotechnol J 2025; 28:141-147. [PMID: 40271108 PMCID: PMC12017887 DOI: 10.1016/j.csbj.2025.04.010]
Abstract
The integration of Artificial Intelligence (AI) into healthcare has opened new avenues for clinical decision support, particularly in radiology. The aim of this study was to evaluate the accuracy and reproducibility of ChatGPT-4o in the radiographic image interpretation of orthopantomograms (OPGs) for assessment of lower third molars, simulating real patient requests for tooth extraction. Thirty OPGs were analyzed, each paired with a standardized prompt submitted to ChatGPT-4o, generating 900 responses (30 per radiograph). Two oral surgery experts independently evaluated the responses using a three-point Likert scale (correct, partially correct/incomplete, incorrect), with disagreements resolved by a third expert. ChatGPT-4o achieved an accuracy rate of 38.44 % (95 % CI: 35.27 %-41.62 %). The percentage agreement among repeated responses was 82.7 %, indicating high consistency, though Gwet's coefficient of agreement (60.4 %) suggested only moderate repeatability. While the model correctly identified general features in some cases, it frequently provided incomplete or fabricated information, particularly in complex radiographs involving overlapping structures or underdeveloped roots. These findings highlight ChatGPT-4o's current limitations in dental radiographic interpretation. Although it demonstrated some capability in analyzing OPGs, its accuracy and reliability remain insufficient for unsupervised clinical use. Professional oversight is essential to prevent diagnostic errors. Further refinement and specialized training of AI models are needed to enhance their performance and ensure safe integration into dental practice, especially in patient-facing applications.
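A sketch of how a proportion with a 95% CI of this kind can be computed; the count is back-calculated from the reported 38.44% of 900 responses, and the Wilson method is an assumption about how such an interval might be obtained, not the authors' stated method.

from statsmodels.stats.proportion import proportion_confint

correct, total = 346, 900  # approximate counts implied by the reported 38.44% accuracy
accuracy = correct / total
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(accuracy, (low, high))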
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Stefania Arena
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Alberto Herranz Calzada
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Ana Isabel Castillo Varón
- Department of Medicine, Faculty of Medicine, Health and Sports, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Victor Diaz-Flores García
- Department of Pre-Clinic Dentistry I, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry II, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, Madrid 28670, Spain
14
Bluethgen C, Van Veen D, Zakka C, Link KE, Fanous AH, Daneshjou R, Frauenfelder T, Langlotz CP, Gatidis S, Chaudhari A. Best Practices for Large Language Models in Radiology. Radiology 2025; 315:e240528. [PMID: 40298602 DOI: 10.1148/radiol.240528]
Abstract
Radiologists must integrate complex imaging data with clinical information to produce actionable insights. This task requires a nuanced application of language across many activities, including managing clinical requests, analyzing imaging findings in the context of clinical data, interpreting these through the radiologist's lens, and effectively documenting and communicating the outcomes. Radiology practices must ensure reliable communication among numerous systems and stakeholders critical for medical decision-making. Large language models (LLMs) offer an opportunity to improve the management and interpretation of the vast amounts of text data in radiology. Despite being developed as general-purpose tools, these advanced computational models demonstrate impressive capabilities in specialized tasks, even without specific training. Unlocking the potential of LLMs for radiology requires an understanding of their foundations and a strategic approach to navigate their idiosyncrasies. This review, drawing from practical radiology and machine learning expertise, provides general and technically adept radiologists insight into the potential of LLMs in radiology. It also equips those interested in implementing applicable best practices that have so far stood the test of time in the rapidly evolving landscape of LLMs. The review provides practical advice for optimizing LLM characteristics for radiology practices, including advice on limitations, effective prompting, and fine-tuning strategies.
Affiliation(s)
- Christian Bluethgen
- Dave Van Veen
- Cyril Zakka
- Katherine E Link
- Aaron Hunter Fanous
- Roxana Daneshjou
- Thomas Frauenfelder
- Curtis P Langlotz
- Sergios Gatidis
- Akshay Chaudhari
- Shared affiliation listing for all authors: From the Stanford Center for Artificial Intelligence in Medicine and Imaging, Palo Alto, Calif (C.B., D.V.V., C.P.L., S.G., A.C.); Institute for Diagnostic and Interventional Radiology, University Hospital Zurich, University of Zurich, Rämistrasse 100, 8005 Zurich, Switzerland (C.B., T.F.); Department of Electrical Engineering, Stanford University, Stanford, Calif (D.V.V.); Department of Cardiothoracic Surgery, Stanford Medicine, Stanford, Calif (C.Z.); Department of Medical Education, Icahn School of Medicine at Mount Sinai, New York, NY (K.E.L.); NVIDIA, New York, NY (K.E.L.); UT Health San Antonio, San Antonio, Tex (A.H.F.); Department of Biomedical Data Science, Stanford Medicine, Stanford, Calif (A.H.F., R.D., C.P.L., A.C.); Department of Dermatology, Stanford Medicine, Redwood City, Calif (R.D.); Department of Medicine, Stanford Medicine, Stanford, Calif (C.P.L., A.C.); and Department of Radiology, Stanford University, Stanford, Calif (C.P.L., S.G., A.C.)
15
|
Halfmann MC, Mildenberger P, Jorg T. [Artificial intelligence in radiology : Literature overview and reading recommendations]. RADIOLOGIE (HEIDELBERG, GERMANY) 2025; 65:266-270. [PMID: 39904811 DOI: 10.1007/s00117-025-01419-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 01/13/2025] [Indexed: 02/06/2025]
Abstract
BACKGROUND Due to the ongoing rapid advancement of artificial intelligence (AI), including large language models (LLMs), radiologists will soon face the challenge of the responsible clinical integration of these models. OBJECTIVES The aim of this work is to provide an overview of current developments regarding LLMs, potential applications in radiology, and their (future) relevance and limitations. MATERIALS AND METHODS This review analyzes publications on LLMs for specific applications in medicine and radiology. Additionally, literature related to the challenges of clinical LLM use was reviewed and summarized. RESULTS In addition to a general overview of current literature on radiological applications of LLMs, several particularly noteworthy studies on the subject are recommended. CONCLUSIONS To facilitate the forthcoming clinical integration of LLMs, radiologists need to engage with the topic, understand the various application areas, and be aware of potential limitations so that they can address challenges related to patient safety, ethics, and data protection.
Collapse
Affiliation(s)
- Moritz C Halfmann
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
| | - Peter Mildenberger
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
| | - Tobias Jorg
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland.
| |
Collapse
|
16
|
Lyo S, Mohan S, Hassankhani A, Noor A, Dako F, Cook T. From Revisions to Insights: Converting Radiology Report Revisions into Actionable Educational Feedback Using Generative AI Models. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025; 38:1265-1279. [PMID: 39160366 PMCID: PMC11950553 DOI: 10.1007/s10278-024-01233-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/13/2024] [Revised: 08/07/2024] [Accepted: 08/08/2024] [Indexed: 08/21/2024]
Abstract
Expert feedback on trainees' preliminary reports is crucial for radiologic training, but real-time feedback can be challenging due to non-contemporaneous, remote reading and increasing imaging volumes. Trainee report revisions contain valuable educational feedback, but synthesizing data from raw revisions is challenging. Generative AI models can potentially analyze these revisions and provide structured, actionable feedback. This study used the OpenAI GPT-4 Turbo API to analyze paired synthesized and open-source analogs of preliminary and finalized reports, identify discrepancies, categorize their severity and type, and suggest review topics. Expert radiologists reviewed the output by grading discrepancies, evaluating the severity and category accuracy, and suggested review topic relevance. The reproducibility of discrepancy detection and maximal discrepancy severity was also examined. The model exhibited high sensitivity, detecting significantly more discrepancies than radiologists (W = 19.0, p < 0.001) with a strong positive correlation (r = 0.778, p < 0.001). Interrater reliability for severity and type were fair (Fleiss' kappa = 0.346 and 0.340, respectively; weighted kappa = 0.622 for severity). The LLM achieved a weighted F1 score of 0.66 for severity and 0.64 for type. Generated teaching points were considered relevant in ~ 85% of cases, and relevance correlated with the maximal discrepancy severity (Spearman ρ = 0.76, p < 0.001). The reproducibility was moderate to good (ICC (2,1) = 0.690) for the number of discrepancies and substantial for maximal discrepancy severity (Fleiss' kappa = 0.718; weighted kappa = 0.94). Generative AI models can effectively identify discrepancies in report revisions and generate relevant educational feedback, offering promise for enhancing radiology training.
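The agreement statistics reported in studies of this kind (weighted F1 for severity/type labels, Fleiss' kappa across raters) can be reproduced with standard Python tooling. The sketch below uses made-up placeholder labels, not study data, and assumes scikit-learn and statsmodels; it is illustrative only, not the authors' code.

```python
# Sketch: agreement metrics of the kind reported above (weighted F1, Fleiss' kappa).
# All label values are illustrative placeholders.
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters

# Hypothetical severity labels (0 = none, 1 = minor, 2 = major) for 8 discrepancies
llm_severity    = [2, 1, 0, 2, 1, 1, 0, 2]
expert_severity = [2, 1, 1, 2, 1, 0, 0, 2]

# Weighted F1 of the LLM's severity grades against the expert reference
print("weighted F1:", f1_score(expert_severity, llm_severity, average="weighted"))

# Fleiss' kappa across three raters grading the same 8 discrepancies
ratings = np.array([
    [2, 2, 1],
    [1, 1, 1],
    [0, 1, 0],
    [2, 2, 2],
    [1, 1, 2],
    [1, 0, 1],
    [0, 0, 0],
    [2, 2, 2],
])  # rows = items, columns = raters
counts, _ = aggregate_raters(ratings)   # convert to item-by-category counts
print("Fleiss' kappa:", fleiss_kappa(counts))
```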
Collapse
Affiliation(s)
- Shawn Lyo
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA.
| | - Suyash Mohan
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
| | - Alvand Hassankhani
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
| | - Abass Noor
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
| | - Farouk Dako
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
| | - Tessa Cook
- Department of Radiology, Hospital of the University of Pennsylvania, Philadelphia, PA, USA
| |
Collapse
|
17
|
Kim TT, Makutonin M, Sirous R, Javan R. Optimizing Large Language Models in Radiology and Mitigating Pitfalls: Prompt Engineering and Fine-tuning. Radiographics 2025; 45:e240073. [PMID: 40048389 DOI: 10.1148/rg.240073] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/13/2025]
Abstract
Large language models (LLMs) such as generative pretrained transformers (GPTs) have had a major impact on society, and there is increasing interest in using these models for applications in medicine and radiology. This article presents techniques to optimize these models and describes their known challenges and limitations. Specifically, the authors explore how to best craft natural language prompts, a process known as prompt engineering, for these models to elicit more accurate and desirable responses. The authors also explain how fine-tuning is conducted, in which a more general model, such as GPT-4, is further trained on a more specific use case, such as summarizing clinical notes, to further improve reliability and relevance. Despite the enormous potential of these models, substantial challenges limit their widespread implementation. These tools differ substantially from traditional health technology in their complexity and their probabilistic and nondeterministic nature, and these differences lead to issues such as "hallucinations," biases, lack of reliability, and security risks. Therefore, the authors provide radiologists with baseline knowledge of the technology underpinning these models and an understanding of how to use them, in addition to exploring best practices in prompt engineering and fine-tuning. Also discussed are current proof-of-concept use cases of LLMs in the radiology literature, such as in clinical decision support and report generation, and the limitations preventing their current adoption in medicine and radiology. ©RSNA, 2025 See invited commentary by Chung and Mongan in this issue.
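To make the prompt-engineering idea concrete, a minimal sketch follows: a system prompt that constrains role, task, and output format, sent through the OpenAI Python client. The model identifier and prompt wording are illustrative assumptions, not recommendations from the article.

```python
# Minimal prompt-engineering sketch (not the authors' code): a constrained system
# prompt sent via the OpenAI Python client. Model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

system_prompt = (
    "You are a radiology assistant. Summarize the clinical note in at most "
    "three bullet points and flag any finding that needs urgent follow-up. "
    "Answer only from the note; if information is missing, say so."
)

response = client.chat.completions.create(
    model="gpt-4o",                       # placeholder model identifier
    temperature=0,                        # reduce variability for clinical text
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Clinical note: ..."},
    ],
)
print(response.choices[0].message.content)
```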
Collapse
Affiliation(s)
- Theodore Taehoon Kim
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
| | - Michael Makutonin
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
| | - Reza Sirous
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
| | - Ramin Javan
- From the Department of Radiology, George Washington University School of Medicine and Health Sciences, 2300 I St NW, Washington, DC 20052 (T.T.K., R.J.); Yale School of Medicine, New Haven, Conn (M.M.); and University of California San Francisco, San Francisco, Calif (R.S.)
| |
Collapse
|
18
|
Blüthgen C. [Technical foundations of large language models]. RADIOLOGIE (HEIDELBERG, GERMANY) 2025; 65:227-234. [PMID: 40063090 PMCID: PMC11937190 DOI: 10.1007/s00117-025-01427-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Accepted: 02/10/2025] [Indexed: 03/26/2025]
Abstract
BACKGROUND Large language models (LLMs) such as ChatGPT have rapidly revolutionized the way computers can analyze human language and the way we can interact with computers. OBJECTIVE To give an overview of the emergence and basic principles of computational language models. METHODS Narrative literature-based analysis of the history of the emergence of language models, the technical foundations, the training process and the limitations of LLMs. RESULTS Nowadays, LLMs are mostly based on transformer models that can capture context through their attention mechanism. Through a multistage training process with comprehensive pretraining, supervised fine-tuning and alignment with human preferences, LLMs have developed a general understanding of language. This enables them to flexibly analyze texts and produce outputs of high linguistic quality. CONCLUSION Their technical foundations and training process make large language models versatile general-purpose tools for text processing, with numerous applications in radiology. The main limitation is the tendency to postulate incorrect but plausible-sounding information with high confidence.
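The attention mechanism the abstract refers to can be illustrated in a few lines. The toy sketch below implements scaled dot-product attention over a short token sequence in plain NumPy; the dimensions and random inputs are arbitrary.

```python
# Toy illustration of the attention mechanism: scaled dot-product attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to each key
    weights = softmax(scores, axis=-1)   # attention weights sum to 1 per query
    return weights @ V                   # context-weighted mix of the values

rng = np.random.default_rng(0)
tokens, d_model = 5, 8                   # 5 tokens, 8-dimensional embeddings
X = rng.normal(size=(tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)  # (5, 8): one context-aware vector per token
```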
Collapse
Affiliation(s)
- Christian Blüthgen
- Institut für Diagnostische und Interventionelle Radiologie, Universitätsspital Zürich, Universität Zürich, Rämistrasse 100, 8091, Zürich, Schweiz.
| |
Collapse
|
19
|
Mese I, Kocak B. ChatGPT as an effective tool for quality evaluation of radiomics research. Eur Radiol 2025; 35:2030-2042. [PMID: 39406959 DOI: 10.1007/s00330-024-11122-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2024] [Revised: 09/09/2024] [Accepted: 09/18/2024] [Indexed: 03/18/2025]
Abstract
OBJECTIVES This study aimed to evaluate the effectiveness of ChatGPT-4o in assessing the methodological quality of radiomics research using the radiomics quality score (RQS) compared to human experts. METHODS Published in European Radiology, European Radiology Experimental, and Insights into Imaging between 2023 and 2024, open-access and peer-reviewed radiomics research articles with creative commons attribution license (CC-BY) were included in this study. Pre-prints from MedRxiv were also included to evaluate potential peer-review bias. Using the RQS, each study was independently assessed twice by ChatGPT-4o and by two radiologists with consensus. RESULTS In total, 52 open-access and peer-reviewed articles were included in this study. Both ChatGPT-4o evaluation (average of two readings) and human experts had a median RQS of 14.5 (40.3% percentage score) (p > 0.05). Pairwise comparisons revealed no statistically significant difference between the readings of ChatGPT and human experts (corrected p > 0.05). The intraclass correlation coefficient for intra-rater reliability of ChatGPT-4o was 0.905 (95% CI: 0.840-0.944), and those for inter-rater reliability with human experts for each evaluation of ChatGPT-4o were 0.859 (95% CI: 0.756-0.919) and 0.914 (95% CI: 0.855-0.949), corresponding to good to excellent reliability for all. The evaluation by ChatGPT-4o took less time (2.9-3.5 min per article) compared to human experts (13.9 min per article by one reader). Item-wise reliability analysis showed ChatGPT-4o maintained consistently high reliability across almost all RQS items. CONCLUSION ChatGPT-4o provides reliable and efficient assessments of radiomics research quality. Its evaluations closely align with those of human experts and reduce evaluation time. KEY POINTS Question Is ChatGPT effective and reliable in evaluating radiomics research quality based on RQS? Findings ChatGPT-4o showed high reliability and efficiency, with evaluations closely matching human experts. It can considerably reduce the time required for radiomics research quality assessment. Clinical relevance ChatGPT-4o offers a quick and reliable automated alternative for evaluating the quality of radiomics research, with the potential to assess radiomics research at a large scale in the future.
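The intraclass correlation coefficients reported here can be computed with the pingouin package, as in the hedged sketch below. The RQS totals are made-up placeholders, and the choice of pingouin is an assumption rather than the authors' stated workflow.

```python
# Sketch of an ICC analysis between two raters' RQS totals, using pingouin.
# Scores are illustrative placeholders.
import pandas as pd
import pingouin as pg

scores = pd.DataFrame({
    "article": [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":   ["chatgpt", "expert"] * 4,
    "rqs":     [14, 15, 20, 18, 9, 11, 25, 24],
})

icc = pg.intraclass_corr(data=scores, targets="article",
                         raters="rater", ratings="rqs")
print(icc[["Type", "ICC", "CI95%"]])   # report the ICC form matching the study design
```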
Collapse
Affiliation(s)
- Ismail Mese
- Department of Radiology, Erenkoy Mental Health and Neurology Training and Research Hospital, University of Health Sciences, Istanbul, Turkey
| | - Burak Kocak
- Department of Radiology, Basaksehir Cam and Sakura City Hospital, University of Health Sciences, Istanbul, Turkey.
| |
Collapse
|
20
|
Arnold P, Henkel M, Bamberg F, Kotter E. [Integration of large language models into the clinic : Revolution in analysing and processing patient data to increase efficiency and quality in radiology]. RADIOLOGIE (HEIDELBERG, GERMANY) 2025; 65:243-248. [PMID: 40072530 DOI: 10.1007/s00117-025-01431-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 02/18/2025] [Indexed: 03/14/2025]
Abstract
BACKGROUND Large Language Models (LLMs) like ChatGPT, Llama, and Claude are transforming healthcare by interpreting complex text, extracting information, and providing guideline-based support. Radiology, with its high patient volume and digital workflows, is an ideal field for LLM integration. OBJECTIVE Assessment of the potential of LLMs to enhance efficiency, standardization, and decision support in radiology, while addressing ethical and regulatory challenges. MATERIAL AND METHODS Pilot studies at Freiburg and Basel university hospitals evaluated local LLM systems for tasks like prior report summarization and guideline-driven reporting. Integration with Picture Archiving and Communication System (PACS) and Electronic Health Record (EHR) systems was achieved via Digital Imaging and Communications in Medicine (DICOM) and Fast Healthcare Interoperability Resources (FHIR) standards. Metrics included time savings, compliance with the European Union (EU) Artificial Intelligence (AI) Act, and user acceptance. RESULTS LLMs demonstrate significant potential as a support tool for radiologists in clinical practice by reducing reporting times, automating routine tasks, and ensuring consistent, high-quality results. They also support interdisciplinary workflows (e.g., tumor boards) and meet data protection requirements when locally implemented. DISCUSSION Local LLM systems are feasible and beneficial in radiology, enhancing efficiency and diagnostic quality. Future work should refine transparency, expand applications, and ensure LLMs complement medical expertise while adhering to ethical and legal standards.
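The integration pattern described here (EHR access via FHIR, inference kept on-premises) can be sketched as below. The server URLs, patient identifier, resource query, and local LLM endpoint are all hypothetical placeholders; this is a hedged illustration of the pattern, not the pilot systems' actual code.

```python
# Hedged sketch: pull a prior report from a FHIR server and summarize it with a
# locally hosted LLM so patient data never leaves the institution. All endpoints
# and identifiers are placeholders.
import requests

FHIR_BASE = "https://fhir.example-hospital.org/fhir"       # placeholder
LLM_URL   = "http://localhost:8000/v1/chat/completions"    # placeholder local server

# Fetch the most recent DiagnosticReport for a patient (FHIR R4 search)
reports = requests.get(
    f"{FHIR_BASE}/DiagnosticReport",
    params={"patient": "12345", "_sort": "-date", "_count": 1},
    timeout=30,
).json()
report_text = reports["entry"][0]["resource"].get("conclusion", "")

# Ask the local model for a short summary of the prior report
answer = requests.post(
    LLM_URL,
    json={
        "model": "local-llm",
        "messages": [
            {"role": "system", "content": "Summarize the prior radiology report in three sentences."},
            {"role": "user", "content": report_text},
        ],
    },
    timeout=60,
).json()
print(answer["choices"][0]["message"]["content"])
```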
Collapse
Affiliation(s)
- Philipp Arnold
- Klinik für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Freiburg, Hugstetterstr. 55, 79106, Freiburg, Deutschland.
| | - Maurice Henkel
- Abteilung für Forschung und Analyse Services, Universitätsspital Basel, Basel, Schweiz
| | - Fabian Bamberg
- Klinik für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Freiburg, Hugstetterstr. 55, 79106, Freiburg, Deutschland
| | - Elmar Kotter
- Klinik für Diagnostische und Interventionelle Radiologie, Universitätsklinikum Freiburg, Hugstetterstr. 55, 79106, Freiburg, Deutschland
| |
Collapse
|
21
|
Arita Y, Nissan N. Integrating Deep Learning in Breast MRI: Technical Advances and Clinical Promise. Acad Radiol 2025:S1076-6332(25)00290-9. [PMID: 40169328 DOI: 10.1016/j.acra.2025.03.047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2025] [Accepted: 03/24/2025] [Indexed: 04/03/2025]
Affiliation(s)
- Yuki Arita
- Department of Radiology, Memorial Sloan Kettering Cancer Center, 1275 York Avenue, New York, NY 10065 (Y.A.).
| | - Noam Nissan
- Department of Radiology, Sheba Medical Center, 2 Derech Sheba, Tel HaShomer, Ramat Gan 5262000, Israel (N.N.)
| |
Collapse
|
22
|
Singh S, Chaurasia A, Raichandani S, Grewal H, Udare A, Jawahar A. Commentary: Leveraging Large Language Models for Radiology Education and Training. J Comput Assist Tomogr 2025:00004728-990000000-00433. [PMID: 40164970 DOI: 10.1097/rct.0000000000001736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2024] [Accepted: 01/12/2025] [Indexed: 04/02/2025]
Abstract
In the rapidly evolving landscape of medical education, artificial intelligence (AI) holds transformative potential. This manuscript explores the integration of large language models (LLMs) in Radiology education and training. These advanced AI tools, trained on vast data sets, excel in processing and generating human-like text, and have even demonstrated the ability to pass medical board exams. In radiology, LLMs enhance clinical education by providing interactive training environments that improve diagnostic skills and structured reporting. They also support research by streamlining literature reviews and automating data analysis, thus boosting productivity. However, their integration raises significant challenges, including the risk of over-reliance on AI, ethical concerns related to patient privacy, and potential biases in AI-generated content. This commentary from the Early Career Committee of the Society for Advanced Body Imaging (SABI) offers insights into the current applications and future possibilities of LLMs in Radiology education while being mindful of their limitations and ethical implications to optimize their use in the health care system.
Collapse
Affiliation(s)
- Shiva Singh
- Diagnostic Radiology, University of Arkansas for Medical Sciences, Little Rock, AR
| | - Aditi Chaurasia
- Diagnostic Radiology, University of Arkansas for Medical Sciences, Little Rock, AR
| | - Surbhi Raichandani
- Department of Radiology and Imaging Sciences, Emory University, Atlanta, GA
| | - Harpreet Grewal
- Radiology, Florida State University College of Medicine, Pensacola, FL
| | - Ashlesha Udare
- Department of Radiology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA
| | - Anugayathri Jawahar
- Department of Radiology, Northwestern Memorial Hospital, Northwestern University Feinberg School of Medicine, Chicago, IL
| |
Collapse
|
23
|
Tong MW, Zhou J, Akkaya Z, Majumdar S, Bhattacharjee R. Artificial intelligence in musculoskeletal applications: a primer for radiologists. Diagn Interv Radiol 2025; 31:89-101. [PMID: 39157958 PMCID: PMC11880867 DOI: 10.4274/dir.2024.242830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2024] [Accepted: 07/11/2024] [Indexed: 08/20/2024]
Abstract
As an umbrella term, artificial intelligence (AI) covers machine learning and deep learning. This review aimed to elaborate on these terms to act as a primer for radiologists to learn more about the algorithms commonly used in musculoskeletal radiology. It also aimed to familiarize them with the common practices and issues in the use of AI in this domain.
Collapse
Affiliation(s)
- Michelle W. Tong
- University of California San Francisco Department of Radiology and Biomedical Imaging, San Francisco, USA
- University of California San Francisco Department of Bioengineering, San Francisco, USA
- University of California Berkeley Department of Bioengineering, Berkeley, USA
| | - Jiamin Zhou
- University of California San Francisco Department of Orthopaedic Surgery, San Francisco, USA
| | - Zehra Akkaya
- University of California San Francisco Department of Radiology and Biomedical Imaging, San Francisco, USA
- Ankara University Faculty of Medicine Department of Radiology, Ankara, Türkiye
| | - Sharmila Majumdar
- University of California San Francisco Department of Radiology and Biomedical Imaging, San Francisco, USA
- University of California San Francisco Department of Bioengineering, San Francisco, USA
| | - Rupsa Bhattacharjee
- University of California San Francisco Department of Radiology and Biomedical Imaging, San Francisco, USA
| |
Collapse
|
24
|
Güneş YC, Cesur T, Çamur E, Karabekmez LG. Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5th edition. Diagn Interv Radiol 2025; 31:111-129. [PMID: 39248152 PMCID: PMC11880873 DOI: 10.4274/dir.2024.242876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2024] [Accepted: 08/24/2024] [Indexed: 09/10/2024]
Abstract
PURPOSE This study aimed to evaluate the performance of large language models (LLMs) and multimodal LLMs in interpreting the Breast Imaging Reporting and Data System (BI-RADS) categories and providing clinical management recommendations for breast radiology in text-based and visual questions. METHODS This cross-sectional observational study involved two steps. In the first step, we compared ten LLMs (namely ChatGPT 4o, ChatGPT 4, ChatGPT 3.5, Google Gemini 1.5 Pro, Google Gemini 1.0, Microsoft Copilot, Perplexity, Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Opus 200K), general radiologists, and a breast radiologist using 100 text-based multiple-choice questions (MCQs) related to the BI-RADS Atlas 5th edition. In the second step, we assessed the performance of five multimodal LLMs (ChatGPT 4o, ChatGPT 4V, Claude 3.5 Sonnet, Claude 3 Opus, and Google Gemini 1.5 Pro) in assigning BI-RADS categories and providing clinical management recommendations on 100 breast ultrasound images. The comparison of correct answers and accuracy by question types was analyzed using McNemar's and chi-squared tests. Management scores were analyzed using the Kruskal-Wallis and Wilcoxon tests. RESULTS Claude 3.5 Sonnet achieved the highest accuracy in text-based MCQs (90%), followed by ChatGPT 4o (89%), outperforming all other LLMs and general radiologists (78% and 76%) (P < 0.05), except for the Claude 3 Opus models and the breast radiologist (82%) (P > 0.05). Lower-performing LLMs included Google Gemini 1.0 (61%) and ChatGPT 3.5 (60%). Performance across different question categories showed no significant variation among LLMs or radiologists (P > 0.05). For breast ultrasound images, Claude 3.5 Sonnet achieved 59% accuracy, significantly higher than other multimodal LLMs (P < 0.05). Management recommendations were evaluated using a 3-point Likert scale, with Claude 3.5 Sonnet scoring the highest (mean: 2.12 ± 0.97) (P < 0.05). Accuracy varied significantly across BI-RADS categories, except Claude 3 Opus (P < 0.05). Gemini 1.5 Pro failed to answer any BI-RADS 5 questions correctly. Similarly, ChatGPT 4V failed to answer any BI-RADS 1 questions correctly, making them the least accurate in these categories (P < 0.05). CONCLUSION Although LLMs such as Claude 3.5 Sonnet and ChatGPT 4o show promise in text-based BI-RADS assessments, their limitations in visual diagnostics suggest they should be used cautiously and under radiologists' supervision to avoid misdiagnoses. CLINICAL SIGNIFICANCE This study demonstrates that while LLMs exhibit strong capabilities in text-based BI-RADS assessments, their visual diagnostic abilities are currently limited, necessitating further development and cautious application in clinical practice.
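The paired comparison of correct answers between two readers (for example, an LLM and a radiologist answering the same MCQs) can be run with McNemar's test, as in the sketch below. The 2x2 table is illustrative, not study data.

```python
# Sketch: McNemar's test on paired correct/incorrect answers from two readers.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: LLM correct / incorrect; columns: radiologist correct / incorrect
table = np.array([
    [70, 19],   # both correct / only the LLM correct
    [6,  5],    # only the radiologist correct / both incorrect
])

result = mcnemar(table, exact=True)   # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```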
Collapse
Affiliation(s)
- Yasin Celal Güneş
- Kırıkkale Yüksek İhtisas Hospital Clinic of Radiology, Kırıkkale, Türkiye
| | - Turay Cesur
- Mamak State Hospital Clinic of Radiology, Ankara, Türkiye
| | - Eren Çamur
- Ankara 29 Mayıs State Hospital Clinic of Radiology, Ankara, Türkiye
| | - Leman Günbey Karabekmez
- Ankara Yıldırım Beyazıt University Faculty of Medicine Department of Radiology, Ankara, Türkiye
| |
Collapse
|
25
|
Pinard CJ, Poon AC, Lagree A, Wu K, Li J, Tran WT. Precision in Parsing: Evaluation of an Open-Source Named Entity Recognizer (NER) in Veterinary Oncology. Vet Comp Oncol 2025; 23:102-108. [PMID: 39711253 PMCID: PMC11830456 DOI: 10.1111/vco.13035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2024] [Revised: 11/14/2024] [Accepted: 12/02/2024] [Indexed: 12/24/2024]
Abstract
Integrating Artificial Intelligence (AI) through Natural Language Processing (NLP) can improve veterinary medical oncology clinical record analytics. Named Entity Recognition (NER), a critical component of NLP, can facilitate efficient data extraction and automated labelling for research and clinical decision-making. This study assesses the efficacy of the Bio-Epidemiology-NER (BioEN), an open-source NER developed using human epidemiological and medical data, on veterinary medical oncology records. The NER's performance was compared with manual annotations by a veterinary medical oncologist and a veterinary intern. Evaluation metrics included Jaccard similarity, intra-rater reliability, ROUGE scores, and standard NER performance metrics (precision, recall, F1-score). Results indicate poor direct translatability to veterinary medical oncology record text and room for improvement in the NER's performance, with precision, recall, and F1-score suggesting a marginally better alignment with the oncologist than the intern. While challenges remain, these insights contribute to the ongoing development of AI tools tailored for veterinary healthcare and highlight the need for veterinary-specific models.
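The span-level evaluation described here (Jaccard similarity plus precision, recall, and F1 against a human annotator) reduces to simple set arithmetic. The sketch below uses illustrative entity strings, not the study's annotations.

```python
# Sketch: comparing NER output with a reference annotation using set metrics.
def evaluate_entities(predicted: set[str], reference: set[str]) -> dict[str, float]:
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    jaccard = len(predicted & reference) / len(predicted | reference) if (predicted | reference) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "jaccard": jaccard}

ner_output = {"lymphoma", "vincristine", "thoracic radiograph"}
oncologist = {"lymphoma", "vincristine", "prednisone", "thoracic radiograph"}
print(evaluate_entities(ner_output, oncologist))
# precision 1.0, recall 0.75, F1 ~0.86, Jaccard 0.75
```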
Collapse
Affiliation(s)
- Christopher J. Pinard
- Department of Clinical StudiesOntario Veterinary College, University of GuelphGuelphOntarioCanada
- Department of OncologyLakeshore Animal Health PartnersMississaugaOntarioCanada
- Centre for Advancing Responsible & Ethical Artificial Intelligence, University of GuelphGuelphOntarioCanada
- Radiogenomics Laboratory, Sunnybrook Health Sciences CentreTorontoOntarioCanada
- ANI.ML Research, ANI.ML Health Inc.TorontoOntarioCanada
| | - Andrew C. Poon
- VCA Mississauga Oakville Veterinary Emergency HospitalMississaugaOntarioCanada
| | - Andrew Lagree
- Radiogenomics Laboratory, Sunnybrook Health Sciences CentreTorontoOntarioCanada
- ANI.ML Research, ANI.ML Health Inc.TorontoOntarioCanada
- Odette Cancer Program, Sunnybrook Health Sciences CentreTorontoOntarioCanada
| | - Kuan‐Chuen Wu
- ANI.ML Research, ANI.ML Health Inc.TorontoOntarioCanada
| | - Jiaxu Li
- Radiogenomics Laboratory, Sunnybrook Health Sciences CentreTorontoOntarioCanada
| | - William T. Tran
- Radiogenomics Laboratory, Sunnybrook Health Sciences CentreTorontoOntarioCanada
- Odette Cancer Program, Sunnybrook Health Sciences CentreTorontoOntarioCanada
- Department of Radiation OncologyUniversity of TorontoTorontoOntarioCanada
- Temerty Centre for AI Research and Education in Medicine, University of TorontoTorontoOntarioCanada
| |
Collapse
|
26
|
Kaba E, Beyazal M, Çeliker FB, Yel İ, Vogl TJ. Accuracy and Readability of ChatGPT on Potential Complications of Interventional Radiology Procedures: AI-Powered Patient Interviewing. Acad Radiol 2025; 32:1547-1553. [PMID: 39551684 DOI: 10.1016/j.acra.2024.10.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2024] [Revised: 10/18/2024] [Accepted: 10/20/2024] [Indexed: 11/19/2024]
Abstract
RATIONALE AND OBJECTIVES It is crucial to inform the patient about potential complications and obtain consent before interventional radiology procedures. In this study, we investigated the accuracy, reliability, and readability of the information provided by ChatGPT-4 about potential complications of interventional radiology procedures. MATERIALS AND METHODS Potential major and minor complications of 25 different interventional radiology procedures (8 non-vascular, 17 vascular) were posed to the ChatGPT-4 chatbot. The responses were evaluated by two experienced interventional radiologists (>25 years and 10 years of experience) using a 5-point Likert scale according to Cardiovascular and Interventional Radiological Society of Europe guidelines. The correlation between the two interventional radiologists' scoring was evaluated by the Wilcoxon signed-rank test, Intraclass Correlation Coefficient (ICC), and Pearson correlation coefficient (PCC). In addition, readability and complexity were quantitatively assessed using the Flesch-Kincaid Grade Level, Flesch Reading Ease scores, and Simple Measure of Gobbledygook (SMOG) index. RESULTS Interventional radiologist 1 (IR1) and interventional radiologist 2 (IR2) gave 104 and 109 points, respectively, out of a potential 125 points for the total of all procedures. There was no statistically significant difference between the total scores of the two IRs (p = 0.244). The IRs demonstrated high agreement across all procedure ratings (ICC=0.928). Both IRs scored 34 out of 40 points for the eight non-vascular procedures. The 17 vascular procedures received 70 points out of 85 from IR1 and 75 from IR2. The agreement between the two observers' assessments was good, with PCC values of 0.908 and 0.896 for non-vascular and vascular procedures, respectively. Readability levels were overall low. The mean Flesch-Kincaid Grade Level, Flesch Reading Ease scores, and SMOG index were 12.51 ± 1.14 (college level), 30.27 ± 8.38 (college level), and 14.46 ± 0.76 (college level), respectively. There was no statistically significant difference in readability between non-vascular and vascular procedures (p = 0.16). CONCLUSION ChatGPT-4 demonstrated remarkable performance, highlighting its potential to enhance accessibility to information about interventional radiology procedures and support the creation of educational materials for patients. Based on the findings of our study, while ChatGPT provides accurate information and shows no evidence of hallucinations, it is important to emphasize that a high level of education and health literacy are required to fully comprehend its responses.
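The readability indices used in this kind of analysis are standard formulas and can be computed, for example, with the textstat package. Whether the authors used textstat is an assumption; the sample text below is invented.

```python
# Sketch: Flesch-Kincaid, Flesch Reading Ease, and SMOG on a sample consent text.
import textstat

consent_text = (
    "Percutaneous liver biopsy carries a small risk of bleeding, infection, "
    "and injury to adjacent organs. Most complications occur within the first "
    "few hours after the procedure. Serious bleeding that requires transfusion "
    "is uncommon."
)

print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(consent_text))
print("Flesch Reading Ease:       ", textstat.flesch_reading_ease(consent_text))
print("SMOG Index:                ", textstat.smog_index(consent_text))
```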
Collapse
Affiliation(s)
- Esat Kaba
- Recep Tayyip Erdogan University, Department of Radiology, Rize, Turkey (E.K., M.B., F.B.C.).
| | - Mehmet Beyazal
- Recep Tayyip Erdogan University, Department of Radiology, Rize, Turkey (E.K., M.B., F.B.C.)
| | - Fatma Beyazal Çeliker
- Recep Tayyip Erdogan University, Department of Radiology, Rize, Turkey (E.K., M.B., F.B.C.)
| | - İbrahim Yel
- University Hospital Frankfurt, Department of Diagnostic and Interventional Radiology, Frankfurt, Germany (I.Y., T.J.V.)
| | - Thomas J Vogl
- University Hospital Frankfurt, Department of Diagnostic and Interventional Radiology, Frankfurt, Germany (I.Y., T.J.V.)
| |
Collapse
|
27
|
Dietzel M, Resch A, Baltzer PAT. [Artificial intelligence in breast imaging : Hopes and challenges]. RADIOLOGIE (HEIDELBERG, GERMANY) 2025; 65:187-193. [PMID: 39915299 PMCID: PMC11845416 DOI: 10.1007/s00117-024-01409-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 12/16/2024] [Indexed: 02/22/2025]
Abstract
CLINICAL/METHODICAL ISSUE Artificial intelligence (AI) is being increasingly integrated into clinical practice. However, the specific benefits are still unclear to many users. STANDARD RADIOLOGICAL METHODS In principle, AI applications are available for all imaging modalities, with a particular focus on mammography in breast diagnostics. METHODICAL INNOVATIONS AI promises to filter examinations into negative and clearly positive findings, thereby reducing part of the radiological workload. Other applications are not yet as widely established. PERFORMANCE AI methods for mammography, and to a lesser extent tomosynthesis, have already reached the diagnostic quality of radiologists. ACHIEVEMENTS Except for second-opinion applications/triage in mammography, most methods are still under development. PRACTICAL RECOMMENDATIONS Currently, most AI applications must be critically evaluated by potential users regarding their maturity and practical benefits.
Collapse
Affiliation(s)
- Matthias Dietzel
- Department of Radiology, University Hospital Erlangen, Erlangen, Deutschland
| | - Alexandra Resch
- Department of Radiology, St. Francis Hospital Vienna, Sigmund Freud Private University Vienna, Vienna, Österreich
| | - Pascal A T Baltzer
- Department of Biomedical Imaging and Image-Guided Therapy, Division of Molecular and Gender Imaging, Medical University of Vienna, Waehringer-Guertel 18-20, 1090, Vienna, Österreich.
| |
Collapse
|
28
|
Hallinan JTPD, Leow NW, Ong W, Lee A, Low YX, Chan MDZ, Devi GK, Loh DDL, He SS, Nor FEM, Lim DSW, Teo EC, Low XZ, Furqan SM, Tham WWY, Tan JH, Kumar N, Makmur A, Ting Y. MRI spine request form enhancement and auto protocoling using a secure institutional large language model. Spine J 2025; 25:505-514. [PMID: 39536908 DOI: 10.1016/j.spinee.2024.10.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/23/2024] [Revised: 10/08/2024] [Accepted: 10/27/2024] [Indexed: 11/16/2024]
Abstract
BACKGROUND CONTEXT Secure institutional large language models (LLM) could reduce the burden of noninterpretative tasks for radiologists. PURPOSE Assess the utility of a secure institutional LLM for MRI spine request form enhancement and auto-protocoling. STUDY DESIGN/SETTING Retrospective study conducted from December 2023 to February 2024, including patients with clinical entries accessible on the electronic medical record (EMR). PATIENT SAMPLE Overall, 250 spine MRI request forms were analyzed from 218 patients (mean age = 55.9 years ± 18.9 [SD]; 108 women) across the cervical (n=56/250, 22.4%), thoracic (n=13/250, 5.2%), lumbar (n=166/250, 66.4%), and whole (n=15/250, 6.0%) spine. Of these, 60/250 (24.0%) required contrast and 41/250 (16.4%) had prior spine surgery/instrumentation. OUTCOME MEASURES Primary-Adequacy of clinical information on clinician and LLM-augmented request forms were rated using a four-point scale. Secondary-Correct MRI protocol suggestion by LLM and first-year board-certified radiologists (Rad4 and Rad5) compared to a consensus reference standard. METHODS A secured institutional LLM (Claude 2.0) used a majority decision prompt (out of six runs) to enhance clinical information on clinician request forms using the EMR, and suggest the appropriate MRI protocol. The adequacy of clinical information on the clinician and LLM-augmented request forms was rated by three musculoskeletal radiologists independently (Rad1:10-years-experience; Rad2:12-years-experience; Rad3:10-years-experience). The same radiologists provided a consensus reference standard for the correct protocol, which was compared to the protocol suggested by the LLM and two first-year board-certified radiologists (Rad4 and Rad5). Overall agreement (Fleiss kappas for inter-rater agreement or % agreement with the reference standard and respective 95%CIs) were provided where appropriate. RESULTS LLM-augmented forms were rated by Rads 1-3 as having adequate clinical information in 93.6-96.0% of cases compared to 46.8-58.8% of the clinician request forms (p<0.01). Substantial interobserver agreement was observed with kappas of 0.71 (95% CI: 0.67-0.76) for original forms and 0.66 (95% CI: 0.61-0.72) for LLM-enhanced requests. Rads 1-3 showed almost perfect agreement on protocol decisions, with kappas of 0.99 (95% CI: 0.94-1.0) for spine region selection, 0.93 (95% CI: 0.86-1.0) for contrast necessity, and 0.93 (95% CI: 0.86-0.99) for recognition of prior spine surgery. Compared to the consensus reference standard, the LLM suggested the correct protocol in 78.4% (196/250, p<0.01) of cases, albeit inferior to Rad4 (90.0%, p<0.01) and Rad5 (89.2%, p<0.01). The secure LLM did best in identifying spinal instrumentation in 39/41 (95.1%) cases, improved compared to Rad4 (61.0%) and Rad5 (41.5%) (both p<0.01). The secure LLM had high consistency with 227/250 cases (90.8%) having 100% (6/6 runs) agreement. CONCLUSIONS Enhancing spine MRI request forms with a secure institutional LLM improved the adequacy of clinical information. The LLM also accurately suggested the correct protocol in 78.4% of cases which could optimize the MRI workflow.
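The majority-decision strategy described in the methods (keeping the answer most frequent across six runs) is easy to sketch. The responses below simulate six runs of a hypothetical institutional LLM call; they are not study data.

```python
# Sketch: majority vote over repeated LLM runs of the same protocoling prompt.
from collections import Counter

def majority_protocol(responses: list[str]) -> tuple[str, int]:
    votes = Counter(responses)
    protocol, count = votes.most_common(1)[0]
    return protocol, count

runs = [
    "MRI lumbar spine without contrast",
    "MRI lumbar spine without contrast",
    "MRI lumbar spine without contrast",
    "MRI lumbar spine with and without contrast",
    "MRI lumbar spine without contrast",
    "MRI lumbar spine without contrast",
]
protocol, votes = majority_protocol(runs)
print(protocol, f"({votes}/6 runs)")
# A 6/6 vote signals a highly consistent answer; split votes can be flagged
# for radiologist review before protocoling.
```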
Collapse
Affiliation(s)
- James Thomas Patrick Decourcy Hallinan
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
| | - Naomi Wenxin Leow
- AIO Innovation Office, National University Health System, 3 Research Link #02-04 Innovation 4.0, Singapore, 117602, Singapore
| | - Wilson Ong
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Aric Lee
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Yi Xian Low
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Matthew Ding Zhou Chan
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Ganakirthana Kalpenya Devi
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Daniel De-Liang Loh
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Stephanie Shengjie He
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Faimee Erwan Muhamat Nor
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Desmond Shi Wei Lim
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Ee Chin Teo
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Xi Zhen Low
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore
| | - Shaheryar Mohammad Furqan
- Division of Biomedical Informatics, Department of Surgery, Yong Loo Lin School of Medicine NUS, 16 Science Drive 4, Singapore 117558, Singapore
| | - Wilson Wei Yang Tham
- University Spine centre, University Orthopaedics, Hand and Reconstructive Microsurgery (UOHC), National University Health System, Singapore
| | - Jiong Hao Tan
- University Spine centre, University Orthopaedics, Hand and Reconstructive Microsurgery (UOHC), National University Health System, Singapore
| | - Naresh Kumar
- University Spine centre, University Orthopaedics, Hand and Reconstructive Microsurgery (UOHC), National University Health System, Singapore
| | - Andrew Makmur
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| | - Yonghan Ting
- Department of Diagnostic Imaging, National University Hospital, 5 Lower Kent Ridge Rd, Singapore 119074, Singapore; Department of Diagnostic Radiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
| |
Collapse
|
29
|
Lee KL, Kessler DA, Caglic I, Kuo YH, Shaida N, Barrett T. Assessing the performance of ChatGPT and Bard/Gemini against radiologists for Prostate Imaging-Reporting and Data System classification based on prostate multiparametric MRI text reports. Br J Radiol 2025; 98:368-374. [PMID: 39535870 PMCID: PMC11840166 DOI: 10.1093/bjr/tqae236] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 10/10/2024] [Accepted: 11/10/2024] [Indexed: 11/16/2024] Open
Abstract
OBJECTIVES Large language models (LLMs) have shown potential for clinical applications. This study assesses their ability to assign Prostate Imaging-Reporting and Data System (PI-RADS) categories based on clinical text reports. METHODS One hundred consecutive biopsy-naïve patients' multiparametric prostate MRI reports were independently classified by 2 uroradiologists, ChatGPT-3.5 (GPT-3.5), ChatGPT-4o mini (GPT-4), Bard, and Gemini. Original report classifications were considered definitive. RESULTS Out of 100 MRIs, 52 were originally reported as PI-RADS 1-2, 9 PI-RADS 3, 19 PI-RADS 4, and 20 PI-RADS 5. Radiologists demonstrated 95% and 90% accuracy, while GPT-3.5 and Bard both achieved 67%. Accuracy of the updated versions of LLMs increased to 83% (GPT-4) and 79% (Gemini), respectively. In low-suspicion studies (PI-RADS 1-2), Bard and Gemini (F1: 0.94, 0.98, respectively) outperformed GPT-3.5 and GPT-4 (F1: 0.77, 0.94, respectively), whereas for high-probability MRIs (PI-RADS 4-5), GPT-3.5 and GPT-4 (F1: 0.95, 0.98, respectively) outperformed Bard and Gemini (F1: 0.71, 0.87, respectively). Bard assigned a non-existent PI-RADS 6 "hallucination" for 2 patients. Inter-reader agreements (Κ) between the original reports and the senior radiologist, junior radiologist, GPT-3.5, GPT-4, Bard, and Gemini were 0.93, 0.84, 0.65, 0.86, 0.57, and 0.81, respectively. CONCLUSIONS Radiologists demonstrated high accuracy in PI-RADS classification based on text reports, while GPT-3.5 and Bard exhibited poor performance. GPT-4 and Gemini demonstrated improved performance compared to their predecessors. ADVANCES IN KNOWLEDGE This study highlights the limitations of LLMs in accurately classifying PI-RADS categories from clinical text reports. While the performance of LLMs has improved with newer versions, caution is warranted before integrating such technologies into clinical practice.
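The inter-reader agreement figures quoted here are Cohen's kappa values, which can be computed as in the sketch below. The PI-RADS assignments are illustrative placeholders; the weighted variant is shown for contrast, not as the study's method.

```python
# Sketch: Cohen's kappa between original-report PI-RADS categories and an LLM's
# assignments (placeholder values).
from sklearn.metrics import cohen_kappa_score

original = [2, 2, 4, 5, 3, 1, 4, 5, 2, 3]   # PI-RADS from the original reports
llm      = [2, 3, 4, 5, 4, 1, 4, 4, 2, 3]   # PI-RADS assigned by the LLM

# Unweighted kappa treats all disagreements equally; a weighted kappa would
# penalize a 2-vs-5 disagreement more than a 2-vs-3 disagreement.
print("kappa:", cohen_kappa_score(original, llm))
print("weighted kappa:", cohen_kappa_score(original, llm, weights="quadratic"))
```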
Collapse
Affiliation(s)
- Kang-Lung Lee
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Taipei Veterans General Hospital, Taipei 112, Taiwan
- School of Medicine, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
| | - Dimitri A Kessler
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Iztok Caglic
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Yi-Hsin Kuo
- Department of Radiology, Taipei Veterans General Hospital, Taipei 112, Taiwan
| | - Nadeem Shaida
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| | - Tristan Barrett
- Department of Radiology, University of Cambridge, Cambridge CB2 0QQ, United Kingdom
- Department of Radiology, Cambridge University Hospitals NHS Foundation Trust Addenbrooke’s Hospital, Cambridge CB2 0QQ, United Kingdom
| |
Collapse
|
30
|
Anisuzzaman D, Malins JG, Friedman PA, Attia ZI. Fine-Tuning Large Language Models for Specialized Use Cases. MAYO CLINIC PROCEEDINGS. DIGITAL HEALTH 2025; 3:100184. [PMID: 40206998 PMCID: PMC11976015 DOI: 10.1016/j.mcpdig.2024.11.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/02/2024] [Revised: 11/06/2024] [Accepted: 11/18/2024] [Indexed: 04/11/2025]
Abstract
Large language models (LLMs) are a type of artificial intelligence, which operate by predicting and assembling sequences of words that are statistically likely to follow from a given text input. With this basic ability, LLMs are able to answer complex questions and follow extremely complex instructions. Products created using LLMs such as ChatGPT by OpenAI and Claude by Anthropic have created a huge amount of traction and user engagements and revolutionized the way we interact with technology, bringing a new dimension to human-computer interaction. Fine-tuning is a process in which a pretrained model, such as an LLM, is further trained on a custom data set to adapt it for specialized tasks or domains. In this review, we outline some of the major methodologic approaches and techniques that can be used to fine-tune LLMs for specialized use cases and enumerate the general steps required for carrying out LLM fine-tuning. We then illustrate a few of these methodologic approaches by describing several specific use cases of fine-tuning LLMs across medical subspecialties. Finally, we close with a consideration of some of the benefits and limitations associated with fine-tuning LLMs for specialized use cases, with an emphasis on specific concerns in the field of medicine.
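One of the parameter-efficient approaches such reviews describe, LoRA, can be sketched with the Hugging Face peft library as below. The base model name, target modules, and hyperparameters are illustrative assumptions rather than a recommended recipe.

```python
# Minimal LoRA fine-tuning setup sketch with Hugging Face transformers + peft.
# Model name and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"            # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # typically well under 1% of the base weights

# The adapted model would then be trained on the specialized corpus (e.g.,
# de-identified clinical notes) with a standard Trainer loop, keeping the base
# weights frozen.
```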
Collapse
Affiliation(s)
- D.M. Anisuzzaman
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN
| | | | - Paul A. Friedman
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN
| | - Zachi I. Attia
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN
| |
Collapse
|
31
|
Mese I, Kocak B. Large language models in methodological quality evaluation of radiomics research based on METRICS: ChatGPT vs NotebookLM vs radiologist. Eur J Radiol 2025; 184:111960. [PMID: 39938163 DOI: 10.1016/j.ejrad.2025.111960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2024] [Revised: 01/14/2025] [Accepted: 01/28/2025] [Indexed: 02/14/2025]
Abstract
OBJECTIVES This study aimed to evaluate the effectiveness of large language models (LLM) in assessing the methodological quality of radiomics research, using the METhodological RadiomICs Score (METRICS) tool. METHODS This study included open access radiomic research articles published in 2024 across various journals and a preprint repository, all under the Creative Commons Attribution License. Each study was independently evaluated using METRICS by two LLMs, ChatGPT-4 and NotebookLM, and a consensus assessment performed by two radiologists with expertise in radiomics research. RESULTS A total of 48 open access articles were included in this study. ChatGPT-4, NotebookLM, and human readers achieved median scores of 79.5%, 61.6%, and 69.0%, respectively, with a statistically significant difference across these evaluations (p < 0.05). Pairwise comparisons indicated no statistically significant difference for NotebookLM vs human experts (p > 0.05), in contrast to other pairs (p < 0.05). Intraclass correlation coefficient (ICC) for ChatGPT-4 and human experts was 0.563 (95% CI: 0.050-0.795), corresponding to poor to good agreement. The ICC for ChatGPT-4 and NotebookLM and for human experts and NotebookLM were 0.391 (95% CI: -0.031 to 0.665) and 0.555 (95% CI: 0.326-0.723), respectively, indicating poor to moderate agreement. LLMs completed the tasks in a significantly shorter time (p < 0.05). In item-wise reliability analysis, ChatGPT-4 generally demonstrated higher consistency than NotebookLM. CONCLUSION LLMs hold promise for automatically evaluating the quality of radiomics research using METRICS, a new tool that is relatively more complex yet comprehensive compared to its counterparts. However, substantial improvements are needed for full alignment with human experts.
Collapse
Affiliation(s)
- Ismail Mese
- Department of Radiology, Uskudar State Hospital, Istanbul 34662, Turkey; Department of Radiology, University of Health Sciences, Basaksehir Cam and Sakura City Hospital, Istanbul 34480, Turkey.
| | - Burak Kocak
- Department of Radiology, Uskudar State Hospital, Istanbul 34662, Turkey; Department of Radiology, University of Health Sciences, Basaksehir Cam and Sakura City Hospital, Istanbul 34480, Turkey.
| |
Collapse
|
32
|
Leutz-Schmidt P, Palm V, Mathy RM, Grözinger M, Kauczor HU, Jang H, Sedaghat S. Performance of Large Language Models ChatGPT and Gemini on Workplace Management Questions in Radiology. Diagnostics (Basel) 2025; 15:497. [PMID: 40002648 PMCID: PMC11854386 DOI: 10.3390/diagnostics15040497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/20/2024] [Revised: 02/17/2025] [Accepted: 02/17/2025] [Indexed: 02/27/2025] Open
Abstract
Background/Objectives: Despite the growing popularity of large language models (LLMs), there remains a notable lack of research examining their role in workplace management. This study aimed to address this gap by evaluating the performance of ChatGPT-3.5, ChatGPT-4.0, Gemini, and Gemini Advanced as widely used LLMs in responding to workplace management questions specific to radiology. Methods: ChatGPT-3.5 and ChatGPT-4.0 (both OpenAI, San Francisco, CA, USA) and Gemini and Gemini Advanced (both Google DeepMind, Mountain View, CA, USA) generated answers to 31 pre-selected questions on four different areas of workplace management in radiology: (1) patient management, (2) imaging and radiation management, (3) learning and personal development, and (4) administrative and department management. Two readers independently evaluated the answers provided by the LLM chatbots. Three 4-point scores were used to assess the quality of the responses: (1) overall quality score (OQS), (2) understandability score (US), and (3) implementability score (IS). The mean quality score (MQS) was calculated from these three scores. Results: The overall inter-rater reliability (IRR) was good for Gemini Advanced (IRR 79%), Gemini (IRR 78%), and ChatGPT-3.5 (IRR 65%), and moderate for ChatGPT-4.0 (IRR 54%). The overall MQS averaged 3.36 (SD: 0.64) for ChatGPT-3.5, 3.75 (SD: 0.43) for ChatGPT-4.0, 3.29 (SD: 0.64) for Gemini, and 3.51 (SD: 0.53) for Gemini Advanced. The highest OQS, US, IS, and MQS were achieved by ChatGPT-4.0 in all categories, followed by Gemini Advanced. ChatGPT-4.0 was the most consistently superior performer and outperformed all other chatbots (p < 0.001-0.002). Gemini Advanced performed significantly better than Gemini (p = 0.003) and showed a non-significant trend toward outperforming ChatGPT-3.5 (p = 0.056). ChatGPT-4.0 provided superior answers in most cases compared with the other LLM chatbots. None of the answers provided by the chatbots were rated "insufficient". Conclusions: All four LLM chatbots performed well on workplace management questions in radiology. ChatGPT-4.0 outperformed ChatGPT-3.5, Gemini, and Gemini Advanced. Our study revealed that LLMs have the potential to improve workplace management in radiology by assisting with various tasks, making these processes more efficient without requiring specialized management skills.
Collapse
Affiliation(s)
- Patricia Leutz-Schmidt
- Department of Diagnostic and Interventional Radiology, University Hospital Heidelberg, 69120 Heidelberg, Germany; (P.L.-S.); (V.P.); (R.M.M.); (H.-U.K.)
| | - Viktoria Palm
- Department of Diagnostic and Interventional Radiology, University Hospital Heidelberg, 69120 Heidelberg, Germany; (P.L.-S.); (V.P.); (R.M.M.); (H.-U.K.)
| | - René Michael Mathy
- Department of Diagnostic and Interventional Radiology, University Hospital Heidelberg, 69120 Heidelberg, Germany; (P.L.-S.); (V.P.); (R.M.M.); (H.-U.K.)
| | - Martin Grözinger
- German Cancer Research Center (DKFZ), 69120 Heidelberg, Germany;
| | - Hans-Ulrich Kauczor
- Department of Diagnostic and Interventional Radiology, University Hospital Heidelberg, 69120 Heidelberg, Germany; (P.L.-S.); (V.P.); (R.M.M.); (H.-U.K.)
| | - Hyungseok Jang
- Department of Radiology, University of California Davis, Davis, CA 95616, USA;
| | - Sam Sedaghat
- Department of Diagnostic and Interventional Radiology, University Hospital Heidelberg, 69120 Heidelberg, Germany; (P.L.-S.); (V.P.); (R.M.M.); (H.-U.K.)
| |
Collapse
|
33
|
Ahyad RA, Zaylaee Y, Hassan T, Khoja O, Noorelahi Y, Alharthy A, Alabsi H, Mimish R, Badeeb A. Cutting Edge to Cutting Time: Can ChatGPT Improve the Radiologist's Reporting? JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2025; 38:346-356. [PMID: 39020157 PMCID: PMC11811338 DOI: 10.1007/s10278-024-01196-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/04/2024] [Revised: 06/20/2024] [Accepted: 07/05/2024] [Indexed: 07/19/2024]
Abstract
Structured reports (SR) in radiology have many advantages over free text (FT), but the wide implementation of SR is still lagging. A powerful tool such as GPT-4 can address this issue. We aim to employ a web-based reporting tool powered by GPT-4 capable of converting FT to SR and then evaluate its impact on reporting time and report quality. Thirty abdominopelvic CT scans were reported by two radiologists across two sessions (15 scans each): a control session using traditional reporting methods and an AI-assisted session employing a GPT-4-powered web application to structure free text into structured reports. For each radiologist, the output included 15 control finalized reports, 15 AI-assisted pre-edits, and 15 post-edit finalized reports. Reporting turnaround times were assessed, including total reporting time (TRT) and case reporting time (TATc). Quality assessments were conducted by two blinded radiologists. TRT and TATc have decreased with the use of the AI-assisted reporting tool, although not statistically significant (p-value > 0.05). Mean TATc for RAD-1 decreased from 00:20:08 to 00:16:30 (hours:minutes:seconds) and TRT decreased from 05:02:00 to 04:08:00. Mean TATc for RAD-2 decreased from 00:12:04 to 00:10:04 and TRT decreased from 03:01:00 to 02:31:00. Quality scores of the finalized reports with and without AI assistance were comparable with no significant differences. Adjusting the AI-assisted TATc by removing the editing time showed statistically significant results compared to the control for both radiologists (p-value < 0.05). The AI-assisted reporting tool can generate SR while reducing TRT and TATc without sacrificing report quality. Editing time is a potential area for further improvement.
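A paired comparison of per-case reporting times with and without AI assistance could look like the sketch below. The times are invented, in minutes, and the use of the Wilcoxon signed-rank test here is an assumption rather than the study's stated analysis.

```python
# Sketch: paired comparison of per-case reporting times (TATc) with and without
# AI assistance, using illustrative values in minutes.
from scipy.stats import wilcoxon

tatc_control = [21.5, 18.0, 25.2, 19.8, 22.1, 17.4, 20.9, 23.3]
tatc_ai      = [17.2, 16.5, 21.0, 18.9, 18.4, 15.8, 19.5, 20.1]

stat, p = wilcoxon(tatc_control, tatc_ai)
print(f"Wilcoxon statistic={stat}, p={p:.3f}")

# Subtracting the time spent editing the AI draft from tatc_ai (as the authors did)
# separates the generation benefit from the verification overhead.
```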
Collapse
Affiliation(s)
- Rayan A Ahyad
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia.
| | - Yasir Zaylaee
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Tasneem Hassan
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ohood Khoja
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Yasser Noorelahi
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Ahmed Alharthy
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Hatim Alabsi
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Reem Mimish
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Arwa Badeeb
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| |
Collapse
|
34
|
Koyun M, Taskent I. Evaluation of Advanced Artificial Intelligence Algorithms' Diagnostic Efficacy in Acute Ischemic Stroke: A Comparative Analysis of ChatGPT-4o and Claude 3.5 Sonnet Models. J Clin Med 2025; 14:571. [PMID: 39860577 PMCID: PMC11765597 DOI: 10.3390/jcm14020571] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2024] [Revised: 01/15/2025] [Accepted: 01/16/2025] [Indexed: 01/27/2025] Open
Abstract
Background/Objectives: Acute ischemic stroke (AIS) is a leading cause of mortality and disability worldwide, with early and accurate diagnosis being critical for timely intervention and improved patient outcomes. This retrospective study aimed to assess the diagnostic performance of two advanced artificial intelligence (AI) models, Chat Generative Pre-trained Transformer (ChatGPT-4o) and Claude 3.5 Sonnet, in identifying AIS from diffusion-weighted imaging (DWI). Methods: The DWI images of a total of 110 cases (AIS group: n = 55, healthy controls: n = 55) were provided to the AI models via standardized prompts. The models' responses were compared to radiologists' gold-standard evaluations, and performance metrics such as sensitivity, specificity, and diagnostic accuracy were calculated. Results: Both models exhibited a high sensitivity for AIS detection (ChatGPT-4o: 100%, Claude 3.5 Sonnet: 94.5%). However, ChatGPT-4o demonstrated a significantly lower specificity (3.6%) compared to Claude 3.5 Sonnet (74.5%). The agreement with radiologists was poor for ChatGPT-4o (κ = 0.036; 95% CI: -0.013, 0.085) but good for Claude 3.5 Sonnet (κ = 0.691; 95% CI: 0.558, 0.824). In terms of AIS hemispheric localization accuracy, Claude 3.5 Sonnet (67.2%) outperformed ChatGPT-4o (32.7%). Similarly, for specific AIS localization, Claude 3.5 Sonnet (30.9%) showed greater accuracy than ChatGPT-4o (7.3%), with these differences being statistically significant (p < 0.05). Conclusions: This study highlights the superior diagnostic performance of Claude 3.5 Sonnet compared to ChatGPT-4o in identifying AIS from DWI. Despite these advantages, both models demonstrated notable limitations in accuracy, emphasizing the need for further development before achieving full clinical applicability. These findings underline the potential of AI tools in radiological diagnostics while acknowledging their current limitations.
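For the metrics reported here (sensitivity, specificity, and Cohen's kappa with a 95% CI), the sketch below shows one way to compute them in Python. The label arrays are invented stand-ins that only mimic the study design (55 AIS cases, 55 controls) and an over-calling model; they are not the study's per-case data, and the bootstrap confidence interval is an assumption, since the abstract does not state how the intervals were derived.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, cohen_kappa_score

# Invented labels mimicking 55 AIS cases and 55 controls; 1 = AIS present, 0 = no stroke.
reference  = np.array([1] * 55 + [0] * 55)            # radiologists' gold standard
model_pred = np.array([1] * 55 + [1] * 53 + [0] * 2)  # high sensitivity, low specificity

tn, fp, fn, tp = confusion_matrix(reference, model_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
kappa = cohen_kappa_score(reference, model_pred)

# Approximate 95% CI for kappa via case-level bootstrap resampling.
rng = np.random.default_rng(0)
boot = []
for _ in range(2000):
    idx = rng.integers(0, len(reference), len(reference))
    boot.append(cohen_kappa_score(reference[idx], model_pred[idx]))
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])

print(f"sensitivity={sensitivity:.3f}, specificity={specificity:.3f}, "
      f"kappa={kappa:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```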
Collapse
Affiliation(s)
- Mustafa Koyun
- Department of Radiology, Kastamonu Training and Research Hospital, Kastamonu 37150, Turkey
| | - Ismail Taskent
- Department of Radiology, Kastamonu University, Kastamonu 37150, Turkey;
| |
Collapse
|
35
|
Wei B. Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis. JMIR MEDICAL EDUCATION 2025; 11:e64284. [PMID: 39819381 PMCID: PMC11756834 DOI: 10.2196/64284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/14/2024] [Revised: 10/10/2024] [Accepted: 12/03/2024] [Indexed: 01/19/2025]
Abstract
Background Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ2 tests and ANOVA. Results GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18-0.60) for Claude, 0.24 (95% CI 0.13-0.44) for Bard, and 0.25 (95% CI 0.14-0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=0.02) and had an odds ratio of 0.48 (95% CI 0.27-0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models' effectiveness in specialized fields like radiology.
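The pairwise comparisons above (accuracy differences and odds ratios relative to GPT-4) can be recomputed from the correct/incorrect counts given in the abstract. The sketch below redoes the Claude-versus-GPT-4 comparison with a chi-square test and a Wald 95% CI for the odds ratio; the abstract does not state which CI method the author used, so small discrepancies from the reported intervals are expected.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Correct/incorrect counts taken from the abstract: GPT-4 125/150 vs Claude 93/150.
gpt4   = np.array([125, 25])   # [correct, incorrect]
claude = np.array([93, 57])

chi2, p, dof, _ = chi2_contingency(np.vstack([claude, gpt4]), correction=False)

# Odds ratio of Claude answering correctly relative to GPT-4, with a Wald 95% CI.
a, b = claude
c, d = gpt4
odds_ratio = (a * d) / (b * c)
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)

print(f"chi2={chi2:.2f}, p={p:.3g}, OR={odds_ratio:.2f} "
      f"(95% CI {ci[0]:.2f}-{ci[1]:.2f})")  # OR ~0.33, in line with the reported value
```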
Collapse
Affiliation(s)
- Boxiong Wei
- Department of Ultrasound, Peking University First Hospital, 8 Xishiku Rd, Xicheng District, Beijing, 100034, China
| |
Collapse
|
36
|
Sarangi PK, Panda BB, P. S, Pattanayak D, Panda S, Mondal H. Exploring Radiology Postgraduate Students' Engagement with Large Language Models for Educational Purposes: A Study of Knowledge, Attitudes, and Practices. Indian J Radiol Imaging 2025; 35:35-42. [PMID: 39697505 PMCID: PMC11651873 DOI: 10.1055/s-0044-1788605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/29/2024] Open
Abstract
Background The integration of large language models (LLMs) into medical education has received increasing attention as a potential tool to enhance learning experiences. However, there remains a need to explore radiology postgraduate students' engagement with LLMs and their perceptions of their utility in medical education. Hence, we conducted this study to investigate radiology postgraduate students' knowledge, attitudes, and practices regarding LLMs in medical education. Materials and Methods A cross-sectional quantitative survey was conducted online on Google Forms. Participants from all over India were recruited via social media platforms and snowball sampling techniques. A previously validated questionnaire was used to assess knowledge, attitude, and practices regarding LLMs. Descriptive statistical analysis was employed to summarize participants' responses. Results A total of 252 (139 [55.16%] males and 113 [44.84%] females) radiology postgraduate students with a mean age of 28.33 ± 3.32 years participated in the study. Nearly half of the participants (47.62%) were familiar with LLMs, and most (71.82%) supported their potential incorporation into traditional teaching-learning tools. They are open to including LLMs as a learning tool (71.03%) and think that they would provide comprehensive medical information (62.7%). Residents turn to LLMs when they cannot find the desired information in books (46.43%) or through Internet search engines (59.13%). The overall scores for knowledge (3.52 ± 0.58), attitude (3.75 ± 0.51), and practice (3.15 ± 0.57) differed significantly (analysis of variance [ANOVA], p < 0.0001), with the highest score in attitude and the lowest in practice. However, no significant differences were found in the scores for knowledge (p = 0.64), attitude (p = 0.99), and practice (p = 0.25) depending on the year of training. Conclusion Radiology postgraduate students are familiar with LLMs and recognize their potential benefits in postgraduate radiology education. Although they have a positive attitude toward the use of LLMs, they are concerned about their limitations and use them only in limited situations for educational purposes.
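The domain comparison reported above is a one-way ANOVA across the knowledge, attitude, and practice scores. As a small illustration of that test, the sketch below simulates Likert-style scores with the means and SDs from the abstract; every individual value is made up, and since the same respondents rated all three domains a repeated-measures design would also be defensible, but the plain ANOVA mirrors what the abstract reports.

```python
import numpy as np
from scipy.stats import f_oneway

# Simulated domain scores for 252 respondents; means/SDs follow the abstract,
# individual values are invented purely for illustration.
rng = np.random.default_rng(42)
knowledge = np.clip(rng.normal(3.52, 0.58, 252), 1, 5)
attitude  = np.clip(rng.normal(3.75, 0.51, 252), 1, 5)
practice  = np.clip(rng.normal(3.15, 0.57, 252), 1, 5)

f_stat, p_value = f_oneway(knowledge, attitude, practice)
print(f"one-way ANOVA across domains: F={f_stat:.1f}, p={p_value:.1e}")
```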
Collapse
Affiliation(s)
- Pradosh Kumar Sarangi
- Department of Radiodiagnosis, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Braja Behari Panda
- Department of Radiodiagnosis, Veer Surendra Sai Institute of Medical Sciences and Research, Burla, Odisha, India
| | - Sanjay P.
- Department of Radiodiagnosis, Mysore Medical College and Research Institute, Mysore, India
| | - Debabrata Pattanayak
- Department of Radiodiagnosis, Veer Surendra Sai Institute of Medical Sciences and Research, Burla, Odisha, India
| | - Swaha Panda
- Department of Otorhinolaryngology and Head and Neck Surgery, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| | - Himel Mondal
- Department of Physiology, All India Institute of Medical Sciences, Deoghar, Jharkhand, India
| |
Collapse
|
37
|
Chen LC, Zack T, Demirci A, Sushil M, Miao B, Kasap C, Butte A, Collisson EA, Hong JC. Assessing Large Language Models for Oncology Data Inference From Radiology Reports. JCO Clin Cancer Inform 2024; 8:e2400126. [PMID: 39661914 DOI: 10.1200/cci.24.00126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2024] [Revised: 08/14/2024] [Accepted: 09/23/2024] [Indexed: 12/13/2024] Open
Abstract
PURPOSE We examined the effectiveness of proprietary and open large language models (LLMs) in detecting disease presence, location, and treatment response in pancreatic cancer from radiology reports. METHODS We analyzed 203 deidentified radiology reports, manually annotated for disease status, location, and indeterminate nodules needing follow-up. Using generative pre-trained transformer (GPT)-4, GPT-3.5-turbo, and open models such as Gemma-7B and Llama3-8B, we employed strategies such as ablation and prompt engineering to boost accuracy. Discrepancies between human and model interpretations were reviewed by a secondary oncologist. RESULTS Among 164 patients with pancreatic tumor, GPT-4 showed the highest accuracy in inferring disease status, achieving a 75.5% correctness (F1-micro). Open models Mistral-7B and Llama3-8B performed comparably, with accuracies of 68.6% and 61.4%, respectively. Mistral-7B excelled in deriving correct inferences from objective findings directly. Most tested models demonstrated proficiency in identifying disease containing anatomic locations from a list of choices, with GPT-4 and Llama3-8B showing near-parity in precision and recall for disease site identification. However, open models struggled with differentiating benign from malignant postsurgical changes, affecting their precision in identifying findings indeterminate for cancer. A secondary review occasionally favored GPT-3.5's interpretations, indicating the variability in human judgment. CONCLUSION LLMs, especially GPT-4, are proficient in deriving oncologic insights from radiology reports. Their performance is enhanced by effective summarization strategies, demonstrating their potential in clinical support and health care analytics. This study also underscores the possibility of zero-shot open model utility in environments where proprietary models are restricted. Finally, by providing a set of annotated radiology reports, this paper presents a valuable data set for further LLM research in oncology.
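The headline figure for disease-status inference is a micro-averaged F1 score (75.5% for GPT-4). As a small illustration of how that metric is computed once model outputs have been mapped to the annotation labels, the sketch below uses scikit-learn on a handful of invented status labels; note that for single-label, multi-class data the micro-averaged F1 reduces to overall accuracy.

```python
from sklearn.metrics import f1_score

# Invented three-class disease-status labels; not the study's annotations.
statuses = ["progression", "stable", "response"]
human = ["progression", "stable", "response", "stable", "progression", "response"]
model = ["progression", "stable", "stable",   "stable", "response",    "response"]

micro_f1 = f1_score(human, model, labels=statuses, average="micro")
macro_f1 = f1_score(human, model, labels=statuses, average="macro")
print(f"F1-micro={micro_f1:.3f}, F1-macro={macro_f1:.3f}")
```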
Collapse
Affiliation(s)
- Li-Ching Chen
- University of California, Berkeley, Berkeley, CA
- University of California, San Francisco, San Francisco, CA
| | - Travis Zack
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA
| | - Arda Demirci
- University of California, Berkeley, Berkeley, CA
| | - Madhumita Sushil
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA
| | - Brenda Miao
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA
| | - Corynn Kasap
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA
| | - Atul Butte
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA
| | - Eric A Collisson
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA
| | - Julian C Hong
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA
| |
Collapse
|
38
|
Kalidindi S, Baradwaj J. Advancing radiology with GPT-4: Innovations in clinical applications, patient engagement, research, and learning. Eur J Radiol Open 2024; 13:100589. [PMID: 39170856 PMCID: PMC11337693 DOI: 10.1016/j.ejro.2024.100589] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2024] [Revised: 06/30/2024] [Accepted: 07/08/2024] [Indexed: 08/23/2024] Open
Abstract
The rapid evolution of artificial intelligence (AI) in healthcare, particularly in radiology, underscores a transformative era marked by a potential for enhanced diagnostic precision, increased patient engagement, and streamlined clinical workflows. Amongst the key developments at the heart of this transformation are Large Language Models like the Generative Pre-trained Transformer 4 (GPT-4), whose integration into radiological practices could potentially herald a significant leap by assisting in the generation and summarization of radiology reports, aiding in differential diagnoses, and recommending evidence-based treatments. This review delves into the multifaceted potential applications of Large Language Models within radiology, using GPT-4 as an example, from improving diagnostic accuracy and reporting efficiency to translating complex medical findings into patient-friendly summaries. The review acknowledges the ethical, privacy, and technical challenges inherent in deploying AI technologies, emphasizing the importance of careful oversight, validation, and adherence to regulatory standards. Through a balanced discourse on the potential and pitfalls of GPT-4 in radiology, the article aims to provide a comprehensive overview of how these models have the potential to reshape the future of radiological services, fostering improvements in patient care, educational methodologies, and clinical research.
Collapse
|
39
|
Silbergleit M, Tóth A, Chamberlin JH, Hamouda M, Baruah D, Derrick S, Schoepf UJ, Burt JR, Kabakus IM. ChatGPT vs Gemini: Comparative Accuracy and Efficiency in CAD-RADS Score Assignment from Radiology Reports. JOURNAL OF IMAGING INFORMATICS IN MEDICINE 2024:10.1007/s10278-024-01328-y. [PMID: 39528887 DOI: 10.1007/s10278-024-01328-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/11/2024] [Revised: 10/29/2024] [Accepted: 10/30/2024] [Indexed: 11/16/2024]
Abstract
This study aimed to evaluate the accuracy and efficiency of ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced in generating CAD-RADS scores based on radiology reports. This retrospective study analyzed 100 consecutive coronary computed tomography angiography reports performed between March 15, 2024, and April 1, 2024, at a single tertiary center. Each report containing a radiologist-assigned CAD-RADS score was processed using four large language models (LLMs) without fine-tuning. The findings section of each report was input into the LLMs, and the models were tasked with generating CAD-RADS scores. The accuracy of LLM-generated scores was compared to the radiologist's score. Additionally, the time taken by each model to complete the task was recorded. Statistical analyses included Mann-Whitney U test and interobserver agreement using unweighted Cohen's Kappa and Krippendorff's Alpha. ChatGPT-4o demonstrated the highest accuracy, correctly assigning CAD-RADS scores in 87% of cases (κ = 0.838, α = 0.886), followed by Gemini Advanced with 82.6% accuracy (κ = 0.784, α = 0.897). ChatGPT-3.5, although the fastest (median time = 5 s), was the least accurate (50.5% accuracy, κ = 0.401, α = 0.787). Gemini exhibited a higher failure rate (12%) compared to the other models, with Gemini Advanced slightly improving upon its predecessor. ChatGPT-4o outperformed other LLMs in both accuracy and agreement with radiologist-assigned CAD-RADS scores, though ChatGPT-3.5 was significantly faster. Despite their potential, current publicly available LLMs require further refinement before being deployed for clinical decision-making in CAD-RADS scoring.
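Agreement between the radiologist-assigned and LLM-generated CAD-RADS categories is summarized above with accuracy, unweighted Cohen's kappa, and Krippendorff's alpha. The sketch below shows how those three statistics might be computed in Python on invented category pairs; the `krippendorff` package is an assumed third-party dependency, and the data are illustrative only.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
import krippendorff  # third-party package: pip install krippendorff

# Invented CAD-RADS categories (0-5) for ten reports; not the study's data.
radiologist = [0, 1, 2, 3, 4, 5, 2, 3, 1, 0]
llm         = [0, 1, 2, 3, 4, 5, 3, 3, 1, 1]

accuracy = float(np.mean(np.array(radiologist) == np.array(llm)))
kappa = cohen_kappa_score(radiologist, llm)  # unweighted, as in the study
alpha = krippendorff.alpha(reliability_data=[radiologist, llm],
                           level_of_measurement="nominal")

print(f"accuracy={accuracy:.2f}, kappa={kappa:.3f}, alpha={alpha:.3f}")
```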
Collapse
Affiliation(s)
- Matthew Silbergleit
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Adrienn Tóth
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Jordan H Chamberlin
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Mohamed Hamouda
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Dhiraj Baruah
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Sydney Derrick
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - U Joseph Schoepf
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA
| | - Jeremy R Burt
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Ismail M Kabakus
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Clinical Science Building, Medical University of South Carolina, 96 Jonathan Lucas Street, Suite 210, MSC 323, Charleston, SC, 29425, USA.
| |
Collapse
|
40
|
Chow JCL, Li K. Ethical Considerations in Human-Centered AI: Advancing Oncology Chatbots Through Large Language Models. JMIR BIOINFORMATICS AND BIOTECHNOLOGY 2024; 5:e64406. [PMID: 39321336 PMCID: PMC11579624 DOI: 10.2196/64406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/16/2024] [Revised: 08/23/2024] [Accepted: 09/23/2024] [Indexed: 09/27/2024]
Abstract
The integration of chatbots in oncology underscores the pressing need for human-centered artificial intelligence (AI) that addresses patient and family concerns with empathy and precision. Human-centered AI emphasizes ethical principles, empathy, and user-centric approaches, ensuring technology aligns with human values and needs. This review critically examines the ethical implications of using large language models (LLMs) like GPT-3 and GPT-4 (OpenAI) in oncology chatbots. It examines how these models replicate human-like language patterns, impacting the design of ethical AI systems. The paper identifies key strategies for ethically developing oncology chatbots, focusing on potential biases arising from extensive datasets and neural networks. Specific datasets, such as those sourced from predominantly Western medical literature and patient interactions, may introduce biases by overrepresenting certain demographic groups. Moreover, the training methodologies of LLMs, including fine-tuning processes, can exacerbate these biases, leading to outputs that may disproportionately favor affluent or Western populations while neglecting marginalized communities. By providing examples of biased outputs in oncology chatbots, the review highlights the ethical challenges LLMs present and the need for mitigation strategies. The study emphasizes integrating human-centric values into AI to mitigate these biases, ultimately advocating for the development of oncology chatbots that are aligned with ethical principles and capable of serving diverse patient populations equitably.
Collapse
Affiliation(s)
- James C L Chow
- Department of Radiation Oncology, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
| | - Kay Li
- Department of English, University of Toronto, Toronto, ON, Canada
| |
Collapse
|
41
|
Tayebi Arasteh S, Siepmann R, Huppertz M, Lotfinia M, Puladi B, Kuhl C, Truhn D, Nebelung S. The Treasure Trove Hidden in Plain Sight: The Utility of GPT-4 in Chest Radiograph Evaluation. Radiology 2024; 313:e233441. [PMID: 39530893 DOI: 10.1148/radiol.233441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2024]
Abstract
Background Limited statistical knowledge can slow critical engagement with and adoption of artificial intelligence (AI) tools for radiologists. Large language models (LLMs) such as OpenAI's GPT-4, and notably its Advanced Data Analysis (ADA) extension, may improve the adoption of AI in radiology. Purpose To validate GPT-4 ADA outputs when autonomously conducting analyses of varying complexity on a multisource clinical dataset. Materials and Methods In this retrospective study, unique itemized radiologic reports of bedside chest radiographs, associated demographic data, and laboratory markers of inflammation from patients in intensive care from January 2009 to December 2019 were evaluated. GPT-4 ADA, accessed between December 2023 and January 2024, was tasked with autonomously analyzing this dataset by plotting radiography usage rates, providing descriptive statistics measures, quantifying factors of pulmonary opacities, and setting up machine learning (ML) models to predict their presence. Three scientists with 6-10 years of ML experience validated the outputs by verifying the methodology, assessing coding quality, re-executing the provided code, and comparing ML models head-to-head with their human-developed counterparts (based on the area under the receiver operating characteristic curve [AUC], accuracy, sensitivity, and specificity). Statistical significance was evaluated using bootstrapping. Results A total of 43 788 radiograph reports, with their laboratory values, from University Hospital RWTH Aachen were evaluated from 43 788 patients (mean age, 66 years ± 15 [SD]; 26 804 male). While GPT-4 ADA provided largely appropriate visualizations, descriptive statistical measures, quantitative statistical associations based on logistic regression, and gradient boosting machines for the predictive task (AUC, 0.75), some statistical errors and inaccuracies were encountered. ML strategies were valid and based on consistent coding routines, resulting in valid outputs on par with human specialist-developed reference models (AUC, 0.80 [95% CI: 0.80, 0.81] vs 0.80 [95% CI: 0.80, 0.81]; P = .51) (accuracy, 79% [6910 of 8758 patients] vs 78% [6875 of 8758 patients], respectively; P = .27). Conclusion LLMs may facilitate data analysis in radiology, from basic statistics to advanced ML-based predictive modeling. © RSNA, 2024 Supplemental material is available for this article.
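The predictive task described above (a gradient boosting machine for pulmonary opacities, benchmarked against a human-developed reference model via AUC and bootstrapping) follows a standard scikit-learn pattern. The sketch below illustrates that pattern on synthetic stand-in data; the feature set, effect sizes, and labels are invented, since the multisource ICU dataset is not public, and the bootstrap CI is only one reasonable way to mirror the comparison strategy.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-ins for demographics plus inflammation markers and the opacity label.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.normal(66, 15, n),      # age (years)
    rng.integers(0, 2, n),      # sex
    rng.gamma(2.0, 40.0, n),    # C-reactive protein
    rng.normal(10, 4, n),       # leukocyte count
])
logit = -4 + 0.02 * X[:, 0] + 0.01 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # 1 = pulmonary opacity reported

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, probs)

# Bootstrap 95% CI for the AUC on the held-out set.
boot = []
for _ in range(1000):
    idx = rng.integers(0, len(y_te), len(y_te))
    if len(np.unique(y_te[idx])) == 2:  # skip degenerate resamples
        boot.append(roc_auc_score(y_te[idx], probs[idx]))
ci = np.percentile(boot, [2.5, 97.5])
print(f"AUC={auc:.2f} (95% CI {ci[0]:.2f}-{ci[1]:.2f})")
```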
Collapse
Affiliation(s)
- Soroosh Tayebi Arasteh
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Robert Siepmann
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Marc Huppertz
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Mahshad Lotfinia
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Behrus Puladi
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Christiane Kuhl
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Daniel Truhn
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| | - Sven Nebelung
- From the Department of Diagnostic and Interventional Radiology (S.T.A., R.S., M.H., M.L., C.K., D.T., S.N.), Department of Oral and Maxillofacial Surgery (B.P.), and Institute of Medical Informatics (B.P.), University Hospital RWTH Aachen, Pauwelsstr 30, 52074 Aachen, Germany; Pattern Recognition Laboratory, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany (S.T.A.); and Institute of Heat and Mass Transfer, RWTH Aachen University, Aachen, Germany (M.L.)
| |
Collapse
|
42
|
Voinea ȘV, Mămuleanu M, Teică RV, Florescu LM, Selișteanu D, Gheonea IA. GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3. Bioengineering (Basel) 2024; 11:1043. [PMID: 39451418 PMCID: PMC11504957 DOI: 10.3390/bioengineering11101043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Revised: 10/05/2024] [Accepted: 10/16/2024] [Indexed: 10/26/2024] Open
Abstract
The integration of deep learning into radiology has the potential to enhance diagnostic processes, yet its acceptance in clinical practice remains limited due to various challenges. This study aimed to develop and evaluate a fine-tuned large language model (LLM), based on Llama 3-8B, to automate the generation of accurate and concise conclusions in magnetic resonance imaging (MRI) and computed tomography (CT) radiology reports, thereby assisting radiologists and improving reporting efficiency. A dataset comprising 15,000 radiology reports was collected from the University of Medicine and Pharmacy of Craiova's Imaging Center, covering a diverse range of MRI and CT examinations made by four experienced radiologists. The Llama 3-8B model was fine-tuned using transfer-learning techniques, incorporating parameter quantization to 4-bit precision and low-rank adaptation (LoRA) with a rank of 16 to optimize computational efficiency on consumer-grade GPUs. The model was trained over five epochs using an NVIDIA RTX 3090 GPU, with intermediary checkpoints saved for monitoring. Performance was evaluated quantitatively using Bidirectional Encoder Representations from Transformers Score (BERTScore), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics on a held-out test set. Additionally, a qualitative assessment was conducted, involving 13 independent radiologists who participated in a Turing-like test and provided ratings for the AI-generated conclusions. The fine-tuned model demonstrated strong quantitative performance, achieving a BERTScore F1 of 0.8054, a ROUGE-1 F1 of 0.4998, a ROUGE-L F1 of 0.4628, and a METEOR score of 0.4282. In the human evaluation, the artificial intelligence (AI)-generated conclusions were preferred over human-written ones in approximately 21.8% of cases, indicating that the model's outputs were competitive with those of experienced radiologists. The average rating of the AI-generated conclusions was 3.65 out of 5, reflecting a generally favorable assessment. Notably, the model maintained its consistency across various types of reports and demonstrated the ability to generalize to unseen data. The fine-tuned Llama 3-8B model effectively generates accurate and coherent conclusions for MRI and CT radiology reports. By automating the conclusion-writing process, this approach can assist radiologists in reducing their workload and enhancing report consistency, potentially addressing some barriers to the adoption of deep learning in clinical practice. The positive evaluations from independent radiologists underscore the model's potential utility. While the model demonstrated strong performance, limitations such as dataset bias, limited sample diversity, a lack of clinical judgment, and the need for large computational resources require further refinement and real-world validation. Future work should explore the integration of such models into clinical workflows, address ethical and legal considerations, and extend this approach to generate complete radiology reports.
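The abstract specifies 4-bit quantization and LoRA with rank 16 for fine-tuning Llama 3-8B on consumer-grade GPUs. The sketch below shows how such a configuration is typically set up with the Hugging Face transformers and peft libraries; the alpha, dropout, and target-module choices are illustrative assumptions, and the training loop, dataset preparation, and epoch schedule from the study are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"  # gated checkpoint; access must be granted

# 4-bit quantization as described in the abstract (exact settings are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA with rank 16, per the abstract; alpha, dropout, and target modules are illustrative.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trained
```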
Collapse
Affiliation(s)
- Ștefan-Vlad Voinea
- Department of Automatic Control and Electronics, University of Craiova, 200585 Craiova, Romania; (Ș.-V.V.); (M.M.)
| | - Mădălin Mămuleanu
- Department of Automatic Control and Electronics, University of Craiova, 200585 Craiova, Romania; (Ș.-V.V.); (M.M.)
| | - Rossy Vlăduț Teică
- Doctoral School, University of Medicine and Pharmacy of Craiova, 200349 Craiova, Romania;
| | - Lucian Mihai Florescu
- Department of Radiology and Medical Imaging, University of Medicine and Pharmacy of Craiova, 200349 Craiova, Romania; (L.M.F.); (I.A.G.)
| | - Dan Selișteanu
- Department of Automatic Control and Electronics, University of Craiova, 200585 Craiova, Romania; (Ș.-V.V.); (M.M.)
| | - Ioana Andreea Gheonea
- Department of Radiology and Medical Imaging, University of Medicine and Pharmacy of Craiova, 200349 Craiova, Romania; (L.M.F.); (I.A.G.)
| |
Collapse
|
43
|
Künzle P, Paris S. Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments. Clin Oral Investig 2024; 28:575. [PMID: 39373739 PMCID: PMC11458639 DOI: 10.1007/s00784-024-05968-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Accepted: 09/24/2024] [Indexed: 10/08/2024]
Abstract
OBJECTIVES The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs on solving restorative dentistry and endodontics (RDE) student assessment questions. MATERIALS AND METHODS 151 questions from an RDE question pool were prepared for prompting with LLMAs from OpenAI (ChatGPT-3.5, -4.0, and -4.0o) and Google (Gemini 1.0). Multiple-choice questions were sorted into four subcategories, entered into the LLMAs, and the answers were recorded for analysis. Chi-square tests and associated p-values were computed using Python 3.9.16. RESULTS The total answer accuracy of ChatGPT-4.0o was the highest, followed by ChatGPT-4.0, Gemini 1.0, and ChatGPT-3.5 (72%, 62%, 44%, and 25%, respectively), with significant differences between all LLMAs except the GPT-4.0 models. Performance was highest on the direct restorations and caries subcategories, followed by indirect restorations and endodontics. CONCLUSIONS Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved an accuracy high enough to support the dental academic curriculum, and even then only with caution. CLINICAL RELEVANCE While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the employed model. The best-performing model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.
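The between-model comparisons above rest on pairwise chi-square tests of answer accuracy, computed in Python according to the abstract. The sketch below reconstructs approximate correct/incorrect counts from the reported percentages on 151 questions (rounded, so they may differ slightly from the paper's exact counts) and runs the same kind of test.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Correct/incorrect counts reconstructed from the reported percentages on 151 questions
# (rounded here for illustration; the paper's exact counts may differ slightly).
counts = {
    "ChatGPT-4.0o": [109, 42],   # ~72% correct
    "ChatGPT-4.0":  [94, 57],    # ~62%
    "Gemini 1.0":   [66, 85],    # ~44%
    "ChatGPT-3.5":  [38, 113],   # ~25%
}

def compare(model_a: str, model_b: str) -> float:
    """Pairwise chi-square test of answer accuracy between two LLMAs."""
    table = np.array([counts[model_a], counts[model_b]])
    _, p, _, _ = chi2_contingency(table, correction=False)
    return p

print("ChatGPT-4.0o vs ChatGPT-4.0:", round(compare("ChatGPT-4.0o", "ChatGPT-4.0"), 3))
print("ChatGPT-4.0o vs Gemini 1.0:", round(compare("ChatGPT-4.0o", "Gemini 1.0"), 6))
```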
Collapse
Affiliation(s)
- Paul Künzle
- Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany.
| | - Sebastian Paris
- Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany
| |
Collapse
|
44
|
Çamur E, Cesur T, Güneş YC. Can large language models be new supportive tools in coronary computed tomography angiography reporting? Clin Imaging 2024; 114:110271. [PMID: 39236553 DOI: 10.1016/j.clinimag.2024.110271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2024] [Accepted: 08/24/2024] [Indexed: 09/07/2024]
Abstract
The advent of large language models (LLMs) marks a transformative leap in natural language processing, offering unprecedented potential in radiology, particularly in enhancing the accuracy and efficiency of coronary artery disease (CAD) diagnosis. While previous studies have explored the capabilities of specific LLMs like ChatGPT in cardiac imaging, a comprehensive evaluation comparing multiple LLMs in the context of CAD-RADS 2.0 has been lacking. This study addresses this gap by assessing the performance of various LLMs, including ChatGPT 4, ChatGPT 4o, Claude 3 Opus, Gemini 1.5 Pro, Mistral Large, Meta Llama 3 70B, and Perplexity Pro, in answering 30 multiple-choice questions derived from the CAD-RADS 2.0 guidelines. Our findings reveal that ChatGPT 4o achieved the highest accuracy at 100 %, with ChatGPT 4 and Claude 3 Opus closely following at 96.6 %. Other models, including Mistral Large, Perplexity Pro, Meta Llama 3 70B, and Gemini 1.5 Pro, also demonstrated commendable performance, though with slightly lower accuracy ranging from 90 % to 93.3 %. This study underscores the proficiency of current LLMs in understanding and applying CAD-RADS 2.0, suggesting their potential to significantly enhance radiological reporting and patient care in coronary artery disease. The variations in model performance highlight the need for further research, particularly in evaluating the visual diagnostic capabilities of LLMs-a critical component of radiology practice. This study provides a foundational comparison of LLMs in CAD-RADS 2.0 and sets the stage for future investigations into their broader applications in radiology, emphasizing the importance of integrating both text-based and visual knowledge for optimal clinical outcomes.
Collapse
Affiliation(s)
- Eren Çamur
- Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Ankara, Türkiye.
| | - Turay Cesur
- Department of Radiology, Ankara Mamak State Hospital, Ankara, Türkiye
| | - Yasin Celal Güneş
- Department of Radiology, TC Saglik Bakanligi Kirikkale Yuksek Ihtisas Hastanesi, Kırıkkale, Türkiye
| |
Collapse
|
45
|
Siepmann R, Huppertz M, Rastkhiz A, Reen M, Corban E, Schmidt C, Wilke S, Schad P, Yüksel C, Kuhl C, Truhn D, Nebelung S. The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation. Eur Radiol 2024; 34:6652-6666. [PMID: 38627289 PMCID: PMC11399201 DOI: 10.1007/s00330-024-10727-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 02/27/2024] [Accepted: 03/08/2024] [Indexed: 04/20/2024]
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in radiology, but their ability to aid radiologists in interpreting imaging studies remains unexplored. We investigated the effects of a state-of-the-art LLM (GPT-4) on the radiologists' diagnostic workflow. MATERIALS AND METHODS In this retrospective study, six radiologists of different experience levels read 40 selected radiographic [n = 10], CT [n = 10], MRI [n = 10], and angiographic [n = 10] studies unassisted (session one) and assisted by GPT-4 (session two). Each imaging study was presented with demographic data, the chief complaint, and associated symptoms, and diagnoses were registered using an online survey tool. The impact of Artificial Intelligence (AI) on diagnostic accuracy, confidence, user experience, input prompts, and generated responses was assessed. False information was registered. Linear mixed-effects models were used to quantify the factors (fixed: experience, modality, AI assistance; random: radiologist) influencing diagnostic accuracy and confidence. RESULTS When assessing whether the correct diagnosis was among the top-3 differential diagnoses, diagnostic accuracy improved slightly from 181/240 (75.4%, unassisted) to 188/240 (78.3%, AI-assisted). Similar improvements were found when only the top differential diagnosis was considered. AI assistance was used in 77.5% of the readings. Three hundred nine prompts were generated, primarily involving differential diagnoses (59.1%) and imaging features of specific conditions (27.5%). Diagnostic confidence was significantly higher when readings were AI-assisted (p < 0.001). Twenty-three responses (7.4%) were classified as hallucinations, while two (0.6%) were misinterpretations. CONCLUSION Integrating GPT-4 in the diagnostic process improved diagnostic accuracy slightly and diagnostic confidence significantly. Potentially harmful hallucinations and misinterpretations call for caution and highlight the need for further safeguarding measures. CLINICAL RELEVANCE STATEMENT Using GPT-4 as a virtual assistant when reading images made six radiologists of different experience levels feel more confident and provide more accurate diagnoses; yet, GPT-4 gave factually incorrect and potentially harmful information in 7.4% of its responses.
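The analysis above uses linear mixed-effects models with fixed effects for experience, modality, and AI assistance and a random effect per radiologist. The sketch below shows how such a model is commonly specified with statsmodels; the long-format data are simulated to match the study layout (6 readers, 40 cases, 2 sessions), and every value, effect size, and variable name beyond those listed in the abstract is invented.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format reading data: 6 radiologists x 40 cases x 2 sessions. Variable
# names mirror the fixed (experience, modality, AI assistance) and random (radiologist)
# effects described above; the values themselves are made up.
rng = np.random.default_rng(1)
rows = []
for reader in range(6):
    experience = ["junior", "senior"][reader % 2]
    for case in range(40):
        modality = ["radiograph", "CT", "MRI", "angiography"][case // 10]
        for assisted in (0, 1):
            confidence = 3 + 0.4 * assisted + 0.2 * (reader % 3) + rng.normal(0, 0.8)
            rows.append(dict(reader=f"R{reader}", experience=experience,
                             modality=modality, assisted=assisted,
                             confidence=confidence))
df = pd.DataFrame(rows)

# Random intercept per radiologist; fixed effects for assistance, experience, modality.
fit = smf.mixedlm("confidence ~ assisted + experience + modality",
                  data=df, groups=df["reader"]).fit()
print(fit.summary())
```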
Collapse
Affiliation(s)
- Robert Siepmann
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Marc Huppertz
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Annika Rastkhiz
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Matthias Reen
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Eric Corban
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Christian Schmidt
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Stephan Wilke
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Philipp Schad
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Can Yüksel
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Christiane Kuhl
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Sven Nebelung
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany.
| |
Collapse
|
46
|
Zaghir J, Naguib M, Bjelogrlic M, Névéol A, Tannier X, Lovis C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J Med Internet Res 2024; 26:e60501. [PMID: 39255030 PMCID: PMC11422740 DOI: 10.2196/60501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2024] [Revised: 07/09/2024] [Accepted: 07/22/2024] [Indexed: 09/11/2024] Open
Abstract
BACKGROUND Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its ability to harness the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and technical language. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in extracting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. OBJECTIVE The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. METHODS Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data items were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering-based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). RESULTS We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. Finally, we individually examine each key item of prompt engineering-specific information reported across papers and find that many studies neglect to mention it explicitly, posing a challenge for advancing prompt engineering research. CONCLUSIONS In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also provide tables and figures summarizing the available medical prompt engineering papers and hope that future contributions will leverage these existing works to better advance the field.
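Chain-of-thought, identified above as the most frequent prompt-design technique, simply asks the model to lay out intermediate reasoning steps before committing to an answer. The snippet below is a minimal, hedged illustration of that pattern for a clinical extraction task; the report text and the prompt wording are invented for this example and are not taken from any of the reviewed studies.

```python
# Minimal illustration of a chain-of-thought prompt-design (PD) pattern for a clinical
# extraction task; the report text and prompt wording are invented for this example.
report = (
    "CT chest: New 9 mm right lower lobe nodule. Known 25 mm left upper lobe mass, "
    "previously 32 mm. No pleural effusion."
)

cot_prompt = f"""You are assisting with oncology response assessment.

Report:
---
{report}
---

Think step by step:
1. List every measurable lesion with its current and, if given, prior size.
2. State whether each lesion is new, larger, smaller, or stable.
3. Only then give the overall impression as one of: progression, stable, response.

End with a line of the form 'Final answer: <label>'."""

print(cot_prompt)
```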
Collapse
Affiliation(s)
- Jamil Zaghir
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| | - Marco Naguib
- Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
| | - Mina Bjelogrlic
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| | - Aurélie Névéol
- Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
| | - Xavier Tannier
- Sorbonne Université, INSERM, Université Sorbonne Paris-Nord, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en eSanté, LIMICS, Paris, France
| | - Christian Lovis
- Division of Medical Information Sciences, Geneva University Hospitals, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
| |
Collapse
|
47
|
Raminpour S, Weisberg EM, Kauffman L, Fishman EK. Websites, mobile apps, and social media: Premier online educational tools for radiology. Clin Imaging 2024; 113:110239. [PMID: 39067224 DOI: 10.1016/j.clinimag.2024.110239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2024] [Revised: 07/18/2024] [Accepted: 07/22/2024] [Indexed: 07/30/2024]
Abstract
Demand for online educational tools has risen steadily as technological innovations have evolved over the past several decades. Websites were the first platform to be introduced, and eventually used for online schooling, soon after the advent of the World Wide Web. Access to information and updated content in a short period of time on a wide-screen device such as a computer made websites popular early in their development. With the technological revolution of smart phones, mobile applications have been developed on various operating systems and, through this progress, a new form of educational platform was initiated. The portable features of mobile applications represent a pioneer era of educational tools for medical professionals. Online communications have transformed into social media over the last decade and have since been adopted by much of the world. All three of these educational platforms have created a significant impact on medical education communities, specifically in radiology. We describe the relative strengths of each platform and illustrate how our experience over more than two decades guides our recommendations.
Collapse
Affiliation(s)
- Sara Raminpour
- Johns Hopkins University School of Medicine, The Russell H. Morgan Department of Radiology and Radiological Science, 601 North Caroline Street, Baltimore, MD 21287, United States of America.
| | - Edmund M Weisberg
- Johns Hopkins University School of Medicine, The Russell H. Morgan Department of Radiology and Radiological Science, 601 North Caroline Street, Baltimore, MD 21287, United States of America.
| | - Lilly Kauffman
- Johns Hopkins University School of Medicine, The Russell H. Morgan Department of Radiology and Radiological Science, 601 North Caroline Street, Baltimore, MD 21287, United States of America.
| | - Elliot K Fishman
- Johns Hopkins Hospital, The Russell H. Morgan Department of Radiology and Radiological Science, 601 North Caroline Street, Baltimore, MD 21287, United States of America.
| |
Collapse
|
48
|
Park SH, Kim N. Challenges and Proposed Additional Considerations for Medical Device Approval of Large Language Models Beyond Conventional AI. Radiology 2024; 312:e241703. [PMID: 39315904 DOI: 10.1148/radiol.241703] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Affiliation(s)
- Seong Ho Park
- From the Department of Radiology and Research Institute of Radiology (S.H.P., N.K.) and Department of Convergence Medicine (N.K.), University of Ulsan College of Medicine, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Republic of Korea
| | - Namkug Kim
- From the Department of Radiology and Research Institute of Radiology (S.H.P., N.K.) and Department of Convergence Medicine (N.K.), University of Ulsan College of Medicine, Asan Medical Center, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Republic of Korea
| |
Collapse
|
49
|
Freyer O, Wiest IC, Kather JN, Gilbert S. A future role for health applications of large language models depends on regulators enforcing safety standards. Lancet Digit Health 2024; 6:e662-e672. [PMID: 39179311 DOI: 10.1016/s2589-7500(24)00124-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Revised: 05/17/2024] [Accepted: 06/06/2024] [Indexed: 08/26/2024]
Abstract
Among the rapid integration of artificial intelligence in clinical settings, large language models (LLMs), such as Generative Pre-trained Transformer-4, have emerged as multifaceted tools that have potential for health-care delivery, diagnosis, and patient care. However, deployment of LLMs raises substantial regulatory and safety concerns. Due to their high output variability, poor inherent explainability, and the risk of so-called AI hallucinations, LLM-based health-care applications that serve a medical purpose face regulatory challenges for approval as medical devices under US and EU laws, including the recently passed EU Artificial Intelligence Act. Despite unaddressed risks for patients, including misdiagnosis and unverified medical advice, such applications are available on the market. The regulatory ambiguity surrounding these tools creates an urgent need for frameworks that accommodate their unique capabilities and limitations. Alongside the development of these frameworks, existing regulations should be enforced. If regulators fear enforcing the regulations in a market dominated by supply or development by large technology companies, the consequences of layperson harm will force belated action, damaging the potentiality of LLM-based applications for layperson medical advice.
Collapse
Affiliation(s)
- Oscar Freyer
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany
| | - Isabella Catharina Wiest
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany; Department of Medicine, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
| | - Jakob Nikolas Kather
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany; Department of Medicine, University Hospital Dresden, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany
| | - Stephen Gilbert
- Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany.
| |
Collapse
|
50
|
Çamur E, Cesur T, Güneş YC. Comparison of the performance of large language models and general radiologist on Ovarian-Adnexal Reporting and Data System (O-RADS)-related questions. Quant Imaging Med Surg 2024; 14:6990-6991. [PMID: 39281125 PMCID: PMC11400705 DOI: 10.21037/qims-24-1142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2024] [Accepted: 06/25/2024] [Indexed: 09/18/2024]
Affiliation(s)
- Eren Çamur
- Department of Radiology, Ministry of Health Ankara 29 Mayis State Hospital, Ankara, Türkiye
| | - Turay Cesur
- Department of Radiology, Ankara Mamak State Hospital, Ankara, Türkiye
| | - Yasin Celal Güneş
- Department of Radiology, Ministry of Health Kirikkale Yuksek Ihtisas Hospital, Kırıkkale, Türkiye
| |
Collapse
|