1. Kreso A, Boban Z, Kabic S, Rada F, Batistic D, Barun I, Znaor L, Kumric M, Bozic J, Vrdoljak J. Using large language models as decision support tools in emergency ophthalmology. Int J Med Inform 2025; 199:105886. [PMID: 40147415] [DOI: 10.1016/j.ijmedinf.2025.105886]
Abstract
BACKGROUND Large language models (LLMs) have shown promise in various medical applications, but their potential as decision support tools in emergency ophthalmology remains unevaluated using real-world cases. OBJECTIVES We assessed the performance of state-of-the-art LLMs (GPT-4, GPT-4o, and Llama-3-70b) as decision support tools in emergency ophthalmology compared to human experts. METHODS In this prospective comparative study, LLM-generated diagnoses and treatment plans were evaluated against those determined by certified ophthalmologists using 73 anonymized emergency cases from the University Hospital of Split. Two independent expert ophthalmologists graded both LLM and human-generated reports using a 4-point Likert scale. RESULTS Human experts achieved a mean score of 3.72 (SD = 0.50), while GPT-4 scored 3.52 (SD = 0.64) and Llama-3-70b scored 3.48 (SD = 0.48). GPT-4o had lower performance with 3.20 (SD = 0.81). Significant differences were found between human and LLM reports (P < 0.001), specifically between human scores and GPT-4o. GPT-4 and Llama-3-70b showed performance comparable to ophthalmologists, with no statistically significant differences. CONCLUSION Large language models demonstrated accuracy as decision support tools in emergency ophthalmology, with performance comparable to human experts, suggesting potential for integration into clinical practice.
Affiliation(s)
- Ante Kreso
  - University Hospital Split, Department for Ophthalmology, Croatia
- Zvonimir Boban
  - University of Split School of Medicine, Department for Medical Physics, Croatia
- Sime Kabic
  - University Hospital Split, Department for Ophthalmology, Croatia
- Filip Rada
  - University Hospital Split, Department for Ophthalmology, Croatia
- Darko Batistic
  - University Hospital Split, Department for Ophthalmology, Croatia
- Ivana Barun
  - University Hospital Split, Department for Ophthalmology, Croatia
- Ljubo Znaor
  - University Hospital Split, Department for Ophthalmology, Croatia
- Marko Kumric
  - University of Split School of Medicine, Department for Pathophysiology, Croatia
- Josko Bozic
  - University of Split School of Medicine, Department for Pathophysiology, Croatia
- Josip Vrdoljak
  - University of Split School of Medicine, Department for Pathophysiology, Croatia
2. Pushpanathan K, Zou M, Srinivasan S, Wong WM, Mangunkusumo EA, Thomas GN, Lai Y, Sun CH, Lam JSH, Tan MCJ, Lin HAH, Ma W, Koh VTC, Chen DZ, Tham YC. Can OpenAI's New o1 Model Outperform Its Predecessors in Common Eye Care Queries? Ophthalmology Science 2025; 5:100745. [PMID: 40291392] [PMCID: PMC12022690] [DOI: 10.1016/j.xops.2025.100745]
Abstract
Objective The newly launched OpenAI o1 is said to offer improved reasoning, potentially providing higher quality responses to eye care queries. However, its performance remains unassessed. We evaluated the performance of o1, ChatGPT-4o, and ChatGPT-4 in addressing ophthalmic-related queries, focusing on correctness, completeness, and readability. Design Cross-sectional study. Subjects Sixteen queries, previously identified as suboptimally responded to by ChatGPT-4 from prior studies, were used, covering 3 subtopics: myopia (6 questions), ocular symptoms (4 questions), and retinal conditions (6 questions). Methods For each subtopic, 3 attending-level ophthalmologists, masked to the model sources, evaluated the responses based on correctness, completeness, and readability (on a 5-point scale for each metric). Main Outcome Measures Mean summed scores of each model for correctness, completeness, and readability, rated on a 5-point scale (maximum score: 15). Results O1 scored highest in correctness (12.6) and readability (14.2), outperforming ChatGPT-4, which scored 10.3 (P = 0.010) and 12.4 (P < 0.001), respectively. No significant difference was found between o1 and ChatGPT-4o. When stratified by subtopics, o1 consistently demonstrated superior correctness and readability. In completeness, ChatGPT-4o achieved the highest score of 12.4, followed by o1 (10.8), though the difference was not statistically significant. o1 showed notable limitations in completeness for ocular symptom queries, scoring 5.5 out of 15. Conclusions While o1 is marketed as offering improved reasoning capabilities, its performance in addressing eye care queries does not significantly differ from its predecessor, ChatGPT-4o. Nevertheless, it surpasses ChatGPT-4, particularly in correctness and readability. Financial Disclosures Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
Affiliation(s)
- Krithi Pushpanathan
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Minjie Zou
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Sahana Srinivasan
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
- Wendy Meihua Wong
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Erlangga Ariadarma Mangunkusumo
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- George Naveen Thomas
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Yien Lai
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Chen-Hsin Sun
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Janice Sing Harn Lam
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Marcus Chun Jin Tan
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Hazel Anne Hui'En Lin
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Weizhi Ma
  - Institute for AI Industry Research, Tsinghua University, Beijing, China
- Victor Teck Chang Koh
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- David Ziyou Chen
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Department of Ophthalmology, National University Hospital, Singapore
- Yih-Chung Tham
  - Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
  - Centre for Innovation and Precision Eye Health, Yong Loo Lin School of Medicine, National University of Singapore and National University Health System, Singapore
  - Singapore Eye Research Institute, Singapore National Eye Centre, Singapore
  - Eye Academic Clinical Program (Eye ACP), Duke NUS Medical School, Singapore
3. Carl N, Haggenmüller S, Wies C, Nguyen L, Winterstein JT, Hetz MJ, Mangold MH, Hartung FO, Grüne B, Holland‐Letz T, Michel MS, Brinker TJ, Wessels F. Evaluating interactions of patients with large language models for medical information. BJU Int 2025; 135:1010-1017. [PMID: 39967059] [PMCID: PMC12053131] [DOI: 10.1111/bju.16676]
Abstract
OBJECTIVES To explore the interaction of real-world patients with a chatbot in a clinical setting, investigating key aspects of medical information provided by large language models (LLMs). PATIENTS AND METHODS The study enrolled 300 patients seeking urological counselling between February and July 2024. First, participants voluntarily conversed with a Generative Pre-trained Transformer 4 (GPT-4) powered chatbot to ask questions related to their medical situation. In the following survey, patients rated the perceived utility, completeness, and understandability of the information provided during the simulated conversation, as well as its user-friendliness. Finally, patients were asked which, in their experience, best answered their questions: LLMs, urologists, or search engines. RESULTS A total of 292 patients completed the study. The majority of patients perceived the chatbot as providing useful, complete, and understandable information, as well as being user-friendly. However, the ability of human urologists to answer medical questions in an understandable way was rated higher than that of LLMs. Interestingly, 53% of participants rated the question-answering ability of LLMs higher than that of search engines. Age was not associated with preferences. Limitations include social desirability and sampling biases. DISCUSSION This study highlights the potential of LLMs to enhance patient education and communication in clinical settings, with patients valuing their user-friendliness and comprehensiveness for medical information. By addressing preliminary questions, LLMs could potentially relieve time constraints on healthcare providers, enabling medical personnel to focus on complex inquiries and patient care.
Affiliation(s)
- Nicolas Carl
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Sarah Haggenmüller
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Christoph Wies
  - Medical Faculty, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Lisa Nguyen
  - Medical Faculty Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
- Jana Theres Winterstein
  - Medical Faculty, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Martin Joachim Hetz
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Maurin Helen Mangold
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
- Friedrich Otto Hartung
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
- Britta Grüne
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
- Tim Holland‐Letz
  - Department of Biostatistics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Maurice Stephan Michel
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
- Titus Josef Brinker
  - Division of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Frederik Wessels
  - Department of Urology, University Medical Center Mannheim, Ruprecht‐Karls University of Heidelberg, Mannheim, Germany
4. Shi J, Xia X, Zhuang H, Li Z, Xu K. Empowering individuals to adopt artificial intelligence for health information seeking: A latent profile analysis among users in Hong Kong. Soc Sci Med 2025; 375:118059. [PMID: 40253978] [DOI: 10.1016/j.socscimed.2025.118059]
Abstract
RATIONALE Using AI for health information seeking is a novel behavior, and as such, developing effective communication strategies to optimize AI adoption in this area presents challenges. To lay the groundwork, research is needed to map out users' behavioral underpinnings regarding AI use, as understanding users' needs, concerns, and perspectives could inform the design of targeted and effective communication strategies in this context. OBJECTIVE Guided by the planned risk information seeking model and the comprehensive model of information seeking, our study examines how socio-psychological factors (i.e., attitudes, perceived descriptive and injunctive norms, self-efficacy, technological anxiety) and factors related to information carriers (i.e., trust in and perceived accuracy of AI) shape users' latent profiles. In addition, we explore how individual differences in demographic attributes and anthropocentrism predict membership in these user profiles. METHODS We conducted a quota-sampled survey of 1051 AI-experienced users in Hong Kong. Latent profile analysis was used to examine users' profile patterns. Hierarchical multiple logistic regression was employed to examine how individual differences predict membership in these user profiles. RESULTS The latent profile analysis revealed five heterogeneous profiles, which we labeled "Discreet Approachers," "Casual Investigators," "Apprehensive Moderates," "Apathetic Bystanders," and "Anxious Explorers." Each profile was associated with specific predictors related to individual differences in demographic attributes and/or aspects of anthropocentrism. CONCLUSION The findings advance theoretical understanding of using AI for health information seeking, provide theory-driven strategies to empower users to make well-informed decisions, and offer insights to optimize the adoption of AI technology.
Affiliation(s)
- Jingyuan Shi
  - Department of Interactive Media, Hong Kong Baptist University, Hong Kong Special Administrative Region of China
- Xiaoyu Xia
  - School of Communication, Hong Kong Baptist University, Hong Kong Special Administrative Region of China
- Huijun Zhuang
  - School of Communication, Hong Kong Baptist University, Hong Kong Special Administrative Region of China
- Zixi Li
  - School of Communication, Hong Kong Baptist University, Hong Kong Special Administrative Region of China
- Kun Xu
  - Department of Media Production, Management, and Technology, University of Florida, United States
5. Roch FE, Hahn FM, Jäckle K, Meier MP, Stinus H, Lehmann W, Perthel R, Roch PJ. Diagnosis, treatment, and prevention of ankle sprains: Comparing free chatbot recommendations with clinical guidelines. Foot Ankle Surg 2025; 31:329-351. [PMID: 39730224] [DOI: 10.1016/j.fas.2024.12.003]
Abstract
BACKGROUND Free chatbots powered by large language models offer treatment recommendations for lateral ankle sprains (LAS) but lack scientific validation. METHODS Three chatbots (Claude, Perplexity, and ChatGPT) were evaluated by comparing their responses to a questionnaire and their treatment algorithms against current clinical guidelines. Responses were graded on accuracy, conclusiveness, supplementary information, and incompleteness, and evaluated individually and collectively, with a 60% pass threshold. RESULTS In the collective analysis of the questionnaire, Perplexity scored significantly higher than Claude and ChatGPT (p < 0.001). In the individual analysis, Perplexity provided significantly more supplementary information than the other chatbots (p < 0.001). All chatbots met the pass threshold. In the algorithm evaluation, ChatGPT scored significantly higher than the others (p = 0.023), with Perplexity falling below the pass threshold. CONCLUSIONS The chatbots' recommendations generally aligned with current guidelines but sometimes missed crucial details. While they offer useful supplementary information, they cannot yet replace professional medical consultation or established guidelines.
Affiliation(s)
- Friederike Eva Roch
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Franziska Melanie Hahn
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Katharina Jäckle
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Marc-Pascal Meier
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Hartmut Stinus
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Wolfgang Lehmann
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Ronny Perthel
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
- Paul Jonathan Roch
  - Department of Trauma Surgery, Orthopaedics and Plastic Surgery, University of Göttingen, Robert-Koch-Str. 40, Göttingen 37075, Germany
6. An K. Navigating the Future: Opportunities and Challenges of Generative AI in Nursing Research. Res Nurs Health 2025; 48:299-300. [PMID: 40317754] [DOI: 10.1002/nur.22464]
Affiliation(s)
- Kyungeh An
  - Georgia State University, Atlanta, Georgia, USA
7. Asgari E, Montaña-Brown N, Dubois M, Khalil S, Balloch J, Yeung JA, Pimenta D. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. NPJ Digit Med 2025; 8:274. [PMID: 40360677] [DOI: 10.1038/s41746-025-01670-7]
Abstract
Integrating large language models (LLMs) into healthcare can enhance workflow efficiency and patient care by automating tasks such as summarising consultations. However, the fidelity between LLM outputs and ground truth information is vital to prevent miscommunication that could lead to compromise in patient safety. We propose a framework comprising (1) an error taxonomy for classifying LLM outputs, (2) an experimental structure for iterative comparisons in our LLM document generation pipeline, (3) a clinical safety framework to evaluate the harms of errors, and (4) a graphical user interface, CREOLA, to facilitate these processes. Our clinical error metrics were derived from 18 experimental configurations involving LLMs for clinical note generation, consisting of 12,999 clinician-annotated sentences. We observed a 1.47% hallucination rate and a 3.45% omission rate. By refining prompts and workflows, we successfully reduced major errors below previously reported human note-taking rates, highlighting the framework's potential for safer clinical documentation.
Affiliation(s)
- Elham Asgari
  - Tortus AI, London, UK
  - Guy's and St Thomas' NHS Trust, London, UK
8. Cornelius J, Knitza J, Hack J, Pavlovic M, Kuhn S. [Potential applications of large language models in trauma surgery: opportunities, risks and perspectives]. Unfallchirurgie (Heidelberg, Germany) 2025. [PMID: 40355629] [DOI: 10.1007/s00113-025-01581-y]
Abstract
The integration of large language models (LLMs) into the care of trauma surgery patients offers an exciting opportunity with immense potential to enhance the efficiency and quality of care. LLMs can serve as supportive tools for diagnosis, decision making, and patient communication by efficiently providing medical knowledge and generating personalized treatment recommendations; however, there are also substantial challenges that must be addressed. The lack of transparency in the decision-making processes of LLMs, as well as currently unresolved legal and ethical issues, necessitates careful implementation and examination by medical professionals to ensure the safety and effectiveness of these technologies.
Affiliation(s)
- Jakob Cornelius
  - Zentrum für Orthopädie und Unfallchirurgie, Universitätsklinikum Gießen und Marburg, Standort Marburg, Philipps-Universität Marburg, Baldingerstraße, 35043 Marburg, Germany
- Johannes Knitza
  - Institut für Digitale Medizin, Universitätsklinikum Marburg, Philipps-Universität Marburg, Marburg, Germany
- Juliana Hack
  - Zentrum für Orthopädie und Unfallchirurgie, Universitätsklinikum Gießen und Marburg, Standort Marburg, Philipps-Universität Marburg, Baldingerstraße, 35043 Marburg, Germany
- Melina Pavlovic
  - Zentrum für Orthopädie und Unfallchirurgie, Universitätsklinikum Gießen und Marburg, Standort Marburg, Philipps-Universität Marburg, Baldingerstraße, 35043 Marburg, Germany
- Sebastian Kuhn
  - Institut für Digitale Medizin, Universitätsklinikum Marburg, Philipps-Universität Marburg, Marburg, Germany
9. Jin Y, Liu J, Li P, Wang B, Yan Y, Zhang H, Ni C, Wang J, Li Y, Bu Y, Wang Y. The Applications of Large Language Models in Mental Health: Scoping Review. J Med Internet Res 2025; 27:e69284. [PMID: 40324177] [DOI: 10.2196/69284]
Abstract
BACKGROUND Mental health is emerging as an increasingly prevalent public issue globally. There is an urgent need in mental health for efficient detection methods, effective treatments, affordable privacy-focused health care solutions, and increased access to specialized psychiatrists. The emergence and rapid development of large language models (LLMs) have shown the potential to address these mental health demands. However, a comprehensive review summarizing the application areas, processes, and performance comparisons of LLMs in mental health has been lacking until now. OBJECTIVE This review aimed to summarize the applications of LLMs in mental health, including trends, application areas, performance comparisons, challenges, and prospective future directions. METHODS A scoping review was conducted to map the landscape of LLMs' applications in mental health, including trends, application areas, comparative performance, and future trajectories. We searched 7 electronic databases, including Web of Science, PubMed, Cochrane Library, IEEE Xplore, Weipu, CNKI, and Wanfang, from January 1, 2019, to August 31, 2024. Studies eligible for inclusion were peer-reviewed articles focused on LLMs' applications in mental health. Studies were excluded if they (1) were not peer-reviewed or did not focus on mental health or mental disorders or (2) did not use LLMs; studies that used only natural language processing or long short-term memory models were also excluded. Relevant information on application details and performance metrics was extracted during the data charting of eligible articles. RESULTS A total of 95 articles were drawn from 4859 studies using LLMs for mental health tasks. The applications were categorized into 3 key areas: screening or detection of mental disorders (67/95, 71%), supporting clinical treatments and interventions (31/95, 33%), and assisting in mental health counseling and education (11/95, 12%). Most studies used LLMs for depression detection and classification (33/95, 35%), clinical treatment support and intervention (14/95, 15%), and suicide risk prediction (12/95, 13%). Compared with nontransformer models and humans, LLMs demonstrate higher capabilities in information acquisition and analysis and in efficiently generating natural language responses. Various series of LLMs also have different advantages and disadvantages in addressing mental health tasks. CONCLUSIONS This scoping review synthesizes the applications, processes, performance, and challenges of LLMs in the mental health field. These findings highlight the substantial potential of LLMs to augment mental health research, diagnostics, and intervention strategies, underscoring the imperative for ongoing development and ethical deliberation in clinical settings.
Affiliation(s)
- Yu Jin
  - Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Jiayi Liu
  - Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Pan Li
  - Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Baosen Wang
  - Department of Statistics, Faculty of Arts and Sciences, Beijing Normal University, Beijing, China
- Yangxinyu Yan
  - School of Psychology, Center for Studies of Psychological Application, and Guangdong Key Laboratory of Mental Health and Cognitive Science, Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, South China Normal University, Guangzhou, Guangdong, China
- Huilin Zhang
  - School of Statistics, Beijing Normal University, Beijing, China
- Chenhao Ni
  - School of Statistics, Beijing Normal University, Beijing, China
- Jing Wang
  - Faculty of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, Guangdong, China
- Yi Li
  - The People's Hospital of Pingbian County, Honghe, Yunnan, China
- Yajun Bu
  - The People's Hospital of Pingbian County, Honghe, Yunnan, China
- Yuanyuan Wang
  - School of Psychology, Center for Studies of Psychological Application, and Guangdong Key Laboratory of Mental Health and Cognitive Science, Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, South China Normal University, Guangzhou, Guangdong, China
10. Berkstresser AM, Hanchard SEL, Iacaboni D, McMilian K, Duong D, Solomon BD, Waikel RL. Artificial intelligence in clinical genetics: current practice and attitudes among the clinical genetics workforce. medRxiv [Preprint] 2025:2025.04.30.25326673. [PMID: 40343038] [PMCID: PMC12060961] [DOI: 10.1101/2025.04.30.25326673]
Abstract
Purpose Artificial intelligence (AI) applications for clinical genetics hold the potential to improve patient care through supporting diagnostics and management as well as automating administrative tasks, thus enhancing and potentially enabling clinician/patient interactions. While the introduction of AI into clinical genetics is increasing, there remain unclear questions about risks and benefits, and the readiness of the workforce. Methods To assess the current clinical genetics workforce's use, knowledge, and attitudes toward available medical AI applications, we conducted a survey involving 215 US-based genetics clinicians and trainees. Results Over half (51.2%) of participants reported little to no knowledge of AI in clinical genetics and 64.3% reported no formal training in AI applications. Formal training directly correlated with self-reported knowledge of AI in clinical genetics, with 69.3% of respondents with formal training reporting intermediate to extensive knowledge of AI vs. 37.5% without formal training. Most participants reported that they lacked sufficient knowledge of clinical AI (83.4%) and agreed that there should be more education in this area (97.6%) and would take a course if offered (89.3%). The majority (51.6%) of clinician participants said they never used AI applications in the clinic. However, after a tutorial describing clinical AI applications, 75.8% reported some use of AI applications in the clinic. When asked specifically about clinical AI application usage, the majority of clinician participants used facial diagnostic applications (54.9%) and AI-generated genomic testing results (62.1%), whereas other applications such as chatbots, large language models (LLMs), pedigree or medical summary generators, and risk assessment were used by only a fraction of the clinicians, ranging from 11.1% to 12.5%. Nearly all participants (94.6%) reported clinical genetics professionals as being overburdened. Conclusion Further clinician education is both desired and needed to optimally utilize clinical AI applications with the potential to enhance patient care and alleviate the current strain on genetics clinics.
Collapse
Affiliation(s)
- Amanda M Berkstresser
- Genetic Counseling Program, School of Health & Natural Sciences, Bay Path University, Longmeadow, Massachusetts, United States of America
| | | | - Daniela Iacaboni
- Genetic Counseling Program, School of Health & Natural Sciences, Bay Path University, Longmeadow, Massachusetts, United States of America
| | - Kevin McMilian
- Cumoratek Consulting, Kansas City, Missouri, United States of America
| | - Dat Duong
- Medical Genetics Branch, National Human Genome Research Institute, Bethesda, Maryland, United States of America
| | - Benjamin D. Solomon
- Medical Genetics Branch, National Human Genome Research Institute, Bethesda, Maryland, United States of America
| | - Rebekah L. Waikel
- Medical Genetics Branch, National Human Genome Research Institute, Bethesda, Maryland, United States of America
| |
Collapse
|
11
|
Ngoc Nguyen O, Amin D, Bennett J, Hetlevik Ø, Malik S, Tout A, Vornhagen H, Vellinga A. GP or ChatGPT? Ability of large language models (LLMs) to support general practitioners when prescribing antibiotics. J Antimicrob Chemother 2025; 80:1324-1330. [PMID: 40079276 PMCID: PMC12046391 DOI: 10.1093/jac/dkaf077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2024] [Accepted: 02/28/2025] [Indexed: 03/15/2025] Open
Abstract
INTRODUCTION Large language models (LLMs) are becoming ubiquitous and widely implemented. LLMs could also be used for diagnosis and treatment. National antibiotic prescribing guidelines are customized and informed by local laboratory data on antimicrobial resistance. METHODS Based on 24 vignettes with information on type of infection, gender, age group and comorbidities, GPs and LLMs were prompted to provide a treatment. Four countries (Ireland, UK, USA and Norway) were included and a GP from each country and six LLMs (ChatGPT, Gemini, Copilot, Mistral AI, Claude and Llama 3.1) were provided with the vignettes, including their location (country). Responses were compared with the country's national prescribing guidelines. In addition, limitations of LLMs such as hallucination, toxicity and data leakage were assessed. RESULTS GPs' answers to the vignettes showed high accuracy in relation to diagnosis (96%-100%) and yes/no antibiotic prescribing (83%-92%). GPs referenced (100%) and prescribed (58%-92%) according to national guidelines, but dose/duration of treatment was less accurate (50%-75%). Overall, the GPs' accuracy had a mean of 74%. LLMs scored high in relation to diagnosis (92%-100%), antibiotic prescribing (88%-100%) and the choice of antibiotic (59%-100%) but correct referencing often failed (38%-96%), in particular for the Norwegian guidelines (0%-13%). Data leakage was shown to be an issue as personal information was repeated in the models' responses to the vignettes. CONCLUSIONS LLMs may be safe to guide antibiotic prescribing in general practice. However, to interpret vignettes, apply national guidelines and prescribe the right dose and duration, GPs remain best placed.
Collapse
Affiliation(s)
- Oanh Ngoc Nguyen
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - Doaa Amin
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| | - James Bennett
- NIHR In Practice Fellow, Hull York Medical School, University of Hull, Hull HU6 7RX, UK
| | - Øystein Hetlevik
- Department of Global Public Health and Primary Care, University of Bergen, Bergen, Norway
| | - Sara Malik
- Midleton Medi Center, Midleton, Co Cork, Ireland
| | - Andrew Tout
- Division of General Internal Medicine, Northwestern University Feinberg School of Medicine, Chicago, IL, USA
| | - Heike Vornhagen
- CARA Network, Insight Centre for Data Analytics, University of Galway, Galway, Ireland
| | - Akke Vellinga
- CARA Network, School of Public Health, Physiotherapy and Sports Science, University College Dublin, Dublin, Ireland
| |
Collapse
|
12
|
Singh R, Hamouda M, Chamberlin JH, Tóth A, Munford J, Silbergleit M, Baruah D, Burt JR, Kabakus IM. ChatGPT vs. Gemini: Comparative accuracy and efficiency in Lung-RADS score assignment from radiology reports. Clin Imaging 2025; 121:110455. [PMID: 40090067 DOI: 10.1016/j.clinimag.2025.110455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2024] [Revised: 11/16/2024] [Accepted: 03/10/2025] [Indexed: 03/18/2025]
Abstract
OBJECTIVE To evaluate the accuracy of large language models (LLMs) in generating Lung-RADS scores based on lung cancer screening low-dose computed tomography (LDCT) radiology reports. MATERIAL AND METHODS A retrospective cross-sectional analysis was performed on 242 consecutive LDCT radiology reports generated by cardiothoracic fellowship-trained radiologists at a tertiary center. LLMs evaluated included ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced. Each LLM was used to assign Lung-RADS scores based on the findings section of each report. No domain-specific fine-tuning was applied. Accuracy was determined by comparing the LLM-assigned scores to radiologist-assigned scores. Efficiency was assessed by measuring response times for each LLM. RESULTS ChatGPT-4o achieved the highest accuracy (83.6%) in assigning Lung-RADS scores compared to other models, with ChatGPT-3.5 reaching 70.1%. Gemini and Gemini Advanced had similar accuracy (70.9% and 65.1%, respectively). ChatGPT-3.5 had the fastest response time (median 4 s), while ChatGPT-4o was slower (median 10 s). Higher Lung-RADS categories were associated with marginally longer completion times. ChatGPT-4o demonstrated the greatest agreement with radiologists (κ = 0.836), although this was lower than the previously reported human interobserver agreement. CONCLUSION ChatGPT-4o outperformed ChatGPT-3.5, Gemini, and Gemini Advanced in Lung-RADS score assignment accuracy but did not reach the level of human experts. Despite promising results, further work is needed to integrate domain-specific training and ensure LLM reliability for clinical decision-making in lung cancer screening.
Collapse
Affiliation(s)
- Ria Singh
- Osteopathic Medical School, Kansas City University, Kansas City, MO, USA
| | - Mohamed Hamouda
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - Jordan H Chamberlin
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - Adrienn Tóth
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - James Munford
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - Matthew Silbergleit
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - Dhiraj Baruah
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA
| | - Jeremy R Burt
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Ismail M Kabakus
- Division of Cardiothoracic Imaging, Department of Radiology and Radiological Science, Medical University of South Carolina, Charleston, SC, USA.
| |
Collapse
|
13
|
Busch F, Hoffmann L, Dos Santos DP, Makowski MR, Saba L, Prucker P, Hadamitzky M, Navab N, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Large language models for structured reporting in radiology: past, present, and future. Eur Radiol 2025; 35:2589-2602. [PMID: 39438330 PMCID: PMC12021971 DOI: 10.1007/s00330-024-11107-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Revised: 06/29/2024] [Accepted: 09/01/2024] [Indexed: 10/25/2024]
Abstract
Structured reporting (SR) has long been a goal in radiology to standardize and improve the quality of radiology reports. Despite evidence that SR reduces errors, enhances comprehensiveness, and increases adherence to guidelines, its widespread adoption has been limited. Recently, large language models (LLMs) have emerged as a promising solution to automate and facilitate SR. Therefore, this narrative review aims to provide an overview of LLMs for SR in radiology and beyond. We found that the current literature on LLMs for SR is limited, comprising ten studies on the generative pre-trained transformer (GPT)-3.5 (n = 5) and/or GPT-4 (n = 8), while two studies additionally examined the performance of Perplexity and Bing Chat or IT5. All studies reported promising results and acknowledged the potential of LLMs for SR, with six out of ten studies demonstrating the feasibility of multilingual applications. Building upon these findings, we discuss limitations, regulatory challenges, and further applications of LLMs in radiology report processing, encompassing four main areas: documentation, translation and summarization, clinical evaluation, and data mining. In conclusion, this review underscores the transformative potential of LLMs to improve efficiency and accuracy in SR and radiology report processing. KEY POINTS: Question How can LLMs help make SR in radiology more ubiquitous? Findings Current literature leveraging LLMs for SR is sparse but shows promising results, including the feasibility of multilingual applications. Clinical relevance LLMs have the potential to transform radiology report processing and enable the widespread adoption of SR. However, their future role in clinical practice depends on overcoming current limitations and regulatory challenges, including opaque algorithms and training data.
Collapse
Affiliation(s)
- Felix Busch
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.
| | - Lena Hoffmann
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Daniel Pinto Dos Santos
- Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
- Institute of Diagnostic and Interventional Radiology, University Hospital of Frankfurt, Frankfurt, Germany
| | - Marcus R Makowski
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Luca Saba
- Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy
| | - Philipp Prucker
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Martin Hadamitzky
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Nassir Navab
- Chair for Computer Aided Medical Procedures & Augmented Reality, TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
| | - Jakob Nikolas Kather
- Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
| | - Renato Cuocolo
- Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy
| | - Lisa C Adams
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
| | - Keno K Bressem
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
| |
Collapse
|
14
|
Sumner J, Wang Y, Tan SY, Chew EHH, Wenjun Yip A. Perspectives and Experiences With Large Language Models in Health Care: Survey Study. J Med Internet Res 2025; 27:e67383. [PMID: 40310666 DOI: 10.2196/67383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2024] [Revised: 01/14/2025] [Accepted: 01/15/2025] [Indexed: 05/02/2025] Open
Abstract
BACKGROUND Large language models (LLMs) are transforming how data is used, including within the health care sector. However, frameworks including the Unified Theory of Acceptance and Use of Technology highlight the importance of understanding the factors that influence technology use for successful implementation. OBJECTIVE This study aimed to (1) investigate users' uptake, perceptions, and experiences regarding LLMs in health care and (2) contextualize survey responses by demographics and professional profiles. METHODS An electronic survey was administered to elicit stakeholder perspectives of LLMs (health care providers and support functions), their experiences with LLMs, and their potential impact on functional roles. Survey domains included: demographics (6 questions), user experiences of LLMs (8 questions), motivations for using LLMs (6 questions), and perceived impact on functional roles (4 questions). The survey was launched electronically, targeting health care providers or support staff, health care students, and academics in health-related fields. Respondents were adults (>18 years) aware of LLMs. RESULTS Responses were received from 1083 individuals, of which 845 were analyzable. Of the 845 respondents, 221 had yet to use an LLM. Nonusers were more likely to be health care workers (P<.001), older (P<.001), and female (P<.01). Users primarily adopted LLMs for speed, convenience, and productivity. While 75% (470/624) agreed that the user experience was positive, 46% (294/624) found the generated content unhelpful. Regression analysis showed that the experience with LLMs is more likely to be positive if the user is male (odds ratio [OR] 1.62, CI 1.06-2.48), and increasing age was associated with a reduced likelihood of reporting LLM output as useful (OR 0.98, CI 0.96-0.99). 
Nonusers compared to LLM users were less likely to report LLMs meeting unmet needs (45%, 99/221 vs 65%, 407/624; OR 0.48, CI 0.35-0.65), and males were more likely to report that LLMs do address unmet needs (OR 1.64, CI 1.18-2.28). Furthermore, nonusers compared to LLM users were less likely to agree that LLMs will improve functional roles (63%, 140/221 vs 75%, 469/624; OR 0.60, CI 0.43-0.85). Free-text opinions highlighted concerns regarding autonomy, outperformance, and reduced demand for care. Respondents also predicted changes to human interactions, including fewer but higher quality interactions and a change in consumer needs as LLMs become more common, which would require provider adaptation. CONCLUSIONS Despite the reported benefits of LLMs, nonusers (primarily health care workers, older individuals, and females) appeared more hesitant to adopt these tools. These findings underscore the need for targeted education and support to address adoption barriers and ensure the successful integration of LLMs in health care. Anticipated role changes, evolving human interactions, and the risk of the digital divide further emphasize the need for careful implementation and ongoing evaluation of LLMs in health care to ensure equity and sustainability.
Collapse
Affiliation(s)
- Jennifer Sumner
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
| | - Yuchen Wang
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Si Ying Tan
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
| | - Emily Hwee Hoon Chew
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
| | - Alexander Wenjun Yip
- Alexandra Research Centre for Healthcare in a Virtual Environment, Alexandra Hospital, Singapore, Singapore
| |
Collapse
|
15
|
Jabal MS, Warman P, Zhang J, Gupta K, Jain A, Mazurowski M, Wiggins W, Magudia K, Calabrese E. Open-Weight Language Models and Retrieval-Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports: Assessment of Approaches and Parameters. Radiol Artif Intell 2025; 7:e240551. [PMID: 40072216 DOI: 10.1148/ryai.240551] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/01/2025]
Abstract
Purpose To develop and evaluate an automated system for extracting structured clinical information from unstructured radiology and pathology reports using open-weight language models (LMs) and retrieval-augmented generation (RAG) and to assess the effects of model configuration variables on extraction performance. Materials and Methods This retrospective study used two datasets: 7294 radiology reports annotated for Brain Tumor Reporting and Data System (BT-RADS) scores and 2154 pathology reports annotated for IDH mutation status (January 2017-July 2021). An automated pipeline was developed to benchmark the performance of various LMs and RAG configurations for accuracy of structured data extraction from reports. The effects of model size, quantization, prompting strategies, output formatting, and inference parameters on model accuracy were systematically evaluated. Results The best-performing models achieved up to 98% accuracy in extracting BT-RADS scores from radiology reports and greater than 90% accuracy for extraction of IDH mutation status from pathology reports. The best-performing model was a medically fine-tuned Llama 3. Larger, newer, and domain fine-tuned models consistently outperformed older and smaller models (mean accuracy, 86% vs 75%; P < .001). Model quantization had minimal effect on performance. Few-shot prompting significantly improved accuracy (mean [±SD] increase, 32% ± 32; P = .02). RAG improved performance for complex pathology reports by a mean of 48% ± 11 (P = .001) but not for shorter radiology reports (-8% ± 31; P = .39). Conclusion This study demonstrates the potential of open LMs for automated extraction of structured clinical data from unstructured clinical reports in a local, privacy-preserving application. Careful model selection, prompt engineering, and semiautomated optimization using annotated data are critical for optimal performance.
Keywords: Large Language Models, Retrieval-Augmented Generation, Radiology, Pathology, Health Care Reports Supplemental material is available for this article. © RSNA, 2025 See also commentary by Tejani and Rauschecker in this issue.
Collapse
Affiliation(s)
- Mohamed Sobhi Jabal
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
| | | | - Jikai Zhang
- Department of Electrical and Computer Engineering, Duke University, Durham, NC
- Duke Center for Artificial Intelligence in Radiology, Duke University, Durham, NC
| | - Kartikeye Gupta
- Department of Radiology, Duke University Medical Center, Durham, NC
| | - Ayush Jain
- Department of Radiology, Duke University Medical Center, Durham, NC
| | - Maciej Mazurowski
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Duke University School of Medicine, Durham, NC
- Department of Electrical and Computer Engineering, Duke University, Durham, NC
| | - Walter Wiggins
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
| | - Kirti Magudia
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
| | - Evan Calabrese
- Department of Radiology, Duke University Hospital, 2301 Erwin Rd, Durham, NC 27710
- Department of Radiology, Duke University Medical Center, Durham, NC
| |
Collapse
|
16
|
Puniya BL. Artificial-intelligence-driven innovations in mechanistic computational modeling and digital twins for biomedical applications. J Mol Biol 2025:169181. [PMID: 40316010 DOI: 10.1016/j.jmb.2025.169181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2025] [Revised: 04/09/2025] [Accepted: 04/27/2025] [Indexed: 05/04/2025]
Abstract
Understanding complex biological systems remains a significant challenge due to their high dimensionality, nonlinearity, and context-specific behavior. Artificial intelligence (AI) and mechanistic modeling are becoming essential tools for studying such complex systems. Mechanistic modeling can produce simulatable models that are interpretable but often struggle with scalability and parameter estimation. AI can integrate multi-omics data to create predictive models, but it lacks interpretability. The gap between these two modeling approaches limits our ability to develop comprehensive, predictive models for biomedical applications. This article reviews the most recent advancements in the integration of AI and mechanistic modeling to close this gap. Recently, with the growing availability of omics data, AI has led to new discoveries in mechanistic computational modeling. Mechanistic models can, in turn, provide insight into the mechanisms underlying predictions made by AI models. This integration is helpful for modeling complex systems, estimating parameters that are hard to capture in experiments, and creating surrogate models that reduce the computational cost of expensive mechanistic simulations. The article focuses on advancements in mechanistic computational models and AI models and their integration for scientific discovery in biology, pharmacology, drug discovery, and disease. Mechanistic models with AI integration can facilitate biological discoveries and advance our understanding of disease mechanisms, drug development, and personalized medicine. The article also highlights the role of AI and mechanistic model integration in the development of more advanced models in the biomedical domain, such as medical digital twins and virtual patients for pharmacological discovery.
Collapse
Affiliation(s)
- Bhanwar Lal Puniya
- Department of Biochemistry, University of Nebraska-Lincoln, Lincoln, NE 68588, United States
| |
Collapse
|
17
|
Tripathi S, Alkhulaifat D, Muppuri M, Elahi A, Dako F. Large Language Models for Global Health Clinics: Opportunities and Challenges. J Am Coll Radiol 2025:S1546-1440(25)00205-4. [PMID: 40204164 DOI: 10.1016/j.jacr.2025.04.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2024] [Revised: 03/22/2025] [Accepted: 04/02/2025] [Indexed: 04/11/2025]
Abstract
Large language models (LLMs) have emerged as a new wave of artificial intelligence, and their applications could emerge as a pivotal resource capable of reshaping health care communication, research, and informed decision-making processes. These models offer unprecedented potential to swiftly disseminate critical health information and transcend linguistic barriers. However, their integration into health care systems presents formidable challenges, including inherent biases in training data, privacy vulnerabilities, and disparities in digital literacy. Despite these obstacles, LLMs possess unparalleled analytic prowess to inform evidence-based health care policies and clinical practices. Addressing these challenges necessitates the formulation of robust ethical frameworks, bias mitigation strategies, and educational initiatives to ensure equitable access to health care resources globally. By navigating these complexities with meticulous attention and foresight, LLMs stand poised to catalyze substantial advancements in global health outcomes, promoting health equity and improving population health worldwide.
Collapse
Affiliation(s)
- Satvik Tripathi
- Center for Global and Population Health Research in Radiology, Department of Radiology, Perelman School of Medicine at University of Pennsylvania, Philadelphia, Pennsylvania; Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Dana Alkhulaifat
- Department of Radiology, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania
| | - Meghana Muppuri
- Athinoula A. Martinos Center for Biomedical Imaging, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts
| | - Ameena Elahi
- Department of Information Services, University of Pennsylvania, Philadelphia, Pennsylvania; IS Application Manager, Penn Medicine
| | - Farouk Dako
- Director, Center for Global and Population Health Research in Radiology, Department of Radiology, Perelman School of Medicine at University of Pennsylvania, Philadelphia, Pennsylvania.
| |
Collapse
|
18
|
Anderson KD, Davis CA, Pickett SM, Pohlen MS. Evaluating Large Language Models on Aerospace Medicine Principles. Wilderness Environ Med 2025:10806032251330628. [PMID: 40289627 DOI: 10.1177/10806032251330628] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/30/2025]
Abstract
Introduction Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, the generation of incorrect information may be misleading or even harmful when applied to care in this setting. Method To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM, on factual knowledge and clinical reasoning in accordance with published material in aerospace medicine. We also evaluated the consistency of the two public LLMs when answering self-generated board-style questions. Results When queried with 857 free-response questions from Aerospace Medicine Boards Questions and Answers, ChatGPT-4 had a mean reader score from 4.23 to 5.00 (Likert scale of 1-5) across chapters, whereas Gemini Advanced and the RAG LLM scored 3.30 to 4.91 and 4.69 to 5.00, respectively. When queried with 20 multiple-choice aerospace medicine board questions provided by the American College of Preventive Medicine, ChatGPT-4 and Gemini Advanced responded correctly 70% and 55% of the time, respectively, while the RAG LLM answered 85% correctly. Despite this quantitative measure of high performance, the LLMs tested still exhibited gaps in factual knowledge that potentially could be harmful, a degree of clinical reasoning that may not pass the aerospace medicine board exam, and some inconsistency when answering self-generated questions. Conclusion There is considerable promise for LLM use in autonomous medical operations in spaceflight given the anticipated continued rapid pace of development, including advancements in model training, data quality, and fine-tuning methods.
Collapse
Affiliation(s)
- Kyle D Anderson
- Department of Orthopaedics, Emory Sports Performance and Research Center, Emory School of Medicine, Flowery Branch, GA
| | - Cole A Davis
- Louisiana State University Health New Orleans School of Medicine, New Orleans, LA
| | | | - Michael S Pohlen
- Department of Radiology, Stanford University School of Medicine, Stanford, CA
| |
Collapse
|
19
|
Reichenpfader D, Knupp J, von Däniken SU, Gaio R, Dennstädt F, Cereghetti GM, Sander A, Hiltbrunner H, Nairz K, Denecke K. Enhancing Bidirectional Encoder Representations From Transformers (BERT) With Frame Semantics to Extract Clinically Relevant Information From German Mammography Reports: Algorithm Development and Validation. J Med Internet Res 2025; 27:e68427. [PMID: 40279645 PMCID: PMC12064967 DOI: 10.2196/68427] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2024] [Revised: 02/20/2025] [Accepted: 03/16/2025] [Indexed: 04/27/2025] Open
Abstract
BACKGROUND Structured reporting is essential for improving the clarity and accuracy of radiological information. Despite its benefits, the European Society of Radiology notes that it is not widely adopted. For example, while structured reporting frameworks such as the Breast Imaging Reporting and Data System provide standardized terminology and classification for mammography findings, radiology reports still mostly comprise free-text sections. This variability complicates the systematic extraction of key clinical data. Moreover, manual structuring of reports is time-consuming and prone to inconsistencies. Recent advancements in large language models have shown promise for clinical information extraction by enabling models to understand contextual nuances in medical text. However, challenges such as domain adaptation, privacy concerns, and generalizability remain. To address these limitations, frame semantics offers an approach to information extraction grounded in computational linguistics, allowing a structured representation of clinically relevant concepts. OBJECTIVE This study explores the combination of Bidirectional Encoder Representations from Transformers (BERT) architecture with the linguistic concept of frame semantics to extract and normalize information from free-text mammography reports. METHODS After creating an annotated corpus of 210 German reports for fine-tuning, we generate several BERT model variants by applying 3 pretraining strategies to hospital data. Afterward, a fact extraction pipeline is built, comprising an extractive question-answering model and a sequence labeling model. 
We quantitatively evaluate all model variants using common evaluation metrics (model perplexity, Stanford Question Answering Dataset 2.0 [SQuAD_v2], seqeval) and perform a qualitative clinician evaluation of the entire pipeline on a manually generated synthetic dataset of 21 reports, as well as a comparison with a generative approach following best practice prompting techniques using the open-source Llama 3.3 model (Meta). RESULTS Our system is capable of extracting 14 fact types and 40 entities from the clinical findings section of mammography reports. Further pretraining on hospital data reduced model perplexity, although it did not significantly impact the 2 downstream tasks. We achieved average F1-scores of 90.4% and 81% for question answering and sequence labeling, respectively (best pretraining strategy). Qualitative evaluation of the pipeline based on synthetic data shows an overall precision of 96.1% and 99.6% for facts and entities, respectively. In contrast, generative extraction shows an overall precision of 91.2% and 87.3% for facts and entities, respectively. Hallucinations and extraction inconsistencies were observed. CONCLUSIONS This study demonstrates that frame semantics provides a robust and interpretable framework for automating structured reporting. By leveraging frame semantics, the approach enables customizable information extraction and supports generalization to diverse radiological domains and clinical contexts with additional annotation efforts. Furthermore, the BERT-based model architecture allows for efficient, on-premise deployment, ensuring data privacy. Future research should focus on validating the model's generalizability across external datasets and different report types to ensure its broader applicability in clinical practice.
Collapse
Affiliation(s)
- Daniel Reichenpfader
- Institute for Patient-Centered Digital Health, School of Engineering and Computer Science, Bern University of Applied Sciences, Biel/Bienne, Switzerland
- PhD School of Life Sciences, Faculty of Medicine, University of Geneva, Geneva, Switzerland
| | | | - Sandro Urs von Däniken
- Department of Diagnostic, Interventional, and Pediatric Radiology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Roberto Gaio
- Department of Radiation Oncology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Fabio Dennstädt
- Department of Radiation Oncology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Grazia Maria Cereghetti
- Department of Diagnostic, Interventional, and Pediatric Radiology, Bern University Hospital, University of Bern, Bern, Switzerland
| | | | - Hans Hiltbrunner
- Department of Diagnostic, Interventional, and Pediatric Radiology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Knud Nairz
- Department of Diagnostic, Interventional, and Pediatric Radiology, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Kerstin Denecke
- Institute for Patient-Centered Digital Health, School of Engineering and Computer Science, Bern University of Applied Sciences, Biel/Bienne, Switzerland
| |
Collapse
|
20
|
Yin Y, Zeng M, Wang H, Yang H, Zhou C, Jiang F, Wu S, Huang T, Yuan S, Lin J, Tang M, Chen J, Dong B, Yuan J, Xie D. A clinician-based comparative study of large language models in answering medical questions: the case of asthma. Front Pediatr 2025; 13:1461026. [PMID: 40352607 PMCID: PMC12062090 DOI: 10.3389/fped.2025.1461026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/10/2024] [Accepted: 04/07/2025] [Indexed: 05/14/2025] Open
Abstract
Objective This study aims to evaluate and compare the performance of four major large language models (GPT-3.5, GPT-4.0, YouChat, and Perplexity) in answering 32 common asthma-related questions. Materials and methods Seventy-five clinicians from various tertiary hospitals participated in this study. Each clinician was tasked with evaluating the responses generated by the four large language models (LLMs) to 32 common clinical questions related to pediatric asthma. Based on predefined criteria, participants subjectively assessed the accuracy, correctness, completeness, and practicality of the LLMs' answers. The participants provided precise scores to determine the performance of each language model in answering pediatric asthma-related questions. Results GPT-4.0 performed the best across all dimensions, while YouChat performed the worst in all dimensions. Both GPT-3.5 and GPT-4.0 outperformed the other two models, but there was no significant difference in performance between GPT-3.5 and GPT-4.0 or between YouChat and Perplexity. Conclusion GPT and other large language models can answer medical questions with a certain degree of completeness and accuracy. However, clinical physicians should critically assess internet information, distinguishing between true and false data, and should not blindly accept the outputs of these models. With advancements in key technologies, LLMs may one day become a safe option for doctors seeking information.
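The rating procedure above (clinicians scoring each model's answers on a 5-point Likert scale, then comparing models by score) can be sketched as follows. The ratings and the six-rater setup are invented for illustration, not the study's data:

```python
# Sketch of Likert-score aggregation across raters; each model gets a
# mean and standard deviation, and models are ranked by mean score.
# All numbers below are hypothetical.
from statistics import mean, stdev

ratings = {                       # rater scores per model (1-5 Likert)
    "GPT-4.0":    [5, 4, 5, 4, 5, 4],
    "GPT-3.5":    [4, 4, 5, 3, 4, 4],
    "Perplexity": [3, 3, 4, 2, 3, 3],
    "YouChat":    [2, 3, 2, 3, 2, 2],
}
summary = {m: (round(mean(s), 2), round(stdev(s), 2)) for m, s in ratings.items()}
best = max(ratings, key=lambda m: mean(ratings[m]))
```

In practice a comparison like the one above would also require a significance test across paired ratings, not just a ranking of means.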
Collapse
Affiliation(s)
- Yong Yin
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Pediatric AI Clinical Application and Research Center, Shanghai Children’s Medical Center, Shanghai, China
- Shanghai Engineering Research Center of Intelligence Pediatrics (SERCIP), Shanghai, China
| | - Mei Zeng
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Hansong Wang
- Department of Performance, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Haibo Yang
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| | - Caijing Zhou
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| | - Feng Jiang
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| | - Shufan Wu
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| | - Tingyue Huang
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| | - Shuahua Yuan
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Jilei Lin
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- Pediatric AI Clinical Application and Research Center, Shanghai Children’s Medical Center, Shanghai, China
- Shanghai Engineering Research Center of Intelligence Pediatrics (SERCIP), Shanghai, China
| | - Mingyu Tang
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Jiande Chen
- Department of Respiratory Medicine, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Bin Dong
- Pediatric AI Clinical Application and Research Center, Shanghai Children’s Medical Center, Shanghai, China
- Shanghai Engineering Research Center of Intelligence Pediatrics (SERCIP), Shanghai, China
- Department of Discipline Inspection and Supervision, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Jiajun Yuan
- Pediatric AI Clinical Application and Research Center, Shanghai Children’s Medical Center, Shanghai, China
- Shanghai Engineering Research Center of Intelligence Pediatrics (SERCIP), Shanghai, China
- Medical Department of Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Dan Xie
- Department of Respiratory Medicine, Hainan Branch, Shanghai Children's Medical Center, School of Medicine, Shanghai Jiao Tong University, Sanya, China
| |
Collapse
|
21
|
Koller P, Clement C, van Eijk A, Seifert R, Zhang J, Prenosil G, Sathekge MM, Herrmann K, Baum R, Weber WA, Rominger A, Shi K. Optimizing theranostics chatbots with context-augmented large language models. Theranostics 2025; 15:5693-5704. [PMID: 40365297 PMCID: PMC12068303 DOI: 10.7150/thno.107757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2024] [Accepted: 04/03/2025] [Indexed: 05/15/2025] Open
Abstract
Introduction: Nuclear medicine theranostics is rapidly emerging as an interdisciplinary therapy option with multi-dimensional considerations. Healthcare professionals do not have time for in-depth research on every therapy option, and personalized chatbots might help educate them. Chatbots built on large language models (LLMs), such as ChatGPT, are gaining interest as a way to address these challenges. However, chatbot performance often falls short in specific domains, which is critical in healthcare applications. Methods: This study develops a framework for examining how contextual augmentation improves the performance of medical chatbots and uses it to create the first theranostics chatbot. Contextual augmentation involves providing additional relevant information to LLMs to improve their responses. We evaluate five state-of-the-art LLMs on questions translated into English and German, comparing answers generated with and without contextual augmentation, where the LLMs access pre-selected research papers via retrieval-augmented generation (RAG). We use two RAG techniques: naive RAG and advanced RAG. Results: A user study and an LLM-based evaluation assess answer quality across several metrics. The results show that advanced RAG techniques considerably enhance LLM performance. Among the models, the best-performing variants are Claude 3 Opus and GPT-4o, which consistently achieve the highest scores, indicating robust integration and utilization of contextual information. The most notable improvements between naive RAG and advanced RAG are observed in the Gemini 1.5 and Command R+ variants. Conclusion: This study demonstrates that contextual augmentation addresses the complexities inherent in theranostics. Despite promising results, key limitations include the biased selection of questions, which focus primarily on PRRT, and the need for comprehensive context documents.
Future research should include a broader range of theranostics questions, explore additional RAG methods and aim to compare human and LLM evaluations more directly to enhance LLM performance further.
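The retrieval step behind the naive RAG variant described above can be sketched as follows. This is a hypothetical stand-in: a bag-of-words cosine similarity replaces the real embedding model, and the chunk texts and prompt template are invented:

```python
# Sketch of naive RAG retrieval: rank pre-selected document chunks
# against a question and prepend the top-k to the prompt that is sent
# to the LLM. Bag-of-words cosine stands in for embedding similarity.
import re
from collections import Counter
from math import sqrt

def tokens(text):
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a, b):
    ca, cb = tokens(a), tokens(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    return sorted(chunks, key=lambda c: cosine(question, c), reverse=True)[:k]

chunks = [
    "PRRT uses Lu-177 DOTATATE to target somatostatin receptors.",
    "Renal protection with amino acid infusion is recommended during PRRT.",
    "Thyroid cancer is commonly treated with radioactive iodine.",
]
context = retrieve("What kidney protection is needed during PRRT?", chunks)
prompt = "Answer using only this context:\n" + "\n".join(context)
```

Advanced RAG variants typically add steps around this core, such as query rewriting, chunk re-ranking, or answer verification, which is consistent with the performance gap reported above.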
Collapse
Affiliation(s)
- Pia Koller
- Informatics, Ludwig-Maximilians-University, Geschwister-Scholl-Platz 1, Munich, 80539, Germany
- ITM Radiopharma, Walther-Von-Dyck Str. 4, Garching, 85748, Bavaria, Germany
| | - Christoph Clement
- Department of Nuclear Medicine, Bern University Hospital, University of Bern, Freiburgstrasse 20, Bern, 3010, Switzerland
| | - Albert van Eijk
- ITM Radiopharma, Walther-Von-Dyck Str. 4, Garching, 85748, Bavaria, Germany
| | - Robert Seifert
- Department of Nuclear Medicine, Bern University Hospital, University of Bern, Freiburgstrasse 20, Bern, 3010, Switzerland
| | - Jingjing Zhang
- Yong Loo Lin School of Medicine, National University of Singapore, 10 Medical Dr, Singapore, 117597, Singapore
| | - George Prenosil
- Department of Nuclear Medicine, Bern University Hospital, University of Bern, Freiburgstrasse 20, Bern, 3010, Switzerland
| | - Mike M. Sathekge
- Nuclear Medicine, University of Pretoria, Private Bag x 20, 0028, Hatfield, South Africa
| | - Ken Herrmann
- Department of Nuclear Medicine, University of Duisburg-Essen, and German Cancer Consortium (DKTK)-University Hospital Essen, Essen, Germany
- National Center for Tumor Diseases (NCT), NCT West, Germany
| | - Richard Baum
- International Centers for Precision Oncology (ICPO), CURANOSTICUM Wiesbaden-Frankfurt at DKD Helios Klinik, Aukammallee 33, Wiesbaden, 65191, Germany
| | - Wolfgang A. Weber
- Department of Nuclear Medicine, TUM University Hospital, Technical University Munich, Bavarian Cancer Research Center, Ismaningerstr. 22, Munich, 81675, Germany
| | - Axel Rominger
- Department of Nuclear Medicine, Bern University Hospital, University of Bern, Freiburgstrasse 20, Bern, 3010, Switzerland
| | - Kuangyu Shi
- Department of Nuclear Medicine, Bern University Hospital, University of Bern, Freiburgstrasse 20, Bern, 3010, Switzerland
- Department of Nuclear Medicine, TUM University Hospital, Technical University Munich, Bavarian Cancer Research Center, Ismaningerstr. 22, Munich, 81675, Germany
- Chair for Computer-aided Medical Procedure, School of Computation, Information & Technology, Technical University Munich, Boltzmannstr. 3, Garching, 85748, Germany
| |
Collapse
|
22
|
Choi A, Kim HG, Choi MH, Ramasamy SK, Kim Y, Jung SE. Performance of GPT-4 Turbo and GPT-4o in Korean Society of Radiology In-Training Examinations. Korean J Radiol 2025; 26:26.e46. [PMID: 40288896 DOI: 10.3348/kjr.2024.1096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2024] [Revised: 03/06/2025] [Accepted: 03/07/2025] [Indexed: 04/29/2025] Open
Abstract
OBJECTIVE Despite the potential of large language models for radiology training, their ability to handle image-based radiological questions remains poorly understood. This study aimed to evaluate the performance of GPT-4 Turbo and GPT-4o in radiology resident examinations, to analyze differences across question types, and to compare their results with those of residents at different levels. MATERIALS AND METHODS A total of 776 multiple-choice questions from the Korean Society of Radiology In-Training Examinations were used, forming two question sets: one originally written in Korean and the other translated into English. We evaluated the performance of GPT-4 Turbo (gpt-4-turbo-2024-04-09) and GPT-4o (gpt-4o-2024-11-20) on these questions with the temperature set to zero, determining accuracy based on the majority vote from five independent trials. We analyzed the results by question type (text-only vs. image-based) and benchmarked them against nationwide radiology residents' performance. The impact of the input language (Korean or English) on model performance was also examined. RESULTS GPT-4o outperformed GPT-4 Turbo for both image-based (48.2% vs. 41.8%, P = 0.002) and text-only questions (77.9% vs. 69.0%, P = 0.031). On image-based questions, GPT-4 Turbo and GPT-4o showed performance comparable to that of 1st-year residents (41.8% and 48.2%, respectively, vs. 43.3%, P = 0.608 and 0.079, respectively) but lower performance than that of 2nd- to 4th-year residents (vs. 56.0%-63.9%, all P ≤ 0.005). For text-only questions, GPT-4 Turbo and GPT-4o performed better than residents across all years (69.0% and 77.9%, respectively, vs. 44.7%-57.5%, all P ≤ 0.039). Performance on the English- and Korean-version questions showed no significant differences for either model (all P ≥ 0.275). CONCLUSION GPT-4o outperformed GPT-4 Turbo in all question types. 
On image-based questions, both models' performance matched that of 1st-year residents but was lower than that of higher-year residents. Both models demonstrated superior performance compared to residents for text-only questions. The models showed consistent performances across English and Korean inputs.
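The answer-aggregation step described above (five independent trials at temperature zero, accuracy determined by majority vote) can be sketched as:

```python
# Sketch of majority-vote aggregation over repeated model trials for one
# multiple-choice question. The query step itself is assumed to happen
# elsewhere; here we only aggregate its outputs.
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer; ties broken by first occurrence."""
    counts = Counter(answers)
    best = max(counts.values())
    for a in answers:                 # preserve trial order on ties
        if counts[a] == best:
            return a

trials = ["B", "B", "C", "B", "D"]   # five independent trials for one question
print(majority_vote(trials))         # → B
```

Even at temperature zero, repeated API calls can yield slightly different outputs, which is why aggregating over several trials gives a more stable accuracy estimate than a single run.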
Collapse
Affiliation(s)
- Arum Choi
- Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
| | - Hyun Gi Kim
- Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea.
| | - Moon Hyung Choi
- Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
| | - Shakthi Kumaran Ramasamy
- Department of Radiology, Molecular Imaging Program at Stanford, Stanford University School of Medicine, Stanford, CA, USA
| | - Youme Kim
- Department of Diagnostic Radiology, Dankook University Hospital, Cheonan, Republic of Korea
| | - Seung Eun Jung
- Department of Radiology, Eunpyeong St. Mary's Hospital, College of Medicine, The Catholic University of Korea, Seoul, Republic of Korea
| |
Collapse
|
23
|
Li C, Zhao Y, Bai Y, Zhao B, Tola YO, Chan CW, Zhang M, Fu X. Unveiling the Potential of Large Language Models in Transforming Chronic Disease Management: Mixed Methods Systematic Review. J Med Internet Res 2025; 27:e70535. [PMID: 40239198 PMCID: PMC12044321 DOI: 10.2196/70535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2024] [Revised: 01/29/2025] [Accepted: 03/19/2025] [Indexed: 04/18/2025] Open
Abstract
BACKGROUND Chronic diseases are a major global health burden, accounting for nearly three-quarters of deaths worldwide. Large language models (LLMs) are advanced artificial intelligence systems with transformative potential to optimize chronic disease management; however, robust evidence is lacking. OBJECTIVE This review aims to synthesize evidence on the feasibility, opportunities, and challenges of LLMs across the disease management spectrum, from prevention to screening, diagnosis, treatment, and long-term care. METHODS Following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, 11 databases (Cochrane Central Register of Controlled Trials, CINAHL, Embase, IEEE Xplore, MEDLINE via Ovid, ProQuest Health & Medicine Collection, ScienceDirect, Scopus, Web of Science Core Collection, China National Knowledge Infrastructure, and SinoMed) were searched on April 17, 2024. Intervention and simulation studies that examined LLMs in the management of chronic diseases were included. The methodological quality of the included studies was evaluated using a rating rubric designed for simulation-based research and the Risk Of Bias In Non-randomized Studies of Interventions tool for quasi-experimental studies. Narrative analysis with descriptive figures was used to synthesize the study findings. Random-effects meta-analyses were conducted to assess the pooled effect estimates of the feasibility of LLMs in chronic disease management. RESULTS A total of 20 studies examined general-purpose (n=17) and retrieval-augmented generation-enhanced LLMs (n=3) for the management of chronic diseases, including cancer, cardiovascular diseases, and metabolic disorders. 
LLMs demonstrated feasibility across the chronic disease management spectrum by generating relevant, comprehensible, and accurate health recommendations (pooled accuracy rate 71%, 95% CI 59%-83%; I2=88.32%), with retrieval-augmented generation-enhanced LLMs achieving higher accuracy rates than general-purpose LLMs (odds ratio 2.89, 95% CI 1.83-4.58; I2=54.45%). LLMs facilitated equitable information access; increased patient awareness regarding ailments, preventive measures, and treatment options; and promoted self-management behaviors in lifestyle modification and symptom coping. They also facilitated compassionate emotional support, social connections, and access to health care resources to improve the health outcomes of chronic diseases. However, LLMs face challenges in addressing privacy, language, and cultural issues; in undertaking advanced tasks, including diagnosis, medication, and comorbidity management; and in generating personalized regimens with real-time adjustments and multiple modalities. CONCLUSIONS LLMs have demonstrated the potential to transform chronic disease management at the individual, social, and health care levels; however, their direct application in clinical settings is still in its infancy. A multifaceted approach that incorporates robust data security, domain-specific model fine-tuning, multimodal data integration, and wearables is crucial for the evolution of LLMs into invaluable adjuncts for health care professionals to transform chronic disease management. TRIAL REGISTRATION PROSPERO CRD42024545412; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024545412.
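A sketch of the random-effects pooling behind the reported pooled rate and I2 heterogeneity statistic, using the common DerSimonian-Laird estimator; the per-study effects and variances below are invented numbers, not the review's data:

```python
# Sketch of a DerSimonian-Laird random-effects meta-analysis: each study
# contributes an effect estimate and its variance; tau^2 captures
# between-study variance and I^2 expresses heterogeneity as a percentage.

def dersimonian_laird(effects, variances):
    w = [1 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - fixed) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)                       # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    pooled = sum(wi * e for wi, e in zip(w_re, effects)) / sum(w_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return pooled, tau2, i2

effects = [0.62, 0.75, 0.81, 0.66]      # per-study accuracy proportions (invented)
variances = [0.004, 0.006, 0.003, 0.005]
pooled, tau2, i2 = dersimonian_laird(effects, variances)
```

An I2 of 88%, as reported above, signals substantial between-study heterogeneity, which is why the random-effects model (rather than a fixed-effect one) is the appropriate choice there.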
Collapse
Affiliation(s)
- Caixia Li
- The Department of Nursing, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen, China
| | - Yina Zhao
- The Department of Nursing, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen, China
| | - Yang Bai
- The School of Nursing, Sun Yat-sen University, Guangzhou, China
| | - Baoquan Zhao
- The School of Artificial Intelligence, Sun Yat-sen University, Guangzhou, China
| | | | - Carmen Wh Chan
- The Nethersole School of Nursing, The Chinese University of Hong Kong, Hong Kong, China
| | - Meifen Zhang
- The School of Nursing, Sun Yat-sen University, Guangzhou, China
| | - Xia Fu
- The Department of Nursing, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen, China
| |
Collapse
|
24
|
Liu R, Liu J, Yang J, Sun Z, Yan H. Comparative analysis of ChatGPT-4o mini, ChatGPT-4o and Gemini Advanced in the treatment of postmenopausal osteoporosis. BMC Musculoskelet Disord 2025; 26:369. [PMID: 40241048 PMCID: PMC12001388 DOI: 10.1186/s12891-025-08601-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/23/2025] [Accepted: 03/31/2025] [Indexed: 04/18/2025] Open
Abstract
BACKGROUND Osteoporosis is a sex-specific disease, and postmenopausal osteoporosis (PMOP) has been a focus of public health research worldwide. The purpose of this study is to evaluate the quality and readability of the responses that three artificial intelligence large language models (AI-LLMs; ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced) generate to questions related to PMOP. METHODS We collected 48 PMOP frequently asked questions (FAQs) through offline counseling and online medical community forums. We also prepared 24 specific questions about PMOP based on the Management of Postmenopausal Osteoporosis: 2022 ACOG Clinical Practice Guideline No. 2 (2022 ACOG-PMOP Guideline). The FAQs were submitted to the AI-LLMs (ChatGPT-4o mini, ChatGPT-4o, Gemini Advanced), and the responses were randomly assigned to four professional orthopedic surgeons, who independently rated their satisfaction with each response on a 5-point Likert scale. Furthermore, a Flesch Reading Ease (FRE) score was calculated for each LLM's responses to assess the readability of the generated text. RESULTS When addressing questions related to PMOP and the 2022 ACOG-PMOP guidelines, ChatGPT-4o and Gemini Advanced provided more concise answers than ChatGPT-4o mini. On the overall PMOP FAQs, ChatGPT-4o had a significantly higher accuracy rate than ChatGPT-4o mini and Gemini Advanced. When answering questions related to the 2022 ACOG-PMOP guidelines, both ChatGPT-4o mini and ChatGPT-4o had significantly higher response accuracy than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced all showed good levels of self-correction. CONCLUSIONS Our research shows that Gemini Advanced and ChatGPT-4o provide more concise and intuitive answers, and that ChatGPT-4o responds best to frequently asked questions related to PMOP. 
When answering questions related to the 2022 ACOG-PMOP guidelines, ChatGPT-4o mini and ChatGPT-4o responded significantly better than Gemini Advanced. ChatGPT-4o mini, ChatGPT-4o, and Gemini Advanced have demonstrated a strong ability to self-correct. CLINICAL TRIAL NUMBER Not applicable.
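The Flesch Reading Ease score used above follows a fixed formula: FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words); higher scores mean easier text. A rough stdlib sketch with a vowel-group syllable heuristic (not the tool the authors used, and the heuristic over- or under-counts some English words):

```python
# Sketch of the Flesch Reading Ease (FRE) computation. The syllable
# counter approximates syllables as runs of vowels, a common heuristic.
import re

def count_syllables(word):
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / sentences
            - 84.6 * syllables / len(words))

easy = "Take your pills. Eat well. Walk each day."
hard = ("Pharmacological management of postmenopausal osteoporosis "
        "necessitates individualized bisphosphonate administration.")
```

Short sentences of short words score high; the dense clinical sentence scores far lower, which is the property the study relies on when comparing model outputs.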
Collapse
Affiliation(s)
- Rui Liu
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
| | - Jian Liu
- College of Computer Science, Nankai University, Tianjin, 300350, China
| | - Jia Yang
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China
| | - Zhiming Sun
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
| | - Hua Yan
- Clinical College of Neurology, Neurosurgery and Neurorehabilitation, Tianjin Medical University, Tianjin, China.
| |
Collapse
|
25
|
Feng Y, Zhou Y, Xu J, Lu X, Gu R, Qiao Z. Integrating generative AI with neurophysiological methods in psychiatric practice. Asian J Psychiatr 2025; 108:104499. [PMID: 40262408 DOI: 10.1016/j.ajp.2025.104499] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/09/2024] [Revised: 04/11/2025] [Accepted: 04/12/2025] [Indexed: 04/24/2025]
Abstract
This paper explores the potential integration of generative AI (e.g., large language models) with neuroscientific and physiological approaches in psychiatric practice. Renowned for its advanced natural language processing capabilities, generative AI has shown promise in psychological counseling, emotional support, and clinical interventions. However, its application alongside neuroscience and physiology in psychiatry remains underexplored. We propose that generative AI can facilitate translations and adaptive explanations, streamline experimental preparation, enhance multi-modal data analysis, and improve clinical applications through real-time communication, content generation, and data synthesis. Furthermore, we examine how generative AI, as a specialized application of deep learning, can identify new biomarkers and construct neurophysiological models of psychiatric symptoms. We also discuss the synergistic relationship between neuroscience and AI development, particularly in improving AI's emotional recognition and learning mechanisms. While acknowledging the potential benefits, we address the challenges and risks associated with generative AI in psychiatry, including data reliability, privacy concerns, and resource constraints. This perspective advocates for a balanced approach to leveraging AI's capabilities while safeguarding mental health.
Collapse
Affiliation(s)
- Yi Feng
- Mental Health Center, Central University of Finance and Economics, Beijing 100081, China
| | - Yuan Zhou
- State Key Laboratory of Cognitive Science and Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Jian Xu
- Research Center of Brain and Cognitive Neuroscience, Liaoning Normal University, Dalian 116029, China
| | - Xinquan Lu
- State Key Laboratory of Cognitive Science and Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China
| | - Ruolei Gu
- State Key Laboratory of Cognitive Science and Mental Health, Institute of Psychology, Chinese Academy of Sciences, Beijing 100101, China; Department of Psychology, University of Chinese Academy of Sciences, Beijing 100049, China.
| | - Zhihong Qiao
- Beijing Key Laboratory of Applied Experimental Psychology, National Demonstration Center for Experimental Psychology Education (Beijing Normal University), Faculty of Psychology, Beijing Normal University, Beijing, China.
| |
Collapse
|
26
|
Nair RAS, Hartung M, Heinisch P, Jaskolski J, Starke-Knäusel C, Veríssimo S, Schmidt DM, Cimiano P. Summarizing Online Patient Conversations Using Generative Language Models: Experimental and Comparative Study. JMIR Med Inform 2025; 13:e62909. [PMID: 40228244 PMCID: PMC12038288 DOI: 10.2196/62909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/04/2024] [Revised: 12/05/2024] [Accepted: 12/20/2024] [Indexed: 04/16/2025] Open
Abstract
BACKGROUND Social media is acknowledged by regulatory bodies (eg, the Food and Drug Administration) as an important source of patient experience data for learning about patients' unmet needs, priorities, and preferences. However, current methods rely either on manual analysis, which does not scale, or on automatic processing, which yields mainly quantitative insights. Methods that can automatically summarize texts and yield qualitative insights at scale are missing. OBJECTIVE The objective of this study was to evaluate the extent to which state-of-the-art large language models can appropriately summarize posts shared by patients in web-based forums and health communities. Specifically, the goal was to compare the performance of different language models and prompting strategies on the task of summarizing documents reflecting the experiences of individual patients. METHODS In our experimental and comparative study, we applied 3 different language models (Flan-T5 and the Generative Pretrained Transformer [GPT] variants GPT-3 and GPT-3.5) in combination with various prompting strategies to the task of summarizing posts from patients in online communities. The generated summaries were evaluated against 124 manually created summaries as a ground-truth reference. As evaluation metrics, we used 2 standard metrics from the field of text generation, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and BERTScore, to compare the automatically generated summaries to the manually created reference summaries. RESULTS Among the zero-shot prompting-based large language models investigated, GPT-3.5 performed better than the other models with respect to the ROUGE metrics as well as BERTScore. While zero-shot prompting seems to be a good prompting strategy, overall GPT-3.5 in combination with directional stimulus prompting in a 3-shot setting had the best results on the aforementioned metrics. 
A manual inspection of the summaries produced by the best-performing method showed that they were accurate and plausible compared with the manual summaries. CONCLUSIONS Taken together, our results suggest that state-of-the-art pretrained language models are a valuable tool for providing qualitative insights into the patient experience: they can help us better understand unmet needs, patient priorities, and how a disease impacts daily functioning and quality of life, informing processes aimed at improving health care delivery and ensuring that drug development focuses on the actual priorities and unmet needs of patients. The key limitations of our work are the small data sample and the fact that the manual summaries were created by only 1 annotator. Furthermore, the results hold only for the examined models and prompting strategies and may not generalize to other models and strategies.
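ROUGE-1, one of the metrics named above, is unigram overlap between a candidate and a reference summary. A stdlib approximation (real evaluations would use the rouge and bert-score packages; the example texts are invented):

```python
# Sketch of ROUGE-1 F1: clipped unigram overlap between a generated
# summary and a reference, combined into precision, recall, and F1.
from collections import Counter

def rouge1_f1(candidate, reference):
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())        # clipped unigram matches
    if not cand or not ref or overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

ref = "patient reports severe joint pain in the morning"
cand = "the patient reports joint pain every morning"
score = rouge1_f1(cand, ref)
```

ROUGE rewards lexical overlap only, which is why the study pairs it with BERTScore, a metric that compares contextual embeddings and can credit paraphrases.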
Collapse
Affiliation(s)
| | | | - Philipp Heinisch
- Cognitive Interaction Technology Center, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | | | | | | | - David Maria Schmidt
- Cognitive Interaction Technology Center, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Philipp Cimiano
- Cognitive Interaction Technology Center, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Semalytix GmbH, Bielefeld, Germany
| |
Collapse
|
27
|
Socrates V, Wright DS, Huang T, Fereydooni S, Dien C, Chi L, Albano J, Patterson B, Sasidhar Kanaparthy N, Wright CX, Loza A, Chartash D, Iscoe M, Taylor RA. Identifying Deprescribing Opportunities With Large Language Models in Older Adults: Retrospective Cohort Study. JMIR Aging 2025; 8:e69504. [PMID: 40215480 PMCID: PMC12032504 DOI: 10.2196/69504] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2024] [Revised: 01/16/2025] [Accepted: 02/21/2025] [Indexed: 04/29/2025] Open
Abstract
BACKGROUND Polypharmacy, the concurrent use of multiple medications, is prevalent among older adults and associated with increased risks for adverse drug events including falls. Deprescribing, the systematic process of discontinuing potentially inappropriate medications, aims to mitigate these risks. However, the practical application of deprescribing criteria in emergency settings remains limited due to time constraints and criteria complexity. OBJECTIVE This study aims to evaluate the performance of a large language model (LLM)-based pipeline in identifying deprescribing opportunities for older emergency department (ED) patients with polypharmacy, using 3 different sets of criteria: Beers, Screening Tool of Older People's Prescriptions, and Geriatric Emergency Medication Safety Recommendations. The study further evaluates LLM confidence calibration and its ability to improve recommendation performance. METHODS We conducted a retrospective cohort study of older adults presenting to an ED in a large academic medical center in the Northeast United States from January 2022 to March 2022. A random sample of 100 patients (712 total oral medications) was selected for detailed analysis. The LLM pipeline consisted of two steps: (1) filtering high-yield deprescribing criteria based on patients' medication lists, and (2) applying these criteria using both structured and unstructured patient data to recommend deprescribing. Model performance was assessed by comparing model recommendations to those of trained medical students, with discrepancies adjudicated by board-certified ED physicians. Selective prediction, a method that allows a model to abstain from low-confidence predictions to improve overall reliability, was applied to assess the model's confidence and decision-making thresholds. 
RESULTS The LLM was significantly more effective in identifying deprescribing criteria (positive predictive value: 0.83; negative predictive value: 0.93; McNemar test for paired proportions: χ²₁=5.985; P=.02) relative to medical students, but showed limitations in making specific deprescribing recommendations (positive predictive value=0.47; negative predictive value=0.93). Adjudication revealed that while the model excelled at identifying when there was a deprescribing criterion related to one of the patient's medications, it often struggled with determining whether that criterion applied to the specific case due to complex inclusion and exclusion criteria (54.5% of errors) and ambiguous clinical contexts (eg, missing information; 39.3% of errors). Selective prediction only marginally improved LLM performance due to poorly calibrated confidence estimates. CONCLUSIONS This study highlights the potential of LLMs to support deprescribing decisions in the ED by effectively filtering relevant criteria. However, challenges remain in applying these criteria to complex clinical scenarios, as the LLM demonstrated poor performance on more intricate decision-making tasks, with its reported confidence often failing to align with its actual success in these cases. The findings underscore the need for clearer deprescribing guidelines, improved LLM calibration for real-world use, and better integration of human-artificial intelligence workflows to balance artificial intelligence recommendations with clinician judgment.
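The predictive values and paired-proportions test reported above follow from standard 2×2 confusion-table arithmetic. A minimal sketch (the counts below are illustrative, chosen only to reproduce the reported PPV of 0.83 and NPV of 0.93; they are not the study's raw data):

```python
def predictive_values(tp, fp, tn, fn):
    """Positive and negative predictive value from confusion-table counts."""
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv

def mcnemar_chi2(b, c):
    """McNemar chi-square statistic (1 df, without continuity correction),
    computed from the two discordant-pair counts b and c."""
    return (b - c) ** 2 / (b + c)

# Illustrative counts matching the reported PPV/NPV of 0.83/0.93
ppv, npv = predictive_values(tp=83, fp=17, tn=93, fn=7)
```

McNemar's test is the appropriate comparison here because the LLM and the medical students judged the same patients, so only the discordant pairs (cases where exactly one of the two raters flagged a criterion) carry information about their difference.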
Collapse
Affiliation(s)
- Vimig Socrates
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
| | - Donald S Wright
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
- VA Connecticut Healthcare System, US Department of Veterans Affairs, West Haven, CT, United States
| | - Thomas Huang
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
| | - Soraya Fereydooni
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
| | - Christine Dien
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
| | - Ling Chi
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
| | - Jesse Albano
- Department of Pharmacy, Yale New Haven Hospital, New Haven, CT, United States
| | - Brian Patterson
- BerbeeWalsh Department of Emergency Medicine, University of Wisconsin-Madison, Madison, WI, United States
| | - Naga Sasidhar Kanaparthy
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
- VA Connecticut Healthcare System, US Department of Veterans Affairs, West Haven, CT, United States
| | - Catherine X Wright
- Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, United States
| | - Andrew Loza
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
| | - David Chartash
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- School of Medicine, University College Dublin, Dublin, Ireland
| | - Mark Iscoe
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
| | - Richard Andrew Taylor
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
- Department of Emergency Medicine, School of Medicine, Yale University, New Haven, CT, United States
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States
| |
Collapse
|
28
|
Spagl KT, Watson EW, Jatowt A, Weidmann AE. Evaluating a customised large language model (DELSTAR) and its ability to address medication-related questions associated with delirium: a quantitative exploratory study. Int J Clin Pharm 2025:10.1007/s11096-025-01900-8. [PMID: 40208398 DOI: 10.1007/s11096-025-01900-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2024] [Accepted: 03/06/2025] [Indexed: 04/11/2025]
Abstract
BACKGROUND A customised large language model (LLM) could serve as a next-generation clinical pharmacy research assistant to prevent medication-associated delirium. Comprehensive evaluation strategies are still missing. AIM This quantitative exploratory study aimed to develop an approach to comprehensively assess the ability, quality and performance of the domain-specific customised delirium LLM (DELSTAR) in accurately addressing complex clinical and practice research questions on delirium that typically require extensive literature searches and meta-analyses. METHOD DELSTAR, focused on delirium-associated medications, was implemented as a 'Custom GPT' for quality assessment and as a Python-based software pipeline for performance testing on closed and leading open models. Quality metrics included statement accuracy and data credibility; performance metrics covered F1-Score, sensitivity/specificity, precision, AUC, and AUC-ROC curves. RESULTS DELSTAR provided more accurate and comprehensive information than that retrieved by traditional systematic literature reviews (SLRs) (p < 0.05) and accessed Application Programming Interfaces (APIs), private databases, and high-quality sources despite mainly relying on less reliable internet sources. GPT-3.5 and GPT-4o emerged as the most reliable foundation models. In Dataset 2, GPT-4o (F1-Score: 0.687) and Llama3-70b (F1-Score: 0.655) performed best, while in Dataset 3, GPT-3.5 (F1-Score: 0.708) and GPT-4o (F1-Score: 0.665) led. None consistently met desired threshold values across all metrics. CONCLUSION DELSTAR demonstrated potential as a clinical pharmacy research assistant, surpassing traditional SLRs in quality. Improvements are needed in high-quality data use, citation, and performance optimisation. GPT-4o, GPT-3.5, and Llama3-70b were the most suitable foundation models, but fine-tuning DELSTAR is essential to enhance sensitivity, especially critical in pharmaceutical contexts.
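The F1-Scores quoted above are the harmonic mean of precision and recall (sensitivity). A minimal sketch of these standard metric definitions (not code from the DELSTAR pipeline itself):

```python
def f1_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def sensitivity_specificity(tp, fn, tn, fp):
    """Sensitivity (recall of the positive class) and specificity
    from confusion-table counts."""
    return tp / (tp + fn), tn / (tn + fp)
```

Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is lower, which is why the authors single out sensitivity as the metric most in need of improvement through fine-tuning.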
Collapse
Affiliation(s)
- Katharina Teresa Spagl
- Department of Clinical Pharmacy, Institute of Pharmacy, Innsbruck University, Innrain 80, 6020, Innsbruck, Austria
| | - Edward William Watson
- Department of Media and Learning Technology, Innsbruck University, Innrain 52, 6020, Innsbruck, Austria
| | - Adam Jatowt
- Department of Computer Science and Digital Science Centre, Innsbruck University, Technikerstraße 21a, 6020, Innsbruck, Austria
| | - Anita Elaine Weidmann
- Department of Clinical Pharmacy, Institute of Pharmacy, Innsbruck University, Innrain 80, 6020, Innsbruck, Austria.
| |
Collapse
|
29
|
Omar M, Soffer S, Agbareia R, Bragazzi NL, Apakama DU, Horowitz CR, Charney AW, Freeman R, Kummer B, Glicksberg BS, Nadkarni GN, Klang E. Sociodemographic biases in medical decision making by large language models. Nat Med 2025:10.1038/s41591-025-03626-6. [PMID: 40195448 DOI: 10.1038/s41591-025-03626-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/29/2024] [Accepted: 03/03/2025] [Indexed: 04/09/2025]
Abstract
Large language models (LLMs) show promise in healthcare, but concerns remain that they may produce medically unjustified clinical care recommendations reflecting the influence of patients' sociodemographic characteristics. We evaluated nine LLMs, analyzing over 1.7 million model-generated outputs from 1,000 emergency department cases (500 real and 500 synthetic). Each case was presented in 32 variations (31 sociodemographic groups plus a control) while holding clinical details constant. Compared to both a physician-derived baseline and each model's own control case without sociodemographic identifiers, cases labeled as Black or unhoused or identifying as LGBTQIA+ were more frequently directed toward urgent care, invasive interventions or mental health evaluations. For example, certain cases labeled as being from LGBTQIA+ subgroups were recommended mental health assessments approximately six to seven times more often than clinically indicated. Similarly, cases labeled as having high-income status received significantly more recommendations (P < 0.001) for advanced imaging tests such as computed tomography and magnetic resonance imaging, while low- and middle-income-labeled cases were often limited to basic or no further testing. After applying multiple-hypothesis corrections, these key differences persisted. Their magnitude was not supported by clinical reasoning or guidelines, suggesting that they may reflect model-driven bias, which could eventually lead to health disparities rather than acceptable clinical variation. Our findings, observed in both proprietary and open-source models, underscore the need for robust bias evaluation and mitigation strategies to ensure that LLM-driven medical advice remains equitable and patient centered.
Collapse
Affiliation(s)
- Mahmud Omar
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
- Maccabi Healthcare Services, Tel Aviv, Israel.
| | - Shelly Soffer
- Institute of Hematology, Davidoff Cancer Center, Rabin Medical Center, Petah Tikva, Israel
| | - Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
| | - Nicola Luigi Bragazzi
- Institute for Stroke and Dementia Research (ISD), University Hospital, Ludwig-Maximilians-University (LMU) Munich, Munich, Germany
| | - Donald U Apakama
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- Institute for Health Equity Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Carol R Horowitz
- Institute for Health Equity Research, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Alexander W Charney
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
| | - Robert Freeman
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
| | - Benjamin Kummer
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- Department of Neurology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Benjamin S Glicksberg
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA
| | - Girish N Nadkarni
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
| | - Eyal Klang
- The Windreich Department of Artificial Intelligence and Human Health, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
- The Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai and the Mount Sinai Health System, New York, NY, USA.
| |
Collapse
|
30
|
Shinn K, Henderson CS, Schenone AL, Goonewardena SN, Shore S, Murthy VL, Madamanchi C. Can ChatGPT answer patients' questions about nuclear stress tests and 18F-Fluorodeoxyglucose PET for myocardial inflammation? J Nucl Cardiol 2025:102174. [PMID: 40194756 DOI: 10.1016/j.nuclcard.2025.102174] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2024] [Revised: 02/11/2025] [Accepted: 02/27/2025] [Indexed: 04/09/2025]
Abstract
BACKGROUND Several modalities are used for stress testing and require specific patient preparation. 18F-Fluorodeoxyglucose positron emission tomography (FDG PET) is an important tool in the diagnosis and risk stratification of patients suspected of cardiac sarcoidosis and endocarditis [1-3]. There is a need for improved patient access to answers to questions regarding cardiac testing to ensure proper adherence to instructions. We sought to evaluate the effectiveness of ChatGPT in answering questions about stress testing and cardiac FDG PET inflammation scans. METHODS AND RESULTS We generated fifty-eight questions about stress testing and cardiac FDG PET inflammation scans. OpenAI ChatGPT-3.5 and -4o were used to answer the questions. The answers were graded by three nuclear cardiologists into the following categories: 1 = correct and complete, 2 = somewhat correct/somewhat complete, 3 = incorrect: no benefit, or 4 = incorrect: harmful/misleading information (Table I). Of the 174 grades assigned to responses from ChatGPT-3.5, 62/174 (36%) were correct and complete, 93/174 (53%) were somewhat correct/somewhat complete, 12/174 (7%) were incorrect: no benefit, and 7/174 (4%) were incorrect: harmful/misleading information. Of the 174 grades assigned to responses from ChatGPT-4o, 107/174 (61%) were correct and complete, 62/174 (36%) were somewhat correct/somewhat complete, 3/174 (2%) were incorrect: no benefit, and 2/174 (1%) were incorrect: harmful/misleading information. CONCLUSIONS ChatGPT can provide some accurate responses to patient questions regarding stress tests and cardiac FDG PET inflammation studies, and its accuracy has improved over time; however, it is not suitable as a primary resource for clinical care at this stage of development.
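The 174 grades per model are the 58 questions each rated by three cardiologists, and the reported percentages are simple proportions over that total. A sketch of the tally, using the grade counts stated in the abstract:

```python
from collections import Counter

def grade_percentages(counts, total):
    """Percentage of grades falling in each category, rounded to whole
    percents as in the abstract."""
    return {grade: round(100 * n / total) for grade, n in counts.items()}

# Grade counts from the abstract: 1=correct/complete, 2=somewhat,
# 3=incorrect (no benefit), 4=incorrect (harmful/misleading)
gpt35 = Counter({1: 62, 2: 93, 3: 12, 4: 7})
gpt4o = Counter({1: 107, 2: 62, 3: 3, 4: 2})
```

The shift of mass from category 2 to category 1 between the two models (53% → 36% "somewhat correct" versus 36% → 61% "correct and complete") is what the authors summarize as accuracy improving over time.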
Collapse
Affiliation(s)
- Kaitlin Shinn
- Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA
| | - Cory S Henderson
- Division of Cardiology, Department of Internal Medicine, Boston Medical Center, Boston, MA, USA
| | - Aldo L Schenone
- Division of Cardiology, Montefiore Medical Center/Albert Einstein College of Medicine, Bronx, NY, USA
| | - Sascha N Goonewardena
- Frankel Cardiovascular Center, Division of Cardiovascular Medicine, Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA; Division of Cardiovascular Medicine, VA Ann Arbor Health System, Ann Arbor, MI, USA
| | - Supriya Shore
- Frankel Cardiovascular Center, Division of Cardiovascular Medicine, Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA
| | - Venkatesh L Murthy
- Frankel Cardiovascular Center, Division of Cardiovascular Medicine, Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA
| | - Chaitanya Madamanchi
- Frankel Cardiovascular Center, Division of Cardiovascular Medicine, Department of Internal Medicine, University of Michigan, Ann Arbor, MI, USA.
| |
Collapse
|
31
|
Gunesli I, Aksun S, Fathelbab J, Yildiz BO. Comparative evaluation of ChatGPT-4, ChatGPT-3.5 and Google Gemini on PCOS assessment and management based on recommendations from the 2023 guideline. Endocrine 2025; 88:315-322. [PMID: 39623241 DOI: 10.1007/s12020-024-04121-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/25/2024] [Accepted: 11/23/2024] [Indexed: 03/25/2025]
Abstract
CONTEXT Artificial intelligence (AI) is increasingly utilized in healthcare, with models like ChatGPT and Google Gemini gaining global popularity. Polycystic ovary syndrome (PCOS) is a prevalent condition that requires both lifestyle modifications and medical treatment, highlighting the critical need for effective patient education. This study compares the responses of ChatGPT-4, ChatGPT-3.5 and Gemini to PCOS-related questions using the latest guideline. Evaluating AI's integration into patient education necessitates assessing response quality, reliability, readability and effectiveness in managing PCOS. PURPOSE To evaluate the accuracy, quality, readability and tendency to hallucinate of ChatGPT-4, ChatGPT-3.5 and Gemini's responses to questions about PCOS, its assessment and management based on recommendations from the current international PCOS guideline. METHODS This cross-sectional study assessed ChatGPT-4, ChatGPT-3.5, and Gemini's responses to PCOS-related questions created by endocrinologists using the latest guidelines and common patient queries. Experts evaluated the responses for accuracy, quality and tendency to hallucinate using Likert scales, while readability was analyzed using standard formulas. RESULTS ChatGPT-4 and ChatGPT-3.5 attained higher scores in accuracy and quality compared to Gemini (p = 0.001, p < 0.001 and p = 0.007, p < 0.001 respectively). However, Gemini obtained a higher readability score compared to the other chatbots (p < 0.001). Tendency-to-hallucinate scores also differed significantly, driven by Gemini's lower scores (p = 0.003). CONCLUSION The high accuracy and quality of responses provided by ChatGPT-4 and 3.5 to questions about PCOS suggest that they could be supportive in clinical practice. Future technological advancements may facilitate the use of artificial intelligence in both educating patients with PCOS and supporting the management of the disorder.
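Readability was scored with standard formulas; the abstract does not name which, but the Flesch Reading Ease index is a common choice and illustrates the form such formulas take (a sketch under that assumption, not the study's actual implementation):

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease score: higher values indicate easier text
    (roughly 60-70 corresponds to plain English)."""
    return (206.835
            - 1.015 * (words / sentences)
            - 84.6 * (syllables / words))
```

Because the formula penalizes long sentences and polysyllabic words, a chatbot that answers in short, simple sentences can score higher on readability even when, as here, its accuracy and quality ratings are lower.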
Collapse
Affiliation(s)
- Irmak Gunesli
- Hacettepe University School of Medicine, Department of Internal Medicine, Ankara, Turkey
| | - Seren Aksun
- Hacettepe University School of Medicine, Department of Internal Medicine, Ankara, Turkey
- Hacettepe University School of Medicine, Division of Endocrinology and Metabolism, Ankara, Turkey
| | | | - Bulent Okan Yildiz
- Hacettepe University School of Medicine, Department of Internal Medicine, Ankara, Turkey.
- Hacettepe University School of Medicine, Division of Endocrinology and Metabolism, Ankara, Turkey.
| |
Collapse
|
32
|
Zambrano Chaves JM, Huang SC, Xu Y, Xu H, Usuyama N, Zhang S, Wang F, Xie Y, Khademi M, Yang Z, Awadalla H, Gong J, Hu H, Yang J, Li C, Gao J, Gu Y, Wong C, Wei M, Naumann T, Chen M, Lungren MP, Chaudhari A, Yeung-Levy S, Langlotz CP, Wang S, Poon H. A clinically accessible small multimodal radiology model and evaluation metric for chest X-ray findings. Nat Commun 2025; 16:3108. [PMID: 40169573 PMCID: PMC11962106 DOI: 10.1038/s41467-025-58344-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2024] [Accepted: 03/19/2025] [Indexed: 04/03/2025] Open
Abstract
Large foundation models show promise in biomedicine but face challenges in clinical use due to performance gaps, accessibility, cost, and lack of scalable evaluation. Here we show that open-source small multimodal models can bridge these gaps in radiology by generating free-text findings from chest X-ray images. Our data-centric approach leverages 697K curated radiology image-text pairs to train a specialized, domain-adapted chest X-ray encoder. We integrate this encoder with pre-trained language models via a lightweight adapter that aligns image and text modalities. To enable robust, clinically relevant evaluation, we develop and validate CheXprompt, a GPT-4-based metric for assessing factual accuracy aligned with radiologists' evaluations. Benchmarked with CheXprompt and other standard factuality metrics, LLaVA-Rad (7B) achieves state-of-the-art performance, outperforming much larger models like GPT-4V and Med-PaLM M (84B). While not immediately ready for real-time clinical deployment, LLaVA-Rad is a scalable, privacy-preserving and cost-effective step towards clinically adaptable multimodal AI for radiology.
Collapse
Affiliation(s)
| | | | - Yanbo Xu
- Microsoft Research, Redmond, WA, USA
| | - Hanwen Xu
- University of Washington, Seattle, WA, USA
| | | | | | - Fei Wang
- University of Southern California, Los Angeles, CA, USA
| | - Yujia Xie
- Microsoft Research, Redmond, WA, USA
| | | | - Ziyi Yang
- Microsoft Research, Redmond, WA, USA
| | | | | | | | | | | | | | - Yu Gu
- Microsoft Research, Redmond, WA, USA
| | | | - Mu Wei
- Microsoft Research, Redmond, WA, USA
| | | | - Muhao Chen
- University of California, Davis, CA, USA
| | - Matthew P Lungren
- Microsoft Research, Redmond, WA, USA
- Stanford University, Stanford, CA, USA
- University of California, San Francisco, CA, USA
| | | | | | | | - Sheng Wang
- University of Washington, Seattle, WA, USA.
| | | |
Collapse
|
33
|
Weicken E, Mittermaier M, Hoeren T, Kliesch J, Wiegand T, Witzenrath M, Ballhausen M, Karagiannidis C, Sander LE, Gröschel MI. [Focus: artificial intelligence in medicine-Legal aspects of using large language models in clinical practice]. INNERE MEDIZIN (HEIDELBERG, GERMANY) 2025; 66:436-441. [PMID: 40085197 PMCID: PMC11965224 DOI: 10.1007/s00108-025-01861-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Accepted: 01/29/2025] [Indexed: 03/16/2025]
Abstract
BACKGROUND The use of artificial intelligence (AI) and natural language processing (NLP) methods in medicine, particularly large language models (LLMs), offers opportunities to advance the healthcare system and patient care in Germany. LLMs have recently gained importance, but their practical application in hospitals and practices has so far been limited. Research and implementation are hampered by a complex legal situation. It is essential to research LLMs in clinical studies in Germany and to develop guidelines for users. OBJECTIVE How can foundations for the data protection-compliant use of LLMs, particularly cloud-based LLMs, be established in the German healthcare system? The aim of this work is to present the data protection aspects of using cloud-based LLMs in clinical research and patient care in Germany and the European Union (EU); to this end, key statements of a legal opinion on this matter are considered. Insofar as the requirements for use are regulated by state laws (vs. federal laws), the legal situation in Berlin is used as a basis. MATERIALS AND METHODS As part of a research project, a legal opinion was commissioned to clarify the data protection aspects of the use of LLMs with cloud-based solutions at the Charité - University Hospital Berlin, Germany. Specific questions regarding the processing of personal data were examined. RESULTS The legal framework varies depending on the type of data processing and the relevant federal state (Bundesland). For anonymous data, data protection requirements need not apply. Where personal data is processed, it should be pseudonymized if possible. In the research context, patient consent is usually required to process their personal data, and data processing agreements must be concluded with the providers. Recommendations originating from LLMs must always be reviewed by medical doctors. CONCLUSIONS The use of cloud-based LLMs is possible as long as data protection requirements are observed. The legal framework is complex and requires transparency from providers. Future developments could increase the potential of AI and particularly LLMs in everyday clinical practice; however, clear legal and ethical guidelines are necessary.
Collapse
Affiliation(s)
- Eva Weicken
- Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut, Einsteinufer 37, 10587, Berlin, Deutschland
- Fächerverbund für Infektiologie, Pneumologie, und Intensivmedizin, Charité - Universitätsmedizin Berlin, Südring 9, 13353, Berlin, Deutschland
| | - Mirja Mittermaier
- Fächerverbund für Infektiologie, Pneumologie, und Intensivmedizin, Charité - Universitätsmedizin Berlin, Südring 9, 13353, Berlin, Deutschland
- Berlin Institute of Health at Charité, Anna-Louisa-Karsch-Str. 2, 10178, Berlin, Deutschland
| | - Thomas Hoeren
- Institut für Informations‑, Telekommunikations- und Medienrecht (ITM), Universität Münster, Leonardo-Campus 9, 48149, Münster, Deutschland
| | - Juliana Kliesch
- Bird & Bird LLP, Am Sandtorkai 50, 20457, Hamburg, Deutschland
| | - Thomas Wiegand
- Fraunhofer-Institut für Nachrichtentechnik, Heinrich-Hertz-Institut, Einsteinufer 37, 10587, Berlin, Deutschland
- Technische Universität Berlin, Straße des 17. Juni 135, 10623, Berlin, Deutschland
| | - Martin Witzenrath
- Fächerverbund für Infektiologie, Pneumologie, und Intensivmedizin, Charité - Universitätsmedizin Berlin, Südring 9, 13353, Berlin, Deutschland
| | | | - Christian Karagiannidis
- Lungenklinik Köln-Merheim, Abteilung für Pneumologie und Intensivmedizin, ARDS and ECMO Zentrum, Ostmerheimer Str. 200, 51109, Köln, Deutschland
- Universitätsklinikum Witten/Herdecke, Alfred-Herrhausen-Straße 50, 58455, Witten, Deutschland
| | - Leif Erik Sander
- Fächerverbund für Infektiologie, Pneumologie, und Intensivmedizin, Charité - Universitätsmedizin Berlin, Südring 9, 13353, Berlin, Deutschland
| | - Matthias I Gröschel
- Fächerverbund für Infektiologie, Pneumologie, und Intensivmedizin, Charité - Universitätsmedizin Berlin, Südring 9, 13353, Berlin, Deutschland.
| |
Collapse
|
34
|
Halfmann MC, Mildenberger P, Jorg T. [Artificial intelligence in radiology: Literature overview and reading recommendations]. RADIOLOGIE (HEIDELBERG, GERMANY) 2025; 65:266-270. [PMID: 39904811 DOI: 10.1007/s00117-025-01419-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 01/13/2025] [Indexed: 02/06/2025]
Abstract
BACKGROUND Due to the ongoing rapid advancement of artificial intelligence (AI), including large language models (LLMs), radiologists will soon face the challenge of the responsible clinical integration of these models. OBJECTIVES The aim of this work is to provide an overview of current developments regarding LLMs, potential applications in radiology, and their (future) relevance and limitations. MATERIALS AND METHODS This review analyzes publications on LLMs for specific applications in medicine and radiology. Additionally, literature related to the challenges of clinical LLM use was reviewed and summarized. RESULTS In addition to a general overview of current literature on radiological applications of LLMs, several particularly noteworthy studies on the subject are recommended. CONCLUSIONS In order to facilitate the forthcoming clinical integration of LLMs, radiologists need to engage with the topic, understand various application areas, and be aware of potential limitations in order to address challenges related to patient safety, ethics, and data protection.
Collapse
Affiliation(s)
- Moritz C Halfmann
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
| | - Peter Mildenberger
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland
| | - Tobias Jorg
- Klinik und Poliklinik für diagnostische und interventionelle Radiologie, Universitätsmedizin, Johannes-Gutenberg-Universität Mainz, Langenbeckstraße 1, 55131, Mainz, Deutschland.
| |
Collapse
|
35
|
Bergling K, Wang LC, Shivakumar O, Nandorine Ban A, Moore LW, Ginsberg N, Kooman J, Duncan N, Kotanko P, Zhang H. From bytes to bites: application of large language models to enhance nutritional recommendations. Clin Kidney J 2025; 18:sfaf082. [PMID: 40226366 PMCID: PMC11992566 DOI: 10.1093/ckj/sfaf082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Indexed: 04/15/2025] Open
Abstract
Large language models (LLMs) such as ChatGPT are increasingly positioned to be integrated into various aspects of daily life, with promising applications in healthcare, including personalized nutritional guidance for patients with chronic kidney disease (CKD). However, for LLM-powered nutrition support tools to reach their full potential, active collaboration of healthcare professionals, patients, caregivers and LLM experts is crucial. We conducted a comprehensive review of the literature on the use of LLMs as tools to enhance nutrition recommendations for patients with CKD, curated by our expertise in the field. Additionally, we considered relevant findings from adjacent fields, including diabetes and obesity management. Currently, the application of LLMs for CKD-specific nutrition support remains limited and has room for improvement. Although LLMs can generate recipe ideas, their nutritional analyses often underestimate critical food components such as electrolytes and calories. Anticipated advancements in LLMs and other generative artificial intelligence (AI) technologies are expected to enhance these capabilities, potentially enabling accurate nutritional analysis, the generation of visual aids for cooking and identification of kidney-healthy options in restaurants. While LLM-based nutritional support for patients with CKD is still in its early stages, rapid advancements are expected in the near future. Engagement from the CKD community, including healthcare professionals, patients and caregivers, will be essential to harness AI-driven improvements in nutritional care with a balanced perspective that is both critical and optimistic.
Affiliation(s)
- Karin Bergling
- Artificial Intelligence Translational Innovation Hub, Renal Research Institute, New York, NY, USA
- Lin-Chun Wang
- Fresenius Medical Care, Clinical Research, New York, NY, USA
- Oshini Shivakumar
- West London Renal and Transplant Centre, Hammersmith Hospital, Imperial College Healthcare NHS Trust, London, UK
- Andrea Nandorine Ban
- Artificial Intelligence Translational Innovation Hub, Renal Research Institute, New York, NY, USA
- Linda W Moore
- Department of Surgery, Houston Methodist Hospital, Houston, TX, USA
- Nancy Ginsberg
- Nutrition Services, Fresenius Medical Care North America, Waltham, MA, USA
- Jeroen Kooman
- Department of Internal Medicine, Maastricht University Medical Center, Maastricht, The Netherlands
- Neill Duncan
- West London Renal and Transplant Centre, Hammersmith Hospital, Imperial College Healthcare NHS Trust, London, UK
- Peter Kotanko
- Artificial Intelligence Translational Innovation Hub, Renal Research Institute, New York, NY, USA
- Department of Nephrology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Hanjie Zhang
- Artificial Intelligence Translational Innovation Hub, Renal Research Institute, New York, NY, USA
36
Lee SH, Reaume M, Fung C, MacLeod KK. Reflections on the value of Canadian multiculturalism in health care delivery: Systems-level imperative for new Canadians. Can Fam Physician 2025; 71:238-240. [PMID: 40228868 PMCID: PMC12007631 DOI: 10.46747/cfp.7104238] [Indexed: 04/16/2025]
Affiliation(s)
- Celeste Fung
- Family physician and Medical Director at St Patrick's Home of Ottawa
- Krystal Kehoe MacLeod
- Principal investigator and Director of the Centre for Care Access and Equity Research in Ottawa
37
Zhang Y, Abousamra S, Hasan M, Torre-Healy L, Krichevsky S, Shrestha S, Bremer E, Oldridge DA, Rech AJ, Furth EE, Bocklage TJ, Levens JS, Hands I, Durbin EB, Samaras D, Kurc T, Saltz JH, Gupta R. Pathomics Image Analysis of Tumor Infiltrating Lymphocytes (TILs) in Colon Cancer. Research Square 2025:rs.3.rs-6173056. [PMID: 40235501 PMCID: PMC11998795 DOI: 10.21203/rs.3.rs-6173056/v1] [Indexed: 04/17/2025]
Abstract
We developed a deep learning Pathomics image analysis workflow to generate spatial Tumor-TIL maps to visualize and quantify the abundance and spatial distribution of tumor infiltrating lymphocytes (TILs) in colon cancer. Colon cancer and lymphocyte detection in hematoxylin and eosin (H&E) stained whole slide images (WSIs) has revealed complex immuno-oncologic interactions that form TIL-rich and TIL-poor tumor habitats, which are unique in each patient sample. We compute Tumor%, total lymphocyte%, and TILs% as the proportion of the colon cancer microenvironment occupied by intratumoral lymphocytes for each WSI. Kaplan-Meier survival analyses and multivariate Cox regression were utilized to evaluate the prognostic significance of TILs% as a Pathomics biomarker. High TILs% was associated with improved overall survival (OS) and progression-free interval (PFI) in localized and metastatic colon cancer in analyses that included other clinicopathologic variables, supporting the routine use of Pathomics Tumor-TIL mapping in biomedical research, clinical trials, laboratory medicine, and precision oncology.
38
Bicknell BT, Rivers NJ, Skelton A, Sheehan D, Hodges C, Fairburn SC, Greene BJ, Panuganti B. Domain-Specific Customization for Language Models in Otolaryngology: The ENT GPT Assistant. OTO Open 2025; 9:e70125. [PMID: 40331108 PMCID: PMC12051367 DOI: 10.1002/oto2.70125] [Received: 03/14/2025] [Accepted: 04/18/2025] [Indexed: 05/08/2025]
Abstract
Objective To develop and evaluate the effectiveness of domain-specific customization in large language models (LLMs) by assessing the performance of the ENT GPT Assistant (E-GPT-A), a model specifically tailored for otolaryngology. Study Design Comparative analysis using multiple-choice questions (MCQs) from established otolaryngology resources. Setting Tertiary care academic hospital. Methods Two hundred forty clinical vignette-style MCQs were sourced from BoardVitals Otolaryngology and OTOQuest, covering a range of otolaryngology subspecialties (n = 40 for each). The E-GPT-A was developed using targeted instructions and domain-specific customization for otolaryngology. The performance of E-GPT-A was compared against top-performing and widely used artificial intelligence (AI) LLMs, including GPT-3.5, GPT-4, Claude 2.0, and Claude 2.1. Accuracy was assessed across subspecialties, question difficulty tiers, and diagnostic and management questions. Results E-GPT-A achieved an overall accuracy of 74.6%, outperforming GPT-3.5 (60.4%), Claude 2.0 (61.7%), Claude 2.1 (60.8%), and GPT-4 (68.3%). The model performed best in allergy and rhinology (85.0%) and laryngology (82.5%), while showing lower accuracy in pediatrics (62.5%) and facial plastics/reconstructive surgery (67.5%). Accuracy also declined as question difficulty increased. The average correct response percentage among otolaryngologists and otolaryngology trainees was 71.1% in the question set. Conclusion This pilot study using the E-GPT-A demonstrates the potential benefits of domain-specific customizations of language models for otolaryngology. However, further development, continuous updates, and continued real-world validation are needed to fully assess the capabilities of LLMs in otolaryngology.
Affiliation(s)
- Brenton T. Bicknell
- UAB Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Nicholas J. Rivers
- Department of Otolaryngology–Head and Neck Surgery, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Adam Skelton
- UAB Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Delaney Sheehan
- Department of Otolaryngology–Head and Neck Surgery, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Charis Hodges
- UAB Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Stevan C. Fairburn
- UAB Heersink School of Medicine, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Benjamin J. Greene
- Department of Otolaryngology–Head and Neck Surgery, University of Alabama at Birmingham, Birmingham, Alabama, USA
- Bharat Panuganti
- Department of Otolaryngology–Head and Neck Surgery, Washington University in St. Louis, St. Louis, Missouri, USA
39
Kapoor MC. Navigating the Impact of Artificial Intelligence on Medical Writing. Ann Card Anaesth 2025; 28:105-106. [PMID: 40237655 PMCID: PMC12058053 DOI: 10.4103/aca.aca_14_25] [Received: 01/18/2025] [Accepted: 01/18/2025] [Indexed: 04/18/2025]
Affiliation(s)
- Mukul Chandra Kapoor
- Department of Anesthesiology and Critical Care, Amrita School of Medicine and Amrita Institute of Medical Sciences, Amrita Vishwa Vidyapeetham, Faridabad, Haryana, India
40
Lee SH, Reaume M, Fung C, MacLeod KK. Réflexions sur la valeur du multiculturalisme canadien dans la prestation des soins de santé [Reflections on the value of Canadian multiculturalism in health care delivery]. Can Fam Physician 2025; 71:246-248. [PMID: 40228888 PMCID: PMC12007636 DOI: 10.46747/cfp.7104246] [Indexed: 04/16/2025]
Affiliation(s)
- Celeste Fung
- Family physician and Medical Director at St Patrick's Home of Ottawa
- Krystal Kehoe MacLeod
- Principal investigator and Director of the Centre for Care Access and Equity Research in Ottawa
41
Bereuter JP, Geissler ME, Klimova A, Steiner RP, Pfeiffer K, Kolbinger FR, Wiest IC, Muti HS, Kather JN. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. J Surg Educ 2025; 82:103442. [PMID: 39923296 DOI: 10.1016/j.jsurg.2025.103442] [Received: 07/14/2024] [Revised: 11/11/2024] [Accepted: 01/20/2025] [Indexed: 02/11/2025]
Abstract
OBJECTIVE Recent studies have investigated the potential of large language models (LLMs) for clinical decision making and answering exam questions based on text input. Recent developments have extended these models with vision capabilities; such image-processing LLMs are called vision-language models (VLMs). However, there is limited investigation into the applicability of VLMs and their capability to answer exam questions with image content. Therefore, the aim of this study was to examine the performance of publicly accessible LLMs in 2 different surgical question sets consisting of text and image questions. DESIGN Original text and image exam questions from 2 different surgical question subsets from the German Medical Licensing Examination (GMLE) and the United States Medical Licensing Examination (USMLE) were collected and answered by publicly available LLMs (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for their accuracy in answering text and image questions. Additionally, the LLMs' performance was compared to students' performance based on their average historical performance (AHP) in these exams. Moreover, variations in LLM performance were analyzed in relation to question difficulty and respective image type. RESULTS Overall, all LLMs achieved scores equivalent to passing grades (≥60%) on surgical text questions across both datasets. On image-based questions, only GPT-4 exceeded the score required to pass, significantly outperforming Claude-3 and Gemini-1.5 (GPT: 78% vs. Claude-3: 58% vs. Gemini-1.5: 57.3%; p < 0.001). Additionally, GPT-4 outperformed students on both text (GPT: 83.7% vs. AHP students: 67.8%; p < 0.001) and image questions (GPT: 78% vs. AHP students: 67.4%; p < 0.001). CONCLUSION GPT-4 demonstrated substantial capabilities in answering surgical text and image exam questions. Therefore, it holds considerable potential for use in surgical decision making and in the education of students and trainee surgeons.
Affiliation(s)
- Jean-Paul Bereuter
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Mark Enrik Geissler
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Anna Klimova
- Institute for Medical Informatics and Biometry, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Robert-Patrick Steiner
- Institute of Pharmacology and Toxicology, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Kevin Pfeiffer
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Fiona R Kolbinger
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Weldon School of Biomedical Engineering, Purdue University, West Lafayette, Indiana
- Isabella C Wiest
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Hannah Sophie Muti
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany
- Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany; Department of Medicine I, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
42
Grothey B, Odenkirchen J, Brkic A, Schömig-Markiefka B, Quaas A, Büttner R, Tolkach Y. Comprehensive testing of large language models for extraction of structured data in pathology. Commun Med (Lond) 2025; 5:96. [PMID: 40164789 PMCID: PMC11958830 DOI: 10.1038/s43856-025-00808-8] [Received: 06/09/2024] [Accepted: 03/13/2025] [Indexed: 04/02/2025]
Abstract
BACKGROUND Pathology departments generate large volumes of unstructured data as free-text diagnostic reports. Converting these reports into structured formats for analytics or artificial intelligence projects requires substantial manual effort by specialized personnel. While recent studies show promise in using advanced language models for structuring pathology data, they primarily rely on proprietary models, raising cost and privacy concerns. Additionally, important aspects such as prompt engineering and model quantization for deployment on consumer-grade hardware remain unaddressed. METHODS We created a dataset of 579 annotated pathology reports in German and English versions. Six language models (proprietary: GPT-4; open-source: Llama2 13B, Llama2 70B, Llama3 8B, Llama3 70B, and Qwen2.5 7B) were evaluated for their ability to extract eleven key parameters from these reports. Additionally, we investigated model performance across different prompt engineering strategies and model quantization techniques to assess practical deployment scenarios. RESULTS Here we show that open-source language models extract structured data from pathology reports with high precision, matching the accuracy of the proprietary GPT-4 model. The precision varies significantly across different models and configurations. These variations depend on the specific prompt engineering strategies and quantization methods used during model deployment. CONCLUSIONS Open-source language models demonstrate comparable performance to proprietary solutions in structuring pathology report data. This finding has significant implications for healthcare institutions seeking cost-effective, privacy-preserving data structuring solutions. The variations in model performance across different configurations provide valuable insights for practical deployment in pathology departments. Our publicly available bilingual dataset serves as both a benchmark and a resource for future research.
Affiliation(s)
- Bastian Grothey
- Institute of Pathology, University Hospital Cologne, Cologne, Germany
- Adnan Brkic
- Institute of Pathology, University Hospital Cologne, Cologne, Germany
- Alexander Quaas
- Institute of Pathology, University Hospital Cologne, Cologne, Germany
- Reinhard Büttner
- Institute of Pathology, University Hospital Cologne, Cologne, Germany
- Yuri Tolkach
- Institute of Pathology, University Hospital Cologne, Cologne, Germany
43
Guo L, Zuo Y, Yisha Z, Liu J, Gu A, Yushan R, Liu G, Li S, Liu T, Wang X. Diagnostic performance of advanced large language models in cystoscopy: evidence from a retrospective study and clinical cases. BMC Urol 2025; 25:64. [PMID: 40158093 PMCID: PMC11954320 DOI: 10.1186/s12894-025-01740-8] [Received: 09/19/2024] [Accepted: 03/11/2025] [Indexed: 04/01/2025]
Abstract
PURPOSE To evaluate the diagnostic capabilities of advanced large language models (LLMs) in interpreting cystoscopy images for the identification of common urological conditions. MATERIALS AND METHODS A retrospective analysis was conducted on 603 cystoscopy images obtained from 101 procedures. Two advanced LLMs were employed to interpret these images, and their diagnostic interpretations were systematically compared against standard clinical diagnostic assessments. The study's primary outcome measure was the overall diagnostic accuracy of the LLMs. Secondary outcomes focused on condition-specific accuracies across various urological conditions. RESULTS The combined diagnostic accuracy of both LLMs was 89.2%, with ChatGPT-4V and Claude 3.5 Sonnet achieving accuracies of 82.8% and 79.8%, respectively. Condition-specific accuracies varied considerably across urological disorders: bladder tumors (ChatGPT-4V: 92.2%, Claude 3.5 Sonnet: 80.9%), benign prostatic hyperplasia (BPH) (35.3%, 32.4%), cystitis (94.5%, 98.9%), bladder diverticula (92.3%, 53.8%), and bladder trabeculae (55.8%, 59.6%). For normal anatomical structures, accuracies were: ureteral orifice (ChatGPT-4V: 48.8%, Claude 3.5 Sonnet: 61.0%), bladder neck (97.9%, 93.8%), and prostatic urethra (64.3%, 57.1%). CONCLUSIONS Advanced language models demonstrated varying levels of diagnostic accuracy in cystoscopy image interpretation, excelling in cystitis detection while showing lower accuracy for other conditions, notably benign prostatic hyperplasia. These findings suggest promising potential for LLMs as supportive tools in urological diagnosis, particularly for urologists in training or early career stages. This study underscores the need for continued research and development to optimize these AI-driven tools, with the ultimate goal of improving diagnostic accuracy and efficiency in urological practice. CLINICAL TRIAL NUMBER Not applicable.
Affiliation(s)
- Linfa Guo
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Yingtong Zuo
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Zuhaer Yisha
- Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
- Jiuling Liu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Aodun Gu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Refate Yushan
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Guiyong Liu
- Department of Urology, Qianjiang Central Hospital of Hubei Province, Qianjiang, China
- Sheng Li
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China
- Hubei Medical Quality Control Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Wuhan Clinical Research Center for Urogenital Tumors, Zhongnan Hospital of Wuhan University, Wuhan, China
- Cancer Precision Diagnosis and Treatment and Translational Medicine Hubei Engineering Research Center, Zhongnan Hospital of Wuhan University, Wuhan, China
- Tongzu Liu
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China
- Hubei Clinical Research Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Institute of Urology, Wuhan University, Wuhan, China
- Hubei Medical Quality Control Center for Laparoscopic/Endoscopic Urologic Surgery, Zhongnan Hospital of Wuhan University, Wuhan, China
- Wuhan Clinical Research Center for Urogenital Tumors, Zhongnan Hospital of Wuhan University, Wuhan, China
- Cancer Precision Diagnosis and Treatment and Translational Medicine Hubei Engineering Research Center, Zhongnan Hospital of Wuhan University, Wuhan, China
- Xiaolong Wang
- Department of Urology, Zhongnan Hospital of Wuhan University, Wuhan, China
- Hubei Key Laboratory of Urological Diseases, Wuhan University, Wuhan, China
- Institute of Urology, Wuhan University, Wuhan, China
44
Logan JA, Sadhu S, Hazlewood C, Denton M, Burke SE, Simone-Soule CA, Black C, Ciaverelli C, Stulb J, Nourzadeh H, Vinogradskiy Y, Leader A, Dicker AP, Choi W, Simone NL. Bridging Gaps in Cancer Care: Utilizing Large Language Models for Accessible Dietary Recommendations. Nutrients 2025; 17:1176. [PMID: 40218934 PMCID: PMC11990115 DOI: 10.3390/nu17071176] [Received: 03/01/2025] [Revised: 03/25/2025] [Accepted: 03/26/2025] [Indexed: 04/14/2025]
Abstract
Background/Objectives: Weight management is directly linked to cancer recurrence and survival, but unfortunately, nutritional oncology counseling is not typically covered by insurance, creating a disparity for patients without nutritional education and food access. Novel ways of imparting personalized nutrition advice are needed to address this issue. Large language models (LLMs) offer a promising path toward tailoring dietary advice to individual patients. This study aimed to assess the capacity of LLMs to offer personalized dietary advice to patients with breast cancer. Methods: Thirty-one prompt templates were designed to evaluate dietary recommendations generated by ChatGPT and Gemini with variations within eight categorical variables: cancer stage, comorbidity, location, culture, age, dietary guideline, budget, and store. Seven prompts were selected for four board-certified oncology dietitians to also respond to. Responses were evaluated based on nutritional content and qualitative observations. A quantitative comparison of the calories and macronutrients of the LLM- and dietitian-generated meal plans via the Acceptable Macronutrient Distribution Ranges and United States Department of Agriculture's estimated calorie needs was performed. Conclusions: The LLMs generated personalized grocery lists and meal plans adapting to location, culture, and budget but not age, disease stage, comorbidities, or dietary guidelines. Gemini provided more comprehensive responses, including visuals and specific prices. While the dietitian-generated diets offered more adherent total daily calorie contents to the United States Department of Agriculture's estimated calorie needs, ChatGPT and Gemini offered more adherent macronutrient ratios to the Acceptable Macronutrient Distribution Range. Overall, the meal plans were not significantly different between the LLMs and dietitians. LLMs can provide personalized dietary advice to cancer patients who may lack access to this care. Grocery lists and meal plans generated by LLMs are applicable to patients with variable food access, socioeconomic means, and cultural preferences and can be a tool to increase health equity.
Affiliation(s)
- Julia A. Logan
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Sriya Sadhu
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Cameo Hazlewood
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Melissa Denton
- Sidney Kimmel Comprehensive Cancer Center, Thomas Jefferson University Hospitals, Philadelphia, PA 19107, USA
- Sara E. Burke
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Christina A. Simone-Soule
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Caroline Black
- Sidney Kimmel Comprehensive Cancer Center, Thomas Jefferson University Hospitals, Philadelphia, PA 19107, USA
- Corey Ciaverelli
- Sidney Kimmel Comprehensive Cancer Center, Thomas Jefferson University Hospitals, Philadelphia, PA 19107, USA
- Jacqueline Stulb
- Sidney Kimmel Comprehensive Cancer Center, Thomas Jefferson University Hospitals, Philadelphia, PA 19107, USA
- Hamidreza Nourzadeh
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Yevgeniy Vinogradskiy
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Amy Leader
- Department of Medical Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Adam P. Dicker
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Wookjin Choi
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
- Nicole L. Simone
- Department of Radiation Oncology, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, PA 19107, USA
45
Takita H, Kabata D, Walston SL, Tatekawa H, Saito K, Tsujimoto Y, Miki Y, Ueda D. A systematic review and meta-analysis of diagnostic performance comparison between generative AI and physicians. NPJ Digit Med 2025; 8:175. [PMID: 40121370 PMCID: PMC11929846 DOI: 10.1038/s41746-025-01543-z] [Received: 07/26/2024] [Accepted: 02/26/2025] [Indexed: 03/25/2025]
Abstract
While generative artificial intelligence (AI) has shown potential in medical diagnostics, comprehensive evaluation of its diagnostic performance and comparison with physicians has not been extensively explored. We conducted a systematic review and meta-analysis of studies validating generative AI models for diagnostic tasks published between June 2018 and June 2024. Analysis of 83 studies revealed an overall diagnostic accuracy of 52.1%. No significant performance difference was found between AI models and physicians overall (p = 0.10) or non-expert physicians (p = 0.93). However, AI models performed significantly worse than expert physicians (p = 0.007). Several models demonstrated slightly higher performance compared to non-experts, although the differences were not significant. Generative AI demonstrates promising diagnostic capabilities with accuracy varying by model. Although it has not yet achieved expert-level reliability, these findings suggest potential for enhancing healthcare delivery and medical education when implemented with appropriate understanding of its limitations.
Affiliation(s)
- Hirotaka Takita
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daijiro Kabata
- Center for Mathematical and Data Science, Kobe University, Kobe, Japan
- Shannon L Walston
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Hiroyuki Tatekawa
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Kenichi Saito
- Center for Digital Transformation of Health Care, Graduate School of Medicine, Kyoto University, Kyoto, Japan
- Yasushi Tsujimoto
- Oku Medical Clinic, Osaka, Japan
- Department of Health Promotion and Human Behavior, Kyoto University Graduate School of Medicine/School of Public Health, Kyoto University, Kyoto, Japan
- Scientific Research WorkS Peer Support Group (SRWS-PSG), Osaka, Japan
- Yukio Miki
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Daiju Ueda
- Department of Diagnostic and Interventional Radiology, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Department of Artificial Intelligence, Graduate School of Medicine, Osaka Metropolitan University, Osaka, Japan
- Center for Health Science Innovation, Osaka Metropolitan University, Osaka, Japan
46
Zhou H, Chow LS, Harnack L, Panda S, Manoogian EN, Li M, Xiao Y, Zhang R. NutriRAG: Unleashing the Power of Large Language Models for Food Identification and Classification through Retrieval Methods. medRxiv 2025:2025.03.19.25324268. [PMID: 40166577 PMCID: PMC11957177 DOI: 10.1101/2025.03.19.25324268] [Indexed: 04/02/2025]
Abstract
Objective This study explores the use of advanced Natural Language Processing (NLP) techniques to enhance food classification and dietary analysis using raw text input from a diet tracking app. Materials and Methods The study was conducted in three stages: data collection, framework development, and application. Data were collected via the myCircadianClock app, where participants logged their meals in free-text format. Only de-identified food-related entries were used. We developed NutriRAG, an NLP framework that uses a Retrieval-Augmented Generation (RAG) approach to retrieve examples and incorporates large language models such as GPT-4 and Llama-2-70b. NutriRAG was designed to identify and classify user-recorded food items into predefined categories and analyzed dietary patterns from free-text entries in a 12-week randomized clinical trial (RCT: NCT04259632). The RCT compared three groups of obese participants: those following time-restricted eating (TRE, 8-hour eating window), caloric restriction (CR, 15% reduction), and unrestricted eating (UR). Results NutriRAG significantly enhanced classification accuracy and effectively identified nutritional content and analyzed dietary patterns, with the retrieval-augmented GPT-4 model achieving a micro-F1 score of 82.24. Both interventions showed dietary alterations: CR participants ate fewer snacks and sugary foods, while TRE participants reduced nighttime eating. Conclusion NutriRAG marks a substantial advancement in AI-driven food classification and dietary analysis for nutritional assessment. The findings highlight NLP's potential to personalize nutrition and manage diet-related health issues, suggesting further research to expand these models for wider use.
Collapse
Affiliation(s)
- Huixue Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Lisa S. Chow
- Division of Diabetes, Endocrinology and Metabolism, Department of Medicine, University of Minnesota, Minneapolis, Minnesota, USA
- Minchen Li
- Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
- Yongkang Xiao
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
- Rui Zhang
- Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, Minnesota, USA
47
Suga T, Uehara O, Abiko Y, Toyofuku A. Evaluating Large Language Models for Burning Mouth Syndrome Diagnosis. J Pain Res 2025; 18:1387-1405. [PMID: 40124539 PMCID: PMC11930279 DOI: 10.2147/jpr.s509845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2024] [Accepted: 02/26/2025] [Indexed: 03/25/2025] Open
Abstract
Introduction Large language models have been proposed as diagnostic aids across various medical fields, including dentistry. Burning mouth syndrome, characterized by burning sensations in the oral cavity without identifiable cause, poses diagnostic challenges. This study explores the diagnostic accuracy of large language models in identifying burning mouth syndrome, hypothesizing potential limitations. Materials and Methods Clinical vignettes of 100 synthesized burning mouth syndrome cases were evaluated using three large language models (ChatGPT-4o, Gemini Advanced 1.5 Pro, and Claude 3.5 Sonnet). Each vignette included patient demographics, symptoms, and medical history. Large language models were prompted to provide a primary diagnosis, differential diagnoses, and their reasoning. Accuracy was determined by comparing their responses with expert evaluations. Results ChatGPT and Claude achieved an accuracy rate of 99%, while Gemini's accuracy was 89% (p < 0.001). Misdiagnoses included Persistent Idiopathic Facial Pain and combined diagnoses with inappropriate conditions. Differences were also observed in reasoning patterns and additional data requests across the large language models. Discussion Despite high overall accuracy, the models exhibited variations in reasoning approaches and occasional errors, underscoring the importance of clinician oversight. Limitations include the synthesized nature of vignettes, potential over-reliance on exclusionary criteria, and challenges in differentiating overlapping disorders. Conclusion Large language models demonstrate strong potential as supplementary diagnostic tools for burning mouth syndrome, especially in settings lacking specialist expertise. However, their reliability depends on thorough patient assessment and expert verification. Integrating large language models into routine diagnostics could enhance early detection and management, ultimately improving clinical decision-making for dentists and specialists alike.
Affiliation(s)
- Takayuki Suga
- Department of Psychosomatic Dentistry, Graduate School of Medical and Dental Sciences, Institute of Science Tokyo, Tokyo, Japan
- Osamu Uehara
- Division of Disease Control and Molecular Epidemiology, Department of Oral Growth and Development, School of Dentistry, Health Sciences University of Hokkaido, Ishikari-Tobetsu, Hokkaido, Japan
- Yoshihiro Abiko
- Division of Oral Medicine and Pathology, Department of Human Biology and Pathophysiology, School of Dentistry, Health Sciences University of Hokkaido, Ishikari-Tobetsu, Hokkaido, Japan
- Akira Toyofuku
- Department of Psychosomatic Dentistry, Graduate School of Medical and Dental Sciences, Institute of Science Tokyo, Tokyo, Japan
48
Ayik G, Kolac UC, Aksoy T, Yilmaz A, Sili MV, Tokgozoglu M, Huri G. Exploring the role of artificial intelligence in Turkish orthopedic progression exams. ACTA ORTHOPAEDICA ET TRAUMATOLOGICA TURCICA 2025; 59:18-26. [PMID: 40337975 PMCID: PMC11992947 DOI: 10.5152/j.aott.2025.24090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/12/2024] [Accepted: 01/03/2025] [Indexed: 05/09/2025]
Abstract
Objective The aim of this study was to evaluate and compare the performance of the artificial intelligence (AI) models ChatGPT-3.5, ChatGPT-4, and Gemini on the Turkish Specialization Training and Development Examination (UEGS) to determine their utility in medical education and their potential to improve patient care. Methods This retrospective study analyzed responses of ChatGPT-3.5, ChatGPT-4, and Gemini to 1000 true-or-false questions from the UEGS administered over 5 years (2018-2023). Questions, encompassing 9 orthopedic subspecialties, were categorized by 2 independent residents, with discrepancies resolved by a senior author. Artificial intelligence models were restarted for each query to prevent data retention. Performance was evaluated by calculating net scores and comparing them to orthopedic resident scores obtained from the Turkish Orthopedics and Traumatology Education Council (TOTEK) database. Statistical analyses included chi-squared tests, Bonferroni-adjusted Z tests, Cochran's Q test, and receiver operating characteristic (ROC) analysis to determine the optimal question length for AI accuracy. All AI responses were generated independently without retaining prior information. Results Significant differences in AI tool accuracy were observed across different years and subspecialties (P < .001). ChatGPT-4 consistently outperformed the other models, achieving the highest overall accuracy (95% in specific subspecialties). Notably, ChatGPT-4 demonstrated superior performance in Basic and General Orthopedics and in Foot and Ankle Surgery, while Gemini and ChatGPT-3.5 showed variability in accuracy across topics and years. ROC analysis revealed a significant relationship between shorter letter counts and higher accuracy for ChatGPT-4 (P = .002). ChatGPT-4 showed a significant negative correlation between letter count and accuracy across all years (r = -0.099, P = .002) and, unlike the other AI models, outperformed residents in basic and general orthopedics (P = .015) and trauma (P = .012). Conclusion The findings underscore the advancing role of AI in the medical field, with ChatGPT-4 demonstrating significant potential as a tool for medical education and clinical decision-making. Continuous evaluation and refinement of AI technologies are essential to enhance their educational and clinical impact.
Affiliation(s)
- Gokhan Ayik
- Department of Orthopedics and Traumatology, Yuksek Ihtisas University Faculty of Medicine, Ankara, Türkiye
- Ulas Can Kolac
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Taha Aksoy
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Abdurrahman Yilmaz
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Mazlum Veysel Sili
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Mazhar Tokgozoglu
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Gazi Huri
- Department of Orthopedics and Traumatology, Hacettepe University Faculty of Medicine, Ankara, Türkiye
- Aspetar, Orthopedic and Sports Medicine Hospital, FIFA Medical Center of Excellence, Doha, Qatar
49
Dimmick AA, Su CC, Rafiuddin HS, Cicero DC. Evaluating ChatGPT for neurocognitive disorder diagnosis: a multicenter study. Clin Neuropsychol 2025:1-16. [PMID: 40091262 DOI: 10.1080/13854046.2025.2475567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2024] [Accepted: 03/02/2025] [Indexed: 03/19/2025]
Abstract
Objective: To evaluate the accuracy and reliability of ChatGPT 4 Omni in diagnosing neurocognitive disorders using comprehensive clinical data and compare its performance to previous versions of ChatGPT. Method: This project utilized a two-part design: Study 1 examined diagnostic agreement between ChatGPT 4 Omni and clinicians using a few-shot prompt approach, and Study 2 compared the diagnostic performance of ChatGPT models using a zero-shot prompt approach with data from the National Alzheimer's Coordinating Center (NACC) Uniform Data Set 3. Study 1 included 12,922 older adults (mean age = 69.13, SD = 9.87), predominantly female (57%) and White (80%). Study 2 involved 537 older adults (mean age = 67.88, SD = 9.52), majority female (57%) and White (81%). Diagnoses included no cognitive impairment, amnestic mild cognitive impairment (MCI), nonamnestic MCI, and dementia. Results: In Study 1, ChatGPT 4 Omni showed fair agreement with clinician diagnoses (χ²(9) = 6021.96, p < .001; κ = .33). Notable predictive measures of agreement included the MoCA and memory recall tests. ChatGPT 4 Omni demonstrated high internal reliability (α = .96). In Study 2, no significant diagnostic agreement was found between ChatGPT versions and clinicians. Conclusions: Although ChatGPT 4 Omni shows potential in aligning with clinician diagnoses, its diagnostic accuracy is insufficient for clinical application without human oversight. Continued refinement and comprehensive training of AI models are essential to enhance their utility in neuropsychological assessment. With rapidly developing technological innovations, integrating AI tools in clinical practice could soon improve diagnostic efficiency and accessibility to neuropsychological services.
Affiliation(s)
- A Andrew Dimmick
- Department of Psychology, University of North Texas, Denton, TX, USA
- Michael E. DeBakey VA Medical Center
- Charlie C Su
- Department of Psychology, University of North Texas, Denton, TX, USA
- Hanan S Rafiuddin
- Department of Psychology, University of North Texas, Denton, TX, USA
- David C Cicero
- Department of Psychology, University of North Texas, Denton, TX, USA
50
Vrdoljak J, Boban Z, Vilović M, Kumrić M, Božić J. A Review of Large Language Models in Medical Education, Clinical Decision Support, and Healthcare Administration. Healthcare (Basel) 2025; 13:603. [PMID: 40150453 PMCID: PMC11942098 DOI: 10.3390/healthcare13060603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2024] [Revised: 02/07/2025] [Accepted: 03/06/2025] [Indexed: 03/29/2025] Open
Abstract
Background/Objectives: Large language models (LLMs) have shown significant potential to transform various aspects of healthcare. This review aims to explore the current applications, challenges, and future prospects of LLMs in medical education, clinical decision support, and healthcare administration. Methods: A comprehensive literature review was conducted, examining the applications of LLMs across the three key domains. The analysis included their performance, challenges, and advancements, with a focus on techniques like retrieval-augmented generation (RAG). Results: In medical education, LLMs show promise as virtual patients, personalized tutors, and tools for generating study materials. Some models have outperformed junior trainees in specific medical knowledge assessments. Concerning clinical decision support, LLMs exhibit potential in diagnostic assistance, treatment recommendations, and medical knowledge retrieval, though performance varies across specialties and tasks. In healthcare administration, LLMs effectively automate tasks like clinical note summarization, data extraction, and report generation, potentially reducing administrative burdens on healthcare professionals. Despite their promise, challenges persist, including hallucination mitigation, addressing biases, and ensuring patient privacy and data security. Conclusions: LLMs have transformative potential in medicine but require careful integration into healthcare settings. Ethical considerations, regulatory challenges, and interdisciplinary collaboration between AI developers and healthcare professionals are essential. Future advancements in LLM performance and reliability through techniques such as RAG, fine-tuning, and reinforcement learning will be critical to ensuring patient safety and improving healthcare delivery.
Affiliation(s)
- Josip Vrdoljak
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Zvonimir Boban
- Department for Medical Physics, School of Medicine, University of Split, 21000 Split, Croatia
- Marino Vilović
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Marko Kumrić
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)
- Joško Božić
- Department for Pathophysiology, School of Medicine, University of Split, 21000 Split, Croatia; (J.V.); (M.V.); (M.K.)