1. Pamuk E, Bilen YE, Külekçi Ç, Kuşcu O. ChatGPT-4 vs. multi-disciplinary tumor board decisions for the therapeutic management of primary laryngeal cancer. Acta Otolaryngol 2025:1-6. [PMID: 40358250 DOI: 10.1080/00016489.2025.2502563]
Abstract
BACKGROUND Artificial intelligence-based clinical decision support systems are promising tools for addressing the increasing complexity of oncological data and treatment. However, the integration and validation of models such as ChatGPT within multidisciplinary decision-making processes for head and neck cancers remain limited. OBJECTIVE To evaluate the performance of ChatGPT-4 in the management of primary laryngeal cancer and compare it with multidisciplinary tumor board (MDT) decisions. METHODS The medical records of 25 patients with untreated laryngeal cancer were evaluated using ChatGPT-4 for therapeutic recommendations. The coherence of responses was graded from Grade 1 (totally coherent) to Grade 4 (totally incoherent) and compared with actual MDT decisions. The association between patient features and response grades was also assessed. RESULTS ChatGPT-4 provided totally coherent (Grade 1) responses consistent with MDT decisions in 72% of the patients. The rates of Grade 2 and Grade 3 coherent responses were 20% and 8%, respectively. There were no totally incoherent responses. There was no significant association between the grade of coherence and T stage, N stage, tumor localization, differentiation, or age (p = 0.106, p = 0.588, p = 0.271, p = 0.677, p = 0.506, respectively). CONCLUSION With further improvements, ChatGPT-4 can be a promising adjunct tool for clinicians in decision-making for primary laryngeal cancer.
Affiliation(s)
- Erim Pamuk
- Department of Otorhinolaryngology, Hacettepe University
- Çağrı Külekçi
- Department of Otorhinolaryngology, Hacettepe University
- Oğuz Kuşcu
- Department of Otorhinolaryngology, Hacettepe University
2. Ozkan E, Tekin A, Ozkan MC, Cabrera D, Niven A, Dong Y. Global Health Care Professionals' Perceptions of Large Language Model Use in Practice: Cross-Sectional Survey Study. JMIR Med Educ 2025; 11:e58801. [PMID: 40354644 DOI: 10.2196/58801]
Abstract
Background ChatGPT is a large language model-based chatbot developed by OpenAI. ChatGPT has many potential applications in health care, including enhanced diagnostic accuracy and efficiency, improved treatment planning, and better patient outcomes. However, health care professionals' perceptions of ChatGPT and similar artificial intelligence tools are not well known. Understanding these attitudes is important to inform the best approaches to exploring their use in medicine. Objective Our aim was to evaluate health care professionals' awareness and perceptions regarding potential applications of ChatGPT in the medical field, including the potential benefits and challenges of adoption. Methods We designed a 33-question online survey that was distributed among health care professionals via targeted emails and professional Twitter and LinkedIn accounts. The survey included a range of questions to define respondents' demographic characteristics, familiarity with ChatGPT, perceptions of this tool's usefulness and reliability, and opinions on its potential to improve patient care, research, and education efforts. Results One hundred and fifteen health care professionals from 21 countries responded to the survey, including physicians, nurses, researchers, and educators. Of these, 101 (87.8%) had heard of ChatGPT, mainly from peers, social media, and news, and 77 (76.2%) had used ChatGPT at least once. Participants found ChatGPT to be helpful for writing manuscripts (n=31, 45.6%), emails (n=25, 36.8%), and grants (n=12, 17.6%); accessing the latest research and evidence-based guidelines (n=21, 30.9%); providing suggestions on diagnosis or treatment (n=15, 22.1%); and improving patient communication (n=12, 17.6%). Respondents also felt that the ability of ChatGPT to access and summarize research articles (n=22, 46.8%), provide quick answers to clinical questions (n=15, 31.9%), and generate patient education materials (n=10, 21.3%) was helpful. However, respondents expressed concerns about the use of ChatGPT, for example, regarding the accuracy of responses (n=14, 29.8%), limited applicability in specific practices (n=18, 38.3%), and legal and ethical considerations (n=6, 12.8%), mainly related to plagiarism or copyright violations. Participants stated that safety protocols such as data encryption (n=63, 62.4%) and access control (n=52, 51.5%) could assist in ensuring patient privacy and data security. Conclusions Our findings show that ChatGPT use is widespread among health care professionals in daily clinical, research, and educational activities. The majority of our participants found ChatGPT to be useful; however, there are concerns about patient privacy, data security, legal and ethical issues, and the accuracy of its information. Further studies are required to understand the impact of ChatGPT and other large language models on clinical, educational, and research outcomes, and the concerns regarding its use must be addressed systematically through appropriate methods.
Affiliation(s)
- Ecem Ozkan
- Department of Medicine, Jersey Shore University Medical Center, 1945 NJ-33, Neptune, NJ, 07753, United States, 1 5078843064
- Aysun Tekin
- Department of Anesthesiology, Mayo Clinic College of Medicine, Rochester, MN, United States
- Mahmut Can Ozkan
- Department of Medicine, Jersey Shore University Medical Center, 1945 NJ-33, Neptune, NJ, 07753, United States, 1 5078843064
- Daniel Cabrera
- Department of Emergency Medicine, Mayo Clinic College of Medicine, Rochester, MN, United States
- Alexander Niven
- Department of Pulmonary and Critical Care Medicine, Mayo Clinic College of Medicine, Rochester, MN, United States
- Yue Dong
- Department of Anesthesiology, Mayo Clinic College of Medicine, Rochester, MN, United States
3. Tajirian T, Lo B, Strudwick G, Tasca A, Kendell E, Poynter B, Kumar S, Chang PYB, Kung C, Schachter D, Zai G, Kiang M, Hoppe T, Ling S, Haider U, Rabel K, Coombe N, Jankowicz D, Sockalingam S. Assessing the Impact on Electronic Health Record Burden After Five Years of Physician Engagement in a Canadian Mental Health Organization: Mixed-Methods Study. JMIR Hum Factors 2025; 12:e65656. [PMID: 40344205 DOI: 10.2196/65656]
Abstract
Background The burden caused by the use of electronic health record (EHR) systems continues to be an important issue for health care organizations, especially given human resource shortages in health care systems globally. As physicians report spending 2 hours documenting for every hour of patient care, there has been strong interest from many organizations to understand and address the root causes of physician burnout due to EHR burden. Objective This study focuses on evaluating physician burnout related to EHR usage and the impact of a physician engagement strategy at a Canadian mental health organization 5 years after implementation. Methods A cross-sectional survey was conducted to assess the perceived impact of the physician engagement strategy on burnout associated with EHR use. Physicians were invited to participate in a web-based survey that included the Mini-Z Burnout questionnaire, along with questions about their perceptions of the EHR and the effectiveness of the initiatives within the physician engagement strategy. Descriptive statistics were applied to analyze the quantitative data, while thematic analysis was used for the qualitative data. Results Of the 254 physicians invited, 128 completed the survey, resulting in a 50% response rate. Among the respondents, 26% (33/128) met the criteria for burnout according to the Mini-Z questionnaire, with 61% (20/33) of these attributing their burnout to EHR use. About 52% of participants indicated that the EHR improves communication (67/128) and 38% agreed that the EHR enables high-quality care (49/128). Regarding the physician engagement strategy initiatives, 39% (50/128) agreed that communication through the strategy is efficient, and 75% (96/128) felt more proficient in using the EHR. However, additional areas for improvement within the EHR were identified, including (1) medication reconciliation and prescription processes; (2) chart navigation and information retrieval; (3) longitudinal medication history; and (4) technology infrastructure challenges. Conclusions This study highlights the potential impact of EHRs on physician burnout and the effectiveness of a unique physician engagement strategy in fostering positive perceptions and improving EHR usability among physicians. By evaluating this initiative in a real-world setting, the study contributes to the broader literature on strategies aimed at enhancing physician experience following large-scale EHR implementation. However, the findings indicate a continued need for system-level improvements to maximize the value and usage of EHRs. The physician engagement strategy demonstrates the potential to enhance physicians' EHR experience. Future efforts should prioritize system-level advancements to increase the EHR's impact on quality of care and develop standardized approaches for engaging physicians on a broader Canadian scale.
Affiliation(s)
- Tania Tajirian
- Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Brian Lo
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Institute for Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada
- Information Technology, Unity Health Toronto, Toronto, ON, Canada
- Gillian Strudwick
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Institute for Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada
- Adam Tasca
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Emily Kendell
- Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Brittany Poynter
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Sanjeev Kumar
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Po-Yen Brian Chang
- Department of Family and Community Medicine, University of Toronto, Toronto, ON, Canada
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Candice Kung
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Debbie Schachter
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Gwyneth Zai
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Michael Kiang
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Tamara Hoppe
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Sara Ling
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Lawrence Bloomberg Faculty of Nursing, University of Toronto, Toronto, ON, Canada
- Uzma Haider
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Kavini Rabel
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Noelle Coombe
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Damian Jankowicz
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Institute for Health Policy, Management and Evaluation, University of Toronto, Toronto, ON, Canada
- Information Technology, Unity Health Toronto, Toronto, ON, Canada
- Sanjeev Sockalingam
- Centre for Addiction and Mental Health, Office 6168G, 100 Stokes Street, Toronto, ON, Canada, 1 (416) 535-8501 ext 30515
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
4. Dietrich N, Bradbury NC, Loh C. Prompt Engineering for Large Language Models in Interventional Radiology. AJR Am J Roentgenol 2025. [PMID: 40334089 DOI: 10.2214/ajr.25.32956]
Abstract
Prompt engineering plays a crucial role in optimizing artificial intelligence (AI) and large language model (LLM) outputs by refining input structure, a key factor in medical applications where precision and reliability are paramount. This Clinical Perspective provides an overview of prompt engineering techniques and their relevance to interventional radiology (IR). It explores key strategies, including zero-shot, one- or few-shot, chain-of-thought, tree-of-thought, self-consistency, and directional stimulus prompting, demonstrating their application in IR-specific contexts. Practical examples illustrate how these techniques can be effectively structured for workplace and clinical use. Additionally, the article discusses best practices for designing effective prompts and addresses challenges in the clinical use of generative AI, including data privacy and regulatory concerns. It concludes with an outlook on the future of generative AI in IR, highlighting advances including retrieval-augmented generation, domain-specific LLMs, and multimodal models.
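As an editorial illustration of the techniques this article names (not material from the article itself), the sketch below shows how zero-shot, few-shot, and chain-of-thought prompts for a hypothetical interventional radiology question might be assembled as chat-style message lists in Python. The system text, example question, and answers are invented, and sending the messages to any particular model is left out.

```python
# Minimal sketch of three prompting strategies expressed as chat-style message
# lists. All question/answer text is invented for illustration only.

def zero_shot(question: str) -> list[dict]:
    # Zero-shot: the model receives only the task, with no worked examples.
    return [
        {"role": "system", "content": "You are an interventional radiology assistant."},
        {"role": "user", "content": question},
    ]

def few_shot(question: str, examples: list[tuple[str, str]]) -> list[dict]:
    # Few-shot: prepend worked question/answer pairs so the model can imitate
    # the desired format and level of detail.
    messages = [{"role": "system", "content": "You are an interventional radiology assistant."}]
    for q, a in examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages

def chain_of_thought(question: str) -> list[dict]:
    # Chain-of-thought: explicitly ask the model to reason step by step before
    # committing to a recommendation.
    return [
        {"role": "system", "content": "You are an interventional radiology assistant."},
        {"role": "user", "content": question + " Think through the relevant anatomy, "
                                               "contraindications, and guidelines step by step, "
                                               "then state your recommendation."},
    ]

if __name__ == "__main__":
    q = "A patient with cirrhosis and refractory ascites is referred for TIPS. What should be checked first?"
    ex = [("What is the usual access site for TIPS?",
           "The right internal jugular vein is the usual access site.")]
    for name, msgs in [("zero-shot", zero_shot(q)),
                       ("few-shot", few_shot(q, ex)),
                       ("chain-of-thought", chain_of_thought(q))]:
        print(f"{name}: {len(msgs)} messages")
```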
Affiliation(s)
- Nicholas Dietrich
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, Toronto, Ontario, Canada M5S 1A8
- Nicholas C Bradbury
- University of North Dakota School of Medicine and Health Sciences, 1301 N Columbia Rd, Grand Forks, ND, USA 58203
- Christopher Loh
- University of North Dakota School of Medicine and Health Sciences, 1301 N Columbia Rd, Grand Forks, ND, USA 58203
5. Değerli Yİ, Özata Değerli MN. Using ChatGPT as a tool during occupational therapy intervention: A case report in mild cognitive impairment. Assist Technol 2025; 37:165-174. [PMID: 39446069 DOI: 10.1080/10400435.2024.2416495]
Abstract
This case report examined the impact of a computer-programmed, assistive technology-based occupational therapy intervention, designed using ChatGPT as a tool, on a client's independence in activities of daily living. A 66-year-old female client with mild cognitive impairment consulted an occupational therapist due to difficulties with activities of daily living. The occupational therapist developed two activity-assistance computer programs using ChatGPT as a resource. The client did not interact directly with ChatGPT; instead, the occupational therapist used the technology to design and implement the intervention. The computer-programmed assistive technology-based occupational therapy intervention was delivered over eight weeks. The occupational therapist trained the client to use these programs in the clinical setting and at home. As a result of the intervention, the client's performance and independence in daily activities improved. The results of this study emphasize that ChatGPT may help occupational therapists, as a tool, to design simple computer-programmed assistive technology interventions without requiring additional professional input.
Affiliation(s)
- Yusuf İslam Değerli
- Kızılcahamam Vocational School of Health Services, Ankara University, Ankara, Turkey
6. Syryca F, Gräßer C, Trenkwalder T, Nicol P. Automated generation of echocardiography reports using artificial intelligence: a novel approach to streamlining cardiovascular diagnostics. Int J Cardiovasc Imaging 2025; 41:967-977. [PMID: 40159559 PMCID: PMC12075404 DOI: 10.1007/s10554-025-03382-1]
Abstract
Accurate interpretation of echocardiography measurements is essential for diagnosing cardiovascular diseases and guiding clinical management. The emergence of large language models (LLMs) like ChatGPT presents a novel opportunity to automate the generation of echocardiography reports and provide clinical recommendations. This study aimed to evaluate the ability of an LLM (ChatGPT) to 1) generate comprehensive echocardiography reports based solely on provided echocardiographic measurements and, 2) when enriched with clinical information, formulate accurate diagnoses along with appropriate recommendations for further tests, treatment, and follow-up. Echocardiographic data from n = 13 fictional cases (Group 1) and n = 8 clinical cases (Group 2) were input into the LLM. The model's outputs were compared against standard clinical assessments conducted by experienced cardiologists. Using a dedicated scoring system, the LLM's performance was evaluated and stratified based on its accuracy in report generation, diagnostic precision, and the appropriateness of its recommendations. Patterns, frequency, and examples of misinterpretations by the LLM were analysed. Across all cases, the mean total score was 6.86 (SD = 1.12). Group 1 had a mean total score of 6.54 (SD = 1.13) and accuracy of 3.92 (SD = 0.86), while Group 2 scored 7.38 (SD = 0.92) and 4.38 (SD = 0.92), respectively. Recommendation scores were 2.62 (SD = 0.51) for Group 1 and 3.00 (SD = 0.00) for Group 2, with no significant difference (p = 0.096). Overall, 85.7% of reports were fully acceptable, 14.3% were borderline acceptable, and none were unacceptable. Of 299 parameters, 5.3% were misinterpreted. The LLM demonstrated a high level of accuracy in generating detailed echocardiography reports, mostly correctly identifying normal and abnormal findings, and making accurate diagnoses across a range of cardiovascular conditions. ChatGPT, as an LLM, shows significant potential in automating the interpretation of echocardiographic data, offering accurate diagnostic insights and clinical recommendations. These findings suggest that LLMs could serve as valuable tools in clinical practice, assisting clinicians and streamlining clinical workflows.
Affiliation(s)
- Finn Syryca
- Department of Cardiovascular Diseases, German Heart Centre Munich, School of Medicine and Health, TUM University Hospital, Technical University of Munich, Munich, Germany
- Christian Gräßer
- Department of Cardiovascular Diseases, German Heart Centre Munich, School of Medicine and Health, TUM University Hospital, Technical University of Munich, Munich, Germany
- Teresa Trenkwalder
- Department of Cardiovascular Diseases, German Heart Centre Munich, School of Medicine and Health, TUM University Hospital, Technical University of Munich, Munich, Germany
- Philipp Nicol
- Department of Cardiovascular Diseases, German Heart Centre Munich, School of Medicine and Health, TUM University Hospital, Technical University of Munich, Munich, Germany.
- MVZ Med 360 Grad Alter Hof Kardiologe Und Nuklearmedizin, Dienerstraße 12, 80331, Munich, Germany.
7. Gim H, Cook B, Le J, Stretton B, Gao C, Gupta A, Kovoor J, Guo C, Arnold M, Gheihman G, Bacchi S. Large language model-supported interactive case-based learning: a pilot study. Intern Med J 2025; 55:852-855. [PMID: 40125598 DOI: 10.1111/imj.70030]
Abstract
Large language models (LLMs) have been proposed as a means to augment case-based learning but are prone to generating factually incorrect content. In this study, an LLM-based tool was developed, and its performance evaluated. In response to student-generated questions, the LLM adhered to the provided screenplay in 832/857 (97.1%) instances, and in the remaining instances, its responses were medically appropriate in 24/25 (96.0%) cases. Use of LLMs appears to be feasible for this purpose, and further studies are required to examine their educational impact.
Affiliation(s)
- Haelynn Gim
- Harvard Medical School, Boston, Massachusetts, USA
- Benjamin Cook
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Jasmin Le
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Brandon Stretton
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Christina Gao
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Aashray Gupta
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Joshua Kovoor
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Ballarat Base Hospital, Ballarat, Victoria, Australia
- Christina Guo
- The Alfred, Melbourne, Victoria, Australia
- Johns Hopkins, Baltimore, Maryland, USA
- Matthew Arnold
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Galina Gheihman
- Harvard Medical School, Boston, Massachusetts, USA
- Mass General Brigham, Boston, Massachusetts, USA
- Stephen Bacchi
- Harvard Medical School, Boston, Massachusetts, USA
- Adelaide Medical School, The University of Adelaide, Adelaide, South Australia, Australia
- Massachusetts General Hospital, Boston, Massachusetts, USA
- Flinders University, Adelaide, South Australia, Australia
- Lyell McEwin Hospital, Adelaide, South Australia, Australia
8. Lin C, Kuo CF. Roles and Potential of Large Language Models in Healthcare: A Comprehensive Review. Biomed J 2025:100868. [PMID: 40311872 DOI: 10.1016/j.bj.2025.100868]
Abstract
Large Language Models (LLMs) are capable of transforming healthcare by demonstrating remarkable capabilities in language understanding and generation. They have matched or surpassed human performance in standardized medical examinations and assisted in diagnostics across specialties like dermatology, radiology, and ophthalmology. LLMs can enhance patient education by providing accurate, readable, and empathetic responses, and they can streamline clinical workflows through efficient information extraction from unstructured data such as clinical notes. Integrating LLM into clinical practice involves user interface design, clinician training, and effective collaboration between Artificial Intelligence (AI) systems and healthcare professionals. Users must possess a solid understanding of generative AI and domain knowledge to assess the generated content critically. Ethical considerations to ensure patient privacy, data security, mitigating biases, and maintaining transparency are critical for responsible deployment. Future directions for LLMs in healthcare include interdisciplinary collaboration, developing new benchmarks that incorporate safety and ethical measures, advancing multimodal LLMs that integrate text and imaging data, creating LLM-based medical agents capable of complex decision-making, addressing underrepresented specialties like rare diseases, and integrating LLMs with robotic systems to enhance precision in procedures. Emphasizing patient safety, ethical integrity, and human-centered implementation is essential for maximizing the benefits of LLMs, while mitigating potential risks, thereby helping to ensure that these AI tools enhance rather than replace human expertise and compassion in healthcare.
Affiliation(s)
- Chihung Lin
- Center for Artificial Intelligence in Medicine, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Chang-Fu Kuo
- Center for Artificial Intelligence in Medicine, Chang Gung Memorial Hospital, Taoyuan, Taiwan; Division of Rheumatology, Allergy, and Immunology, Chang Gung Memorial Hospital, Taoyuan, Taiwan; Division of Rheumatology, Orthopaedics and Dermatology, School of Medicine, University of Nottingham, Nottingham, UK.
9. Shan G, Chen X, Wang C, Liu L, Gu Y, Jiang H, Shi T. Comparing Diagnostic Accuracy of Clinical Professionals and Large Language Models: Systematic Review and Meta-Analysis. JMIR Med Inform 2025; 13:e64963. [PMID: 40279517 PMCID: PMC12047852 DOI: 10.2196/64963]
Abstract
Background With the rapid development of artificial intelligence (AI) technology, especially generative AI, large language models (LLMs) have shown great potential in the medical field. Trained on massive amounts of medical data, they can understand complex medical texts, quickly analyze medical records, and directly provide health counseling and diagnostic advice, which is especially valuable for rare diseases. However, no study has yet compared and extensively discussed the diagnostic performance of LLMs with that of physicians. Objective This study systematically reviewed the accuracy of LLMs in clinical diagnosis and provides a reference for further clinical application. Methods We conducted searches in CNKI (China National Knowledge Infrastructure), VIP Database, SinoMed, PubMed, Web of Science, Embase, and CINAHL (Cumulative Index to Nursing and Allied Health Literature) from January 1, 2017, to the present. A total of 2 reviewers independently screened the literature and extracted relevant information. The risk of bias was assessed using the Prediction Model Risk of Bias Assessment Tool (PROBAST), which evaluates both the risk of bias and the applicability of included studies. Results A total of 30 studies involving 19 LLMs and 4762 cases were included. The quality assessment indicated a high risk of bias in the majority of studies, the primary cause being the use of cases with known diagnoses. For the optimal model, the accuracy of the primary diagnosis ranged from 25% to 97.8%, while the triage accuracy ranged from 66.5% to 98%. Conclusions LLMs have demonstrated considerable diagnostic capabilities and significant potential for application across various clinical cases. Although their accuracy still falls short of that of clinical professionals, if used cautiously, they have the potential to become one of the best intelligent assistants in the field of human health care.
Affiliation(s)
- Guxue Shan
- Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Xiaonan Chen
- Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Chen Wang
- Nanjing Drum Tower Hospital Clinical College of Nanjing University of Chinese Medicine, Nanjing, China
- Li Liu
- Jiangsu Province Hospital of Chinese Medicine, Affiliated Hospital of Nanjing University of Chinese Medicine, Nanjing, China
- Yuanjing Gu
- Department of Emergency, Nanjing Drum Tower Hospital, Nanjing, China
- Huiping Jiang
- Department of Nursing, Nanjing Drum Tower Hospital, Nanjing, China
- Tingqi Shi
- Department of Quality Management, Nanjing Drum Tower Hospital, Affiliated Hospital of Medical School, Nanjing University, 321 Zhongshan Road, Gulou District, Nanjing, 210008, China, 86 1-391-299-6998
10. Schubert MC, Soyka S, Wick W, Venkataramani V. Guideline-Incorporated Large Language Model-Driven Evaluation of Medical Records Using MedCheckLLM. JMIR Form Res 2025; 9:e53335. [PMID: 40272831 PMCID: PMC12045122 DOI: 10.2196/53335]
Abstract
The study introduces MedCheckLLM, a large language model-driven framework that enhances medical record evaluation through a guideline-in-the-loop approach by integrating evidence-based guidelines.
Affiliation(s)
- Marc Cicero Schubert
- Department of Neurology, University Hospital Heidelberg, Im Neuenheimer Feld 400, Heidelberg, 69120, Germany, 49 6221548630
- Stella Soyka
- Department of Neurology, University Hospital Heidelberg, Im Neuenheimer Feld 400, Heidelberg, 69120, Germany, 49 6221548630
- Wolfgang Wick
- Department of Neurology, University Hospital Heidelberg, Im Neuenheimer Feld 400, Heidelberg, 69120, Germany, 49 6221548630
- Varun Venkataramani
- Department of Neurology, University Hospital Heidelberg, Im Neuenheimer Feld 400, Heidelberg, 69120, Germany, 49 6221548630
11. Zou Y, Ye R, Gao Y, Zhou J, Li Y, Chen W, Zha F, Wang Y. Comparison of triage performance among DRP tool, ChatGPT, and outpatient rehabilitation doctors. Sci Rep 2025; 15:14084. [PMID: 40269240 PMCID: PMC12019411 DOI: 10.1038/s41598-025-99216-0]
Abstract
With the increasing need for rehabilitation, efficient triage is crucial. This study aims to explore the performance of the distributing rehabilitation patients (DRP) tool, ChatGPT, and outpatient doctors in rehabilitation patient triage, and to compare their strengths and limitations. This is a multicenter cross-sectional study. A total of 300 rehabilitation patients were selected from 27 medical institutions in 15 cities. Patients were assessed by three methods: doctor assessment, ChatGPT, and the DRP tool. Three groups were defined according to these methods: Doctor Group, ChatGPT Group, and Tool Group. Triage outcomes were outpatient rehabilitation and inpatient care at a primary healthcare institution, a secondary hospital, a tertiary hospital, or a nursing home or long-term care institution. The consistency of triage was analyzed. Significant differences were observed between Doctor Group and both ChatGPT Group and Tool Group (P < 0.01; P < 0.01), while no significant difference was observed between ChatGPT Group and Tool Group (P = 0.29). Consistency analysis revealed fair consistency between Doctor Group and both Tool Group and ChatGPT Group. Tool Group and ChatGPT Group showed good consistency. Percentage consistency showed that Tool Group and ChatGPT Group achieved the highest rate of agreement, at 63.67%. This study revealed differences between the DRP tool, ChatGPT, and traditional triage methods. The DRP tool may be more suitable for rehabilitation triage.
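As an editorial aside (not part of the abstract above), the consistency statistics this study reports are commonly computed as percentage agreement and a chance-corrected coefficient such as Cohen's kappa; the sketch below shows one way to do so in Python with scikit-learn, using fabricated triage labels rather than the study's data.

```python
# Minimal sketch: percentage agreement and Cohen's kappa between two triage
# raters. The labels below are fabricated; the study's own data are not shown.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# 0 = outpatient rehabilitation, 1 = primary inpatient, 2 = secondary inpatient,
# 3 = tertiary inpatient, 4 = nursing home / long-term care inpatient
doctor = np.array([0, 1, 2, 3, 0, 2, 4, 1, 0, 3])
tool   = np.array([0, 1, 2, 2, 0, 2, 4, 0, 0, 3])

percent_agreement = float(np.mean(doctor == tool))  # raw proportion of identical decisions
kappa = cohen_kappa_score(doctor, tool)             # agreement corrected for chance

print(f"Percentage agreement: {percent_agreement:.2%}")
print(f"Cohen's kappa:        {kappa:.2f}")
```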
Affiliation(s)
- Yucong Zou
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Ruixue Ye
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Yan Gao
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Jing Zhou
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Yawei Li
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Wenshi Chen
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China
- Fubing Zha
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China.
- Yulong Wang
- Department of Rehabilitation, Shenzhen Second People's Hospital, First Affiliated Hospital of Shenzhen University, 3002 Sungang West Road, Futian District, Shenzhen, 518035, Guangdong Province, China.
12. Grilo A, Marques C, Corte-Real M, Carolino E, Caetano M. Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4. JMIR Cancer 2025; 11:e63677. [PMID: 40239208 PMCID: PMC12017613 DOI: 10.2196/63677]
Abstract
Background Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment. Objective This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4. Methods We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used. Results GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public. Conclusions Both GPT-3.5 and GPT-4 demonstrated having the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.
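As an editorial aside, the cosine similarity score used above to compare GPT-3.5 and GPT-4 answers is a simple vector formula; the sketch below computes it in Python from toy term-frequency vectors, since the abstract does not specify how the responses were vectorized.

```python
# Minimal sketch: cosine similarity between two chatbot answers represented as
# vectors. The term-frequency vectors here are invented for illustration.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||); for non-negative text vectors this
    # ranges from 0 (complete dissimilarity) to 1 (complete similarity).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_gpt35 = np.array([2.0, 1.0, 0.0, 3.0, 1.0])  # toy vector for a GPT-3.5 answer
v_gpt4  = np.array([2.0, 0.0, 1.0, 3.0, 1.0])  # toy vector for a GPT-4 answer
print(f"Cosine similarity: {cosine_similarity(v_gpt35, v_gpt4):.2f}")
```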
Affiliation(s)
- Ana Grilo
- Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon to CICPSI, Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal, 351 964371101
- Catarina Marques
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Maria Corte-Real
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Elisabete Carolino
- Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon to CICPSI, Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal, 351 964371101
- Marco Caetano
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
13. Salmanpour F, Akpınar M. Performance of Chat Generative Pretrained Transformer-4.0 in determining labiolingual localization of maxillary impacted canine and presence of resorption in incisors through panoramic radiographs: A retrospective study. Am J Orthod Dentofacial Orthop 2025:S0889-5406(25)00104-0. [PMID: 40208160 DOI: 10.1016/j.ajodo.2025.02.017]
Abstract
INTRODUCTION This study evaluates the diagnostic accuracy of Chat Generative Pretrained Transformer (ChatGPT)-4.0 in determining the labiolingual position of impacted maxillary canines and identifying resorptive changes in adjacent incisors using panoramic radiographs (PRs). METHODS A retrospective analysis was conducted on 105 patients with unilaterally impacted maxillary canine, including 25 patients with root resorption in adjacent incisors. To ensure accurate classification, PRs and cone-beam computed tomography images were independently evaluated by 3 orthodontists, serving as the reference standard for assessing ChatGPT-4.0's performance. Patients were categorized into 3 groups based on canine position: palatal (n = 49), midalveolar (n = 26), and labial (n = 30). For resorption evaluation, a balanced subset of 50 PRs was selected to maintain equal group sizes. Group 1 (with resorption, n = 25) included all available patients with resorption, whereas group 2 (without resorption, n = 25) was randomly selected from 80 patients without resorption. ChatGPT-4.0 analyzed the PRs to determine the labiolingual position of impacted canines and detect resorption in adjacent incisors. The results were recorded by the first researcher. ChatGPT-4.0's performance was evaluated using accuracy, precision, recall, and F1 score for both canine localization and resorption detection. RESULTS The model achieved an overall accuracy of 37.1% in canine localization, with the highest sensitivity (61.2%) and precision (48.4%) observed in palatal patients. However, its performance was considerably lower for midalveolar and labial positions. In detecting resorption, the model achieved an accuracy of 46.0%, performing better in identifying the absence of resorption compared with its presence. CONCLUSIONS ChatGPT-4.0 demonstrated insufficient accuracy in determining the labiolingual position of impacted maxillary canines and detecting resorptive changes based on PRs, indicating its unsuitability for clinical applications.
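As an editorial aside, the accuracy, precision, recall, and F1 scores reported above are standard multi-class metrics; the sketch below shows how they might be computed against an expert reference standard in Python with scikit-learn, using fabricated labels rather than the study's radiographs.

```python
# Minimal sketch: grading predicted canine positions against an expert reference
# standard. Labels are fabricated; they do not reproduce the study's data.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

classes = ["palatal", "midalveolar", "labial"]
reference  = ["palatal", "palatal", "midalveolar", "labial", "palatal", "labial"]   # expert consensus
prediction = ["palatal", "labial",  "midalveolar", "palatal", "palatal", "labial"]  # model output

accuracy = accuracy_score(reference, prediction)
precision, recall, f1, _ = precision_recall_fscore_support(
    reference, prediction, labels=classes, zero_division=0
)

print(f"Overall accuracy: {accuracy:.1%}")
for cls, p, r, f in zip(classes, precision, recall, f1):
    print(f"{cls:>12}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```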
Affiliation(s)
- Farhad Salmanpour
- Department of Orthodontics, Afyonkarahisar Health Sciences University, Afyonkarahisar, Türkiye.
- Meryem Akpınar
- Department of Orthodontics, Afyonkarahisar Health Sciences University, Afyonkarahisar, Türkiye
14. Abanmy NO, Al-Ghreimil N, Alsabhan JF, Al-Baity H, Aljadeed R. Evaluating the accuracy of ChatGPT in delivering patient instructions for medications: an exploratory case study. Front Artif Intell 2025; 8:1550591. [PMID: 40235859 PMCID: PMC11996888 DOI: 10.3389/frai.2025.1550591]
Abstract
Background The use of ChatGPT in healthcare is still in its early stages; however, it has the potential to become a cornerstone in modern healthcare systems. This study aims to assess the accuracy of ChatGPT's output compared with that of CareNotes® in providing patient instructions for three medications: tirzepatide, citalopram, and apixaban. Methods An exploratory case study was conducted using a published questionnaire to evaluate ChatGPT-generated reports against patient instructions from CareNotes®. The evaluation focused on the completeness and correctness of the reports, as well as their potential to cause harm or lead to poor medication adherence. The evaluation was conducted by four pharmacy experts and 33 PharmD interns. Results The evaluators indicated that the ChatGPT reports for tirzepatide, citalopram, and apixaban were correct but lacked completeness. Additionally, ChatGPT reports have the potential to cause harm and may negatively affect medication adherence. Conclusion Although ChatGPT demonstrated promising results, particularly in terms of correctness, it cannot yet be considered a reliable standalone source of patient drug information.
Affiliation(s)
- Norah Othman Abanmy
- Department of Clinical Pharmacy, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
- Nadia Al-Ghreimil
- Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Jawza F. Alsabhan
- Department of Clinical Pharmacy, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
- Heyam Al-Baity
- Information Technology Department, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
- Rana Aljadeed
- Department of Clinical Pharmacy, College of Pharmacy, King Saud University, Riyadh, Saudi Arabia
15. Cheshire WP, Sandroni P, Shouman K, Cutsforth-Gregory JK, Coon EA, Benarroch EE, Singer W, Low PA. Accuracy of chat-based artificial intelligence for patient education on orthostatic hypotension. Clin Auton Res 2025. [PMID: 40167938 DOI: 10.1007/s10286-025-01125-9]
Affiliation(s)
- W P Cheshire
- Division of Autonomic Neurology, Department of Neurology, Mayo Clinic, 4500 San Pablo Rd., Jacksonville, FL, 32224, USA.
- P Sandroni
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
- K Shouman
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
- E A Coon
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
- E E Benarroch
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
- W Singer
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
- P A Low
- Department of Neurology, Mayo Clinic, Rochester, MN, 55905, USA
16. Dihan QA, Brown AD, Chauhan MZ, Alzein AF, Abdelnaem SE, Kelso SD, Rahal DA, Park R, Ashraf M, Azzam A, Morsi M, Warner DB, Sallam AB, Saeed HN, Elhusseiny AM. Leveraging large language models to improve patient education on dry eye disease. Eye (Lond) 2025; 39:1115-1122. [PMID: 39681711 PMCID: PMC11978745 DOI: 10.1038/s41433-024-03476-5]
Abstract
BACKGROUND/OBJECTIVES Dry eye disease (DED) is an exceedingly common diagnosis in patients, yet recent analyses have demonstrated patient education materials (PEMs) on DED to be of low quality and readability. Our study evaluated the utility and performance of three large language models (LLMs) in enhancing and generating new PEMs on DED. SUBJECTS/METHODS We evaluated PEMs generated by ChatGPT-3.5, ChatGPT-4, and Gemini Advanced using three separate prompts. Prompts A and B requested that they generate PEMs on DED, with Prompt B specifying a 6th-grade reading level, using the SMOG (Simple Measure of Gobbledygook) readability formula. Prompt C asked for a rewrite of existing PEMs at a 6th-grade reading level. Each PEM was assessed on readability (SMOG; FKGL: Flesch-Kincaid Grade Level), quality (PEMAT: Patient Education Materials Assessment Tool; DISCERN), and accuracy (Likert misinformation scale). RESULTS All LLM-generated PEMs in response to Prompts A and B were of high quality (median DISCERN = 4), understandable (PEMAT understandability ≥70%), and accurate (Likert score = 1). LLM-generated PEMs were not actionable (PEMAT actionability <70%). ChatGPT-4 and Gemini Advanced rewrote existing PEMs (Prompt C) from a baseline readability level (FKGL: 8.0 ± 2.4; SMOG: 7.9 ± 1.7) to the targeted 6th-grade reading level; rewrites contained little to no misinformation (median Likert misinformation score = 1, range: 1-2). However, only ChatGPT-4 rewrote PEMs while maintaining high quality and reliability (median DISCERN = 4). CONCLUSION LLMs (notably ChatGPT-4) were able to generate and rewrite PEMs on DED that were readable, accurate, and of high quality. Our study underscores the value of leveraging LLMs as supplementary tools for improving PEMs.
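As an editorial aside, the SMOG and Flesch-Kincaid Grade Level scores used above are closed-form readability formulas; the sketch below applies them to pre-computed word, sentence, and syllable counts in Python. The counts are invented, and real use would require a parser to count syllables.

```python
# Minimal sketch: SMOG and Flesch-Kincaid Grade Level from pre-computed counts.
# The counts below are invented for illustration.
import math

def smog(polysyllable_count: int, sentence_count: int) -> float:
    # SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    return 1.0430 * math.sqrt(polysyllable_count * 30 / sentence_count) + 3.1291

def fkgl(words: int, sentences: int, syllables: int) -> float:
    # FKGL = 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

print(f"SMOG grade: {smog(polysyllable_count=12, sentence_count=30):.1f}")
print(f"FKGL:       {fkgl(words=400, sentences=30, syllables=560):.1f}")
```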
Affiliation(s)
- Qais A Dihan
- Chicago Medical School, Rosalind Franklin University of Medicine and Science, North Chicago, IL, USA
- Department of Ophthalmology, Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Andrew D Brown
- UAMS College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Muhammad Z Chauhan
- UAMS College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Ahmad F Alzein
- College of Medicine, University of Illinois at Chicago, Chicago, IL, USA
- Seif E Abdelnaem
- College of Natural Sciences and Mathematics, University of Central Arkansas, Conway, AR, USA
- Sean D Kelso
- Burnett School of Medicine, Texas Christian University, Fort Worth, TX, USA
- Dania A Rahal
- UAMS College of Medicine, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Royce Park
- Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, USA
- Mohammadali Ashraf
- Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, USA
- Amr Azzam
- Department of Ophthalmology, Kasr Al-Ainy Hospitals, Cairo University, Cairo, Egypt
- Mahmoud Morsi
- Department of Anesthesia and Pain Management, Kasr Al-Ainy Hospitals, Cairo University, Cairo, Egypt
- Department of Anesthesiology and Pain Management, John H. Stroger, Jr. Hospital of Cook County, Chicago, IL, USA
- David B Warner
- Department of Ophthalmology, Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Ahmed B Sallam
- Department of Ophthalmology, Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, AR, USA
- Department of Ophthalmology, Faculty of Medicine, Ain Shams University, Cairo, Egypt
- Hajirah N Saeed
- Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, IL, USA.
- Department of Ophthalmology, Loyola University Medical Center, Maywood, IL, USA.
- Abdelrahman M Elhusseiny
- Department of Ophthalmology, Harvey and Bernice Jones Eye Institute, University of Arkansas for Medical Sciences, Little Rock, AR, USA.
- Department of Ophthalmology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA.
17. Wan Z, Guo Y, Bao S, Wang Q, Malin BA. Evaluating Sex and Age Biases in Multimodal Large Language Models for Skin Disease Identification from Dermatoscopic Images. Health Data Sci 2025; 5:0256. [PMID: 40170800 PMCID: PMC11961048 DOI: 10.34133/hds.0256]
Abstract
Background: Multimodal large language models (LLMs) have shown potential in various health-related fields. However, many healthcare studies have raised concerns about the reliability and biases of LLMs in healthcare applications. Methods: To explore the practical application of multimodal LLMs in skin disease identification, and to evaluate sex and age biases, we tested the performance of 2 popular multimodal LLMs, ChatGPT-4 and LLaVA-1.6, across diverse sex and age groups using a subset of a large dermatoscopic dataset containing around 10,000 images and 3 skin diseases (melanoma, melanocytic nevi, and benign keratosis-like lesions). Results: In comparison to 3 deep learning models based on convolutional neural networks (CNNs) (VGG16, ResNet50, and Model Derm) and one vision transformer model (Swin-B), we found that ChatGPT-4 and LLaVA-1.6 demonstrated overall accuracies that were 3% and 23% higher (and F1-scores that were 4% and 34% higher), respectively, than the best-performing CNN-based baseline, while maintaining accuracies that were 38% and 26% lower (and F1-scores that were 38% and 19% lower), respectively, than Swin-B. Meanwhile, ChatGPT-4 is generally unbiased in identifying these skin diseases across sex and age groups, while LLaVA-1.6 is generally unbiased across age groups, in contrast to Swin-B, which is biased in identifying melanocytic nevi. Conclusions: This study suggests the usefulness and fairness of LLMs in dermatological applications, aiding physicians and practitioners with diagnostic recommendations and patient screening. To further verify and evaluate the reliability and fairness of LLMs in healthcare, experiments using larger and more diverse datasets need to be performed in the future.
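As an editorial aside, one common way to probe the sex and age biases examined above is to stratify accuracy and F1 by subgroup; the sketch below illustrates this in Python with pandas and scikit-learn on fabricated records, not the dermatoscopic data used in the study.

```python
# Minimal sketch: stratifying classification performance by subgroup to look for
# bias. The records are fabricated and much smaller than any real evaluation set.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

records = pd.DataFrame({
    "sex":        ["female", "male", "female", "male", "female", "male"],
    "true_label": ["melanoma", "nevus", "keratosis", "melanoma", "nevus", "nevus"],
    "pred_label": ["melanoma", "nevus", "nevus",     "melanoma", "nevus", "keratosis"],
})

for sex, group in records.groupby("sex"):
    acc = accuracy_score(group["true_label"], group["pred_label"])
    macro_f1 = f1_score(group["true_label"], group["pred_label"],
                        average="macro", zero_division=0)
    print(f"{sex}: accuracy={acc:.2f} macro-F1={macro_f1:.2f}")
```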
Affiliation(s)
- Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
- Yuhang Guo
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
- Shunxing Bao
- Department of Electrical and Computer Engineering, Vanderbilt University, Nashville, TN, USA
- Qian Wang
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
- Bradley A. Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
18. Piras A, Mastroleo F, Colciago RR, Morelli I, D'Aviero A, Longo S, Grassi R, Iorio GC, De Felice F, Boldrini L, Desideri I, Salvestrini V. How Italian radiation oncologists use ChatGPT: a survey by the young group of the Italian association of radiotherapy and clinical oncology (yAIRO). Radiol Med 2025; 130:453-462. [PMID: 39690359 DOI: 10.1007/s11547-024-01945-1]
Abstract
PURPOSE To investigate the awareness and the spread of ChatGPT and its possible role in both scientific research and clinical practice among young radiation oncologists (ROs). MATERIAL AND METHODS An anonymous online survey via Google Forms (including 24 questions) was distributed among young (< 40 years old) ROs in Italy through the yAIRO network from March 15 to March 31, 2024. These ROs were officially registered with yAIRO in 2023. We particularly focused on the emerging use of ChatGPT and its future perspectives in clinical practice. RESULTS A total of 76 young physicians answered the survey. Seventy-three participants declared that they were familiar with ChatGPT, and 71.1% of the surveyed physicians had already used ChatGPT. Thirty-one (40.8%) participants strongly agreed that AI has the potential to change the medical landscape in the future. Additionally, 79.1% of respondents agreed that AI will be mainly successful in research processes such as literature review and drafting articles/protocols. This belief in ChatGPT's potential translated into direct use in daily practice in 43.4% of cases, mostly with a fair grade of satisfaction (43.2%). A large proportion of participants (69.7%) believe in the implementation of ChatGPT into clinical practice, even though 53.9% fear an overall negative impact. CONCLUSIONS The results of the present survey clearly highlight the attitude of young Italian ROs toward the implementation of ChatGPT into clinical and academic RO practice. ChatGPT is considered a valuable and effective tool that can ease current and future workflows.
Collapse
Affiliation(s)
- Antonio Piras
- UO Radioterapia Oncologica, Villa Santa Teresa, 90011, Bagheria, Palermo, Italy
- Ri.Med Foundation, 90133, Palermo, Italy
- Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties, Molecular and Clinical Medicine, University of Palermo, 90127, Palermo, Italy
- Radiation Oncology, Mater Olbia Hospital, Olbia, Sassari, Italy
| | - Federico Mastroleo
- Division of Radiation Oncology, IEO European Institute of Oncology IRCCS, 20141, Milan, Italy
- Department of Oncology and Hemato-Oncology, University of Milan, 20141, Milan, Italy
| | - Riccardo Ray Colciago
- School of Medicine and Surgery, University of Milano Bicocca, Piazza Dell'Ateneo Nuovo, 1, 20126, Milan, Italy.
| | - Ilaria Morelli
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| | - Andrea D'Aviero
- Department of Radiation Oncology, "S.S Annunziata" Chieti Hospital, Chieti, Italy
- Department of Medical, Oral and Biotechnological Sciences, "G.D'Annunzio" University of Chieti, Chieti, Italy
| | - Silvia Longo
- UOC Radioterapia Oncologica, Fondazione Policlinico Universitario "A. Gemelli" IRCCS, Rome, Italy
| | - Roberta Grassi
- Department of Precision Medicine, University of Campania "L. Vanvitelli", Naples, Italy
| | | | - Francesca De Felice
- Radiation Oncology, Policlinico Umberto I, Department of Radiological, Oncological and Pathological Sciences, "Sapienza" University of Rome, Rome, Italy
| | - Luca Boldrini
- UOC Radioterapia Oncologica, Fondazione Policlinico Universitario "A. Gemelli" IRCCS, Rome, Italy
- Università Cattolica del Sacro Cuore, Rome, Italy
| | - Isacco Desideri
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| | - Viola Salvestrini
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| |
Collapse
|
19
|
Strale F, Riddle I, Geng B, Oxford B, Kah M, Sherwin R. Confirming SPSS Results With ChatGPT-4 and o3-mini Models. Cureus 2025; 17:e82005. [PMID: 40351918 PMCID: PMC12065437 DOI: 10.7759/cureus.82005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2025] [Accepted: 04/09/2025] [Indexed: 05/14/2025] Open
Abstract
Background This research compared simple and advanced statistical results from SPSS (IBM Corp., Armonk, NY, USA) with those from ChatGPT-4 and ChatGPT o3-mini (OpenAI, San Francisco, CA, USA) for statistical data output and interpretation of behavioral healthcare data. It evaluated their methodological approaches, quantitative performance, interpretability, adaptability, ethical considerations, and future trends. Methods Fourteen statistical analyses were conducted on two real datasets that produced peer-reviewed, published scientific articles in 2024. Descriptive statistics, Pearson r, multiple correlation with Pearson r, Spearman's rho, simple linear regression, one-sample t-test, paired t-test, two-independent sample t-test, multiple linear regression, one-way analysis of variance (ANOVA), repeated measures ANOVA, two-way (factorial) ANOVA, and multivariate ANOVA were computed. The two datasets covered systematically structured timeframes (March 19, 2023, through June 11, 2023, and June 7, 2023, through July 7, 2023), thereby ensuring the integrity and temporal representativeness of the data gathering. The analyses were conducted by inputting the verbal (text) commands into ChatGPT-4 and ChatGPT o3-mini along with the relevant SPSS variables, which were copied and pasted from the SPSS datasets. Results The study found high concordance between SPSS and ChatGPT-4 in fundamental statistical analyses, such as measures of central tendency, variability, and simple Pearson and Spearman correlation analyses, where the results were nearly identical. ChatGPT-4 also closely matched SPSS in the three t-tests and simple linear regression, with minimal effect size variations. Discrepancies emerged in complex analyses. ChatGPT o3-mini showed inflated correlation values and significant results where none were expected, indicating computational deviations. ChatGPT o3-mini produced inflated coefficients in the multiple correlation and R-squared values in two-way ANOVA and multiple regression, suggesting differing assumptions. ChatGPT-4 and ChatGPT o3-mini produced identical F-statistics with repeated measures ANOVA but reported incorrect degrees of freedom (df) values. While ChatGPT-4 performed well in one-way ANOVA, it miscalculated degrees of freedom in multivariate ANOVA (MANOVA), leading to significant discrepancies. ChatGPT o3-mini also generated erroneous F-statistics in factorial ANOVA, highlighting the need for further optimization in multivariate statistical modeling. Conclusions This study underscored the rapid advancements in artificial intelligence (AI)-driven statistical analyses while highlighting areas that require further refinement. ChatGPT-4 accurately executed fundamental statistical tests, closely matching SPSS. However, its reliability diminished in more advanced statistical procedures, requiring further validation. ChatGPT o3-mini, while optimized for Science, Technology, Engineering, and Mathematics (STEM) applications, produced inconsistencies in correlation and multivariate analyses, limiting its dependability for complex research applications. Ensuring its alignment with established statistical methodologies will be essential for widespread scientific research adoption as AI evolves.
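For readers who want to cross-check LLM-reported statistics locally, a minimal Python sketch of a few of the tests listed above follows; the data are randomly generated placeholders, not the study's datasets.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    pre = rng.normal(50, 10, 30)            # hypothetical pre-treatment scores
    post = pre + rng.normal(2, 5, 30)       # hypothetical post-treatment scores
    group = np.repeat(["A", "B", "C"], 10)  # hypothetical grouping factor
    outcome = rng.normal(0, 1, 30) + (group == "B") * 0.5

    r, p_r = stats.pearsonr(pre, post)      # Pearson correlation
    t, p_t = stats.ttest_rel(pre, post)     # paired t-test
    f, p_f = stats.f_oneway(outcome[group == "A"],
                            outcome[group == "B"],
                            outcome[group == "C"])  # one-way ANOVA

    print(f"Pearson r={r:.3f} (p={p_r:.3f}); paired t={t:.3f} (p={p_t:.3f}); ANOVA F={f:.3f} (p={p_f:.3f})")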
Collapse
Affiliation(s)
| | - Isaac Riddle
- Information Technology, The Oxford Center, Brighton, USA
| | - Bowen Geng
- Information Technology, The Oxford Center, Brighton, USA
| | - Blake Oxford
- Information Technology, The Oxford Center, Brighton, USA
| | - Malia Kah
- Research, The Oxford Center, Brighton, USA
| | - Robert Sherwin
- Hyperbaric Oxygen Therapy, Wayne State University School of Medicine, Detroit, USA
| |
Collapse
|
20
|
van Lent LGG, Yilmaz NG, Goosen S, Burgers J, Giani S, Schouten BC, Langendam MW. Effectiveness of interpreters and other strategies for mitigating language barriers: A systematic review. PATIENT EDUCATION AND COUNSELING 2025; 136:108767. [PMID: 40179546 DOI: 10.1016/j.pec.2025.108767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/22/2025] [Revised: 03/14/2025] [Accepted: 03/26/2025] [Indexed: 04/05/2025]
Abstract
OBJECTIVE To examine the effectiveness of different communication strategies for mitigating language barriers on patient-, provider- and context-related outcomes. METHODS A systematic search was conducted in nine databases for quantitative studies published from 2013 onward that compared different strategies. The studies' quality was assessed with the Evidence Project Risk of Bias tool and the certainty of evidence with the GRADE approach. RESULTS Twenty-six articles were included, all about healthcare settings. Generally, having a shared language (e.g., a provider who speaks the patient's native language), followed by using professional interpreters, yielded the most positive outcomes, with in-person or video interpreters performing better than telephone interpreters. Compared to professional interpreters, the translation quality of informal interpreters was only similar when assessing patient outcomes after surgery, and the quality of digital translation tools was only sufficient with simple messages or when messages were pre-translated. CONCLUSION Having a provider who speaks the patient's native language and using professional interpreters outperform other strategies for mitigating language barriers in healthcare. However, other strategies may suffice in specific situations. Future research should explore the effectiveness of (combining) strategies, especially in social care. PRACTICE IMPLICATIONS This review can inform policy and help develop guidelines on mitigating language barriers to support providers in their daily practice.
Collapse
Affiliation(s)
- Liza G G van Lent
- Department of Communication Science, Amsterdam School for Communication Research (ASCoR), University of Amsterdam, Amsterdam, the Netherlands.
| | - Nida Gizem Yilmaz
- Department of Communication Science, Amsterdam School for Communication Research (ASCoR), University of Amsterdam, Amsterdam, the Netherlands
| | - Simone Goosen
- Netherlands Patients Federation, Utrecht, the Netherlands
| | - Jako Burgers
- Maastricht University, Department of General Practice, Care and Public Health Research Institute (CAPHRI), Maastricht, the Netherlands
| | - Stefano Giani
- University Library, University of Amsterdam, Amsterdam, the Netherlands
| | - Barbara C Schouten
- Department of Communication Science, Amsterdam School for Communication Research (ASCoR), University of Amsterdam, Amsterdam, the Netherlands
| | - Miranda W Langendam
- Department of Epidemiology and Data Science, Amsterdam University Medical Center, Amsterdam, the Netherlands; Amsterdam Public Health Research Institute, Methodology, Amsterdam, the Netherlands
| |
Collapse
|
21
|
Asiksoy G. Nurses' assessment of artificial intelligence chatbots for health literacy education. JOURNAL OF EDUCATION AND HEALTH PROMOTION 2025; 14:128. [PMID: 40271238 PMCID: PMC12017437 DOI: 10.4103/jehp.jehp_1195_24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2024] [Accepted: 09/24/2024] [Indexed: 04/25/2025]
Abstract
BACKGROUND Artificial intelligence (AI)-powered chatbots are emerging as a new tool in healthcare, offering the potential to provide patients with information and support. Despite their growing presence, there are concerns regarding the medical reliability of the information they provide and the potential risks to patient safety. MATERIAL AND METHODS The aim of this study is to assess the medical reliability of responses to health-related questions provided by an AI-powered chatbot and to evaluate the risks to patient safety. The study is designed using a mixed-methods phenomenology approach. The participants are 44 nurses working at a private hospital in Cyprus. Data collection was conducted via survey forms and focus group discussions. Quantitative data were analyzed using descriptive statistics, while qualitative data were examined using content analysis. RESULTS The results indicate that according to the nurses' evaluations, the medical reliability of the AI chatbot's responses is generally high. However, instances of incorrect or incomplete information were also noted. Specifically, the quantitative analysis showed that a majority of the nurses found the chatbot's responses to be accurate and useful. The qualitative analysis revealed concerns about the potential for the chatbot to misdirect patients or contribute to diagnostic errors. These risks highlight the importance of monitoring and improving the AI systems to minimize errors and enhance reliability. CONCLUSION AI chatbots can provide valuable information and support to patients, improving accessibility and engagement in healthcare. However, concerns about medical reliability and patient safety remain. Continuous evaluation and improvement of these systems are necessary, alongside efforts to enhance patients' health literacy to help them accurately assess information from AI chatbots.
Collapse
Affiliation(s)
- Gulsum Asiksoy
- Department of Education and Instructional Technology, Atatürk Faculty of Education, North Cyprus via Mersin 10, Turkey
| |
Collapse
|
22
|
Yang H, Li J, Zhang C, Sierra AP, Shen B. Large Language Model-Driven Knowledge Graph Construction in Sepsis Care Using Multicenter Clinical Databases: Development and Usability Study. J Med Internet Res 2025; 27:e65537. [PMID: 40146985 PMCID: PMC11986385 DOI: 10.2196/65537] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2024] [Revised: 10/28/2024] [Accepted: 02/18/2025] [Indexed: 03/29/2025] Open
Abstract
BACKGROUND Sepsis is a complex, life-threatening condition characterized by significant heterogeneity and vast amounts of unstructured data, posing substantial challenges for traditional knowledge graph construction methods. The integration of large language models (LLMs) with real-world data offers a promising avenue to address these challenges and enhance the understanding and management of sepsis. OBJECTIVE This study aims to develop a comprehensive sepsis knowledge graph by leveraging the capabilities of LLMs, specifically GPT-4.0, in conjunction with multicenter clinical databases. The goal is to improve the understanding of sepsis and provide actionable insights for clinical decision-making. We also established a multicenter sepsis database (MSD) to support this effort. METHODS We collected clinical guidelines, public databases, and real-world data from 3 major hospitals in Western China, encompassing 10,544 patients diagnosed with sepsis. Using GPT-4.0, we applied advanced prompt engineering techniques for entity recognition and relationship extraction, which facilitated the construction of a nuanced sepsis knowledge graph. RESULTS We established a sepsis database with 10,544 patient records, including 8497 from West China Hospital, 690 from Shangjin Hospital, and 357 from Tianfu Hospital. The sepsis knowledge graph comprises 1894 nodes and 2021 distinct relationships, encompassing nine entity concepts (diseases, symptoms, biomarkers, imaging examinations, etc) and eight semantic relationships (complications, recommended medications, laboratory tests, etc). GPT-4.0 demonstrated superior performance in entity recognition and relationship extraction, achieving an F1-score of 76.76 on a sepsis-specific dataset, outperforming other models such as Qwen2 (43.77) and Llama3 (48.39). On the CMeEE dataset, GPT-4.0 achieved an F1-score of 65.42 using few-shot learning, surpassing traditional models such as BERT-CRF (62.11) and Med-BERT (60.66). Building upon this, we compiled a comprehensive sepsis knowledge graph comprising 1894 nodes and 2021 distinct relationships. CONCLUSIONS This study represents a pioneering effort in using LLMs, particularly GPT-4.0, to construct a comprehensive sepsis knowledge graph. The innovative application of prompt engineering, combined with the integration of multicenter real-world data, has significantly enhanced the efficiency and accuracy of knowledge graph construction. The resulting knowledge graph provides a robust framework for understanding sepsis, supporting clinical decision-making, and facilitating further research. The success of this approach underscores the potential of LLMs in medical research and sets a new benchmark for future studies in sepsis and other complex medical conditions.
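To illustrate the final assembly step of such a graph (in the study, the entity and relation extraction itself is performed by the LLM via prompts), a minimal Python sketch with hypothetical triples follows; the entity names and relation labels are illustrative only, not the study's schema.

    import networkx as nx

    # Hypothetical (head, relation, tail) triples of the kind an LLM might return after prompting
    triples = [
        ("sepsis", "recommended_medication", "broad-spectrum antibiotics"),
        ("sepsis", "laboratory_test", "blood lactate"),
        ("sepsis", "complication", "septic shock"),
        ("septic shock", "laboratory_test", "blood culture"),
    ]

    kg = nx.MultiDiGraph()
    for head, relation, tail in triples:
        kg.add_edge(head, tail, relation=relation)  # edge attribute stores the semantic relationship

    print(kg.number_of_nodes(), "nodes,", kg.number_of_edges(), "relationships")
    print(list(kg.edges(data=True)))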
Collapse
Affiliation(s)
- Hao Yang
- Department of Critical Care Medicine, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, Institutes for Systems Genetics, Sichuan University, West China Hospital, Chengdu, China
- Information Center, Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital, Sichuan University, Chengdu, China
- Department of Computer Science and Information Technologies, Iberian Society of Telehealth and Telemedicine, University of A Coruña, A Coruña, Spain
| | - Jiaxi Li
- Department of Clinical Laboratory Medicine, Jinniu Maternity and Child Health Hospital of Chengdu, Chengdu, China
| | - Chi Zhang
- Department of Critical Care Medicine, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, Institutes for Systems Genetics, Sichuan University, West China Hospital, Chengdu, China
| | - Alejandro Pazos Sierra
- Department of Computer Science and Information Technologies, Iberian Society of Telehealth and Telemedicine, Research Center for Information and Communications Technologies, Biomedical Research Institute of A Coruña, University of A Coruña, A Coruña, Spain
| | - Bairong Shen
- Department of Critical Care Medicine, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, Institutes for Systems Genetics, Sichuan University, West China Hospital, Chengdu, China
| |
Collapse
|
23
|
Wang J, Shue K, Liu L, Hu G. Preliminary evaluation of ChatGPT model iterations in emergency department diagnostics. Sci Rep 2025; 15:10426. [PMID: 40140500 PMCID: PMC11947261 DOI: 10.1038/s41598-025-95233-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2024] [Accepted: 03/19/2025] [Indexed: 03/28/2025] Open
Abstract
Large language model chatbots such as ChatGPT have shown potential in assisting health professionals in emergency departments (EDs). However, the diagnostic accuracy of newer ChatGPT models remains unclear. This retrospective study evaluated the diagnostic performance of various ChatGPT models-including GPT-3.5, GPT-4, GPT-4o, and o1 series-in predicting diagnoses for ED patients (n = 30) and examined the impact of explicitly invoking reasoning (thoughts). Earlier models, such as GPT-3.5, demonstrated high accuracy for top-three differential diagnoses (80.0% accuracy) but underperformed in identifying leading diagnoses (47.8%) compared to newer models such as chatgpt-4o-latest (60%, p < 0.01) and o1-preview (60%, p < 0.01). Asking for thoughts to be provided significantly enhanced performance in predicting the leading diagnosis for 4o models such as 4o-2024-0513 (from 45.6 to 56.7%; p = 0.03) and 4o-mini-2024-07-18 (from 54.4 to 60.0%; p = 0.04) but had minimal impact on o1-mini and o1-preview. In challenging cases, such as pneumonia without fever, all models generally failed to predict the correct diagnosis, indicating that atypical presentations are a major limitation for ED application of current ChatGPT models.
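A minimal Python sketch of the kind of query-and-score loop such a study implies is shown below; it assumes the openai client library (v1+) with an API key in the environment, and the model name, prompt wording, and scoring rule are illustrative assumptions rather than the study's exact protocol.

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def ask_differentials(case_text: str, ask_for_thoughts: bool) -> str:
        prompt = "List the three most likely diagnoses for this emergency department case:\n" + case_text
        if ask_for_thoughts:
            prompt += "\nFirst explain your reasoning, then give the ranked list."
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name; the study compared several GPT versions
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    def hit_at_k(predicted: list[str], true_dx: str, k: int = 3) -> bool:
        # After parsing the model's ranked list into strings, check top-k agreement with the reference diagnosis
        return any(true_dx.lower() in p.lower() for p in predicted[:k])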
Collapse
Affiliation(s)
- Jinge Wang
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
| | - Kenneth Shue
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA
| | - Li Liu
- College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA
- Biodesign Institute, Arizona State University, Tempe, AZ, 85281, USA
| | - Gangqing Hu
- Department of Microbiology, Immunology & Cell Biology, West Virginia University, Morgantown, WV, 26506, USA.
| |
Collapse
|
24
|
Akrimi S, Schwensfeier L, Düking P, Kreutz T, Brinkmann C. ChatGPT-4o-Generated Exercise Plans for Patients with Type 2 Diabetes Mellitus-Assessment of Their Safety and Other Quality Criteria by Coaching Experts. Sports (Basel) 2025; 13:92. [PMID: 40278718 PMCID: PMC12031090 DOI: 10.3390/sports13040092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2025] [Revised: 03/04/2025] [Accepted: 03/17/2025] [Indexed: 04/26/2025] Open
Abstract
In this discussion paper based on preliminary data, the safety and other quality criteria of ChatGPT-4o-generated exercise plans for patients with type 2 diabetes mellitus (T2DM) are evaluated. The study team created three fictional patient profiles varying in sex, age, body mass index, secondary diseases/complications, medication, self-rated physical fitness, weekly exercise routine and personal exercise preferences. Three distinct prompts were used to generate three exercise plans for each fictional patient. While Prompt 1 was very simple, Prompt 2 and Prompt 3 included more detailed requests. Prompt 3 was optimized by ChatGPT itself. Three coaching experts reviewed the exercise plans for safety and other quality criteria and discussed their evaluations. Some of the exercise plans showed serious safety issues, especially for patients with secondary diseases/complications. While most exercise plans incorporated key training principles, they showed some deficits, e.g., insufficient feasibility. The use of more detailed prompts (Prompt 2 and Prompt 3) tended to result in more elaborate exercise plans with better ratings. ChatGPT-4o-generated exercise plans may have safety issues for patients with T2DM, indicating the need to consult a professional coach for feedback before starting a training program.
Collapse
Affiliation(s)
- Samir Akrimi
- Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany; (S.A.); (L.S.)
| | - Leon Schwensfeier
- Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany; (S.A.); (L.S.)
| | - Peter Düking
- Department of Sports Science and Movement Pedagogy, TU Braunschweig, 38106 Braunschweig, Germany;
| | - Thorsten Kreutz
- Department of Fitness & Health, IST University of Applied Sciences, 40233 Düsseldorf, Germany;
| | - Christian Brinkmann
- Department of Preventive and Rehabilitative Sport Medicine, Institute of Cardiovascular Research and Sport Medicine, German Sport University Cologne, 50933 Cologne, Germany; (S.A.); (L.S.)
- Department of Fitness & Health, IST University of Applied Sciences, 40233 Düsseldorf, Germany;
| |
Collapse
|
25
|
Lo Bianco G, Robinson CL, D’Angelo FP, Cascella M, Natoli S, Sinagra E, Mercadante S, Drago F. Effectiveness of Generative Artificial Intelligence-Driven Responses to Patient Concerns in Long-Term Opioid Therapy: Cross-Model Assessment. Biomedicines 2025; 13:636. [PMID: 40149612 PMCID: PMC11940240 DOI: 10.3390/biomedicines13030636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2025] [Revised: 02/28/2025] [Accepted: 03/04/2025] [Indexed: 03/29/2025] Open
Abstract
Background: While long-term opioid therapy is a widely utilized strategy for managing chronic pain, many patients have understandable questions and concerns regarding its safety, efficacy, and potential for dependency and addiction. Providing clear, accurate, and reliable information is essential for fostering patient understanding and acceptance. Generative artificial intelligence (AI) applications offer interesting avenues for delivering patient education in healthcare. This study evaluates the reliability, accuracy, and comprehensibility of ChatGPT's responses to common patient inquiries about long-term opioid therapy. Methods: An expert panel selected thirteen frequently asked questions regarding long-term opioid therapy based on the authors' clinical experience in managing chronic pain patients and a targeted review of patient education materials. Questions were prioritized based on prevalence in patient consultations, relevance to treatment decision-making, and the complexity of information typically required to address them comprehensively. We assessed comprehensibility by implementing the multimodal generative AI Copilot (Microsoft 365 Copilot Chat). Spanning three domains-pre-therapy, during therapy, and post-therapy-each question was submitted to GPT-4.0 with the prompt “If you were a physician, how would you answer a patient asking…”. Ten pain physicians and two non-healthcare professionals independently assessed the responses using a Likert scale to rate reliability (1-6 points), accuracy (1-3 points), and comprehensibility (1-3 points). Results: Overall, ChatGPT's responses demonstrated high reliability (5.2 ± 0.6) and good comprehensibility (2.8 ± 0.2), with most answers meeting or exceeding predefined thresholds. Accuracy was moderate (2.7 ± 0.3), with lower performance on more technical topics like opioid tolerance and dependency management. Conclusions: While AI applications exhibit significant potential as a supplementary tool for patient education on long-term opioid therapy, limitations in addressing highly technical or context-specific queries underscore the need for ongoing refinement and domain-specific training. Integrating AI systems into clinical practice should involve collaboration between healthcare professionals and AI developers to ensure safe, personalized, and up-to-date patient education in chronic pain management.
Collapse
Affiliation(s)
- Giuliano Lo Bianco
- Anesthesiology and Pain Department, Foundation G. Giglio Cefalù, 90015 Palermo, Italy
| | - Christopher L. Robinson
- Anesthesiology, Perioperative, and Pain Medicine, Brigham and Women’s Hospital, Harvard Medical School, Harvard University, Boston, MA 02115, USA;
| | - Francesco Paolo D’Angelo
- Department of Anaesthesia, Intensive Care and Emergency, University Hospital Policlinico Paolo Giaccone, 90127 Palermo, Italy;
| | - Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry “Scuola Medica Salernitana”, University of Salerno, 84081 Baronissi, Italy;
| | - Silvia Natoli
- Department of Clinical-Surgical, Diagnostic and Pediatric Sciences, University of Pavia, 27100 Pavia, Italy;
- Pain Unit, Fondazione IRCCS Policlinico San Matteo, 27100 Pavia, Italy
| | - Emanuele Sinagra
- Gastroenterology and Endoscopy Unit, Fondazione Istituto San Raffaele Giglio, 90015 Cefalù, Italy;
| | - Sebastiano Mercadante
- Main Regional Center for Pain Relief and Supportive/Palliative Care, La Maddalena Cancer Center, Via San Lorenzo 312, 90146 Palermo, Italy;
| | - Filippo Drago
- Department of Biomedical and Biotechnological Sciences, University of Catania, 95124 Catania, Italy;
| |
Collapse
|
26
|
Phu J, Wang H, Kalloniatis M. Re: 'Using ChatGPT-4 in visual field test assessment'. Clin Exp Optom 2025:1-2. [PMID: 40032637 DOI: 10.1080/08164622.2025.2472876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/05/2025] Open
Affiliation(s)
- Jack Phu
- School of Optometry and Vision Science, University of New South Wales, Kensington, New South Wales, Australia
| | - Henrietta Wang
- School of Optometry and Vision Science, University of New South Wales, Kensington, New South Wales, Australia
| | | |
Collapse
|
27
|
Trapp C, Schmidt-Hegemann N, Keilholz M, Brose SF, Marschner SN, Schönecker S, Maier SH, Dehelean DC, Rottler M, Konnerth D, Belka C, Corradini S, Rogowski P. Patient- and clinician-based evaluation of large language models for patient education in prostate cancer radiotherapy. Strahlenther Onkol 2025; 201:333-342. [PMID: 39792259 PMCID: PMC11839798 DOI: 10.1007/s00066-024-02342-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/18/2024] [Indexed: 01/12/2025]
Abstract
BACKGROUND This study aims to evaluate the capabilities and limitations of large language models (LLMs) for providing patient education for men undergoing radiotherapy for localized prostate cancer, incorporating assessments from both clinicians and patients. METHODS Six questions about definitive radiotherapy for prostate cancer were designed based on common patient inquiries. These questions were presented to different LLMs [ChatGPT‑4, ChatGPT-4o (both OpenAI Inc., San Francisco, CA, USA), Gemini (Google LLC, Mountain View, CA, USA), Copilot (Microsoft Corp., Redmond, WA, USA), and Claude (Anthropic PBC, San Francisco, CA, USA)] via the respective web interfaces. Responses were evaluated for readability using the Flesch Reading Ease Index. Five radiation oncologists assessed the responses for relevance, correctness, and completeness using a five-point Likert scale. Additionally, 35 prostate cancer patients evaluated the responses from ChatGPT‑4 for comprehensibility, accuracy, relevance, trustworthiness, and overall informativeness. RESULTS The Flesch Reading Ease Index indicated that the responses from all LLMs were relatively difficult to understand. All LLMs provided answers that clinicians found to be generally relevant and correct. The answers from ChatGPT‑4, ChatGPT-4o, and Claude AI were also found to be complete. However, we found significant differences between the performance of different LLMs regarding relevance and completeness. Some answers lacked detail or contained inaccuracies. Patients perceived the information as easy to understand and relevant, with most expressing confidence in the information and a willingness to use ChatGPT‑4 for future medical questions. ChatGPT-4's responses helped patients feel better informed, despite the initially standardized information provided. CONCLUSION Overall, LLMs show promise as a tool for patient education in prostate cancer radiotherapy. While improvements are needed in terms of accuracy and readability, positive feedback from clinicians and patients suggests that LLMs can enhance patient understanding and engagement. Further research is essential to fully realize the potential of artificial intelligence in patient education.
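Because the Flesch Reading Ease Index is central to the readability finding above, a minimal Python sketch of the formula (FRE = 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)) follows; the syllable counter is a rough heuristic, whereas dedicated readability tools use dictionaries.

    import re

    def count_syllables(word: str) -> int:
        # Rough heuristic: count contiguous vowel groups, with a floor of one syllable per word
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def flesch_reading_ease(text: str) -> float:
        sentences = max(1, len(re.findall(r"[.!?]+", text)))
        words = re.findall(r"[A-Za-z']+", text)
        syllables = sum(count_syllables(w) for w in words)
        return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))

    print(round(flesch_reading_ease("The plan is safe. Ask your doctor."), 1))  # short, simple sentences score as easy to read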
Collapse
Affiliation(s)
- Christian Trapp
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany.
| | - Nina Schmidt-Hegemann
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Michael Keilholz
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Sarah Frederike Brose
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Sebastian N Marschner
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Stephan Schönecker
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Sebastian H Maier
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Diana-Coralia Dehelean
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Maya Rottler
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Dinah Konnerth
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Claus Belka
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
- Bavarian Cancer Research Center (BZKF), Munich, Germany
- German Cancer Consortium (DKTK), Partner Site Munich, Munich, Germany
| | - Stefanie Corradini
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| | - Paul Rogowski
- Department of Radiation Oncology, University Hospital, LMU Munich, Marchioninistr. 15, 81377, Munich, Germany
| |
Collapse
|
28
|
García-Rudolph A, Sanchez-Pinsach D, Caridad Fernandez M, Cunyat S, Opisso E, Hernandez-Pena E. How Chatbots Respond to NCLEX-RN Practice Questions: Assessment of Google Gemini, GPT-3.5, and GPT-4. Nurs Educ Perspect 2025; 46:E18-E20. [PMID: 39692545 DOI: 10.1097/01.nep.0000000000001364] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/19/2024]
Abstract
ABSTRACT ChatGPT often "hallucinates" or misleads, underscoring the need for formal validation at the professional level for reliable use in nursing education. We evaluated two free chatbots (Google Gemini and GPT-3.5) and a commercial version (GPT-4) on 250 standardized questions from a simulated nursing licensure exam, which closely matches the content and complexity of the actual exam. Gemini achieved 73.2 percent (183/250), GPT-3.5 achieved 72 percent (180/250), and GPT-4 reached a notably higher performance with 92.4 percent (231/250). GPT-4 exhibited its highest error rate (13.3%) in the psychosocial integrity category.
Collapse
Affiliation(s)
- Alejandro García-Rudolph
- About the Authors Alejandro García-Rudolph, PhD; David Sanchez-Pinsach, PhD; Mira Caridad Fernandez, MSc; Sandra Cunyat, MSc; Eloy Opisso, PhD; and Elena Hernandez-Pena, MSc, are faculty, Institut Guttmann Hospital de Neurorehabilitació, Barcelona, Spain. The authors are grateful to Olga Araujo of the Institut Guttmann-Documentation Office for her support in accessing the literature. For more information, contact Dr. Alejandro García-Rudolph at
| | | | | | | | | | | |
Collapse
|
29
|
Lehnen NC, Kürsch J, Wichtmann BD, Wolter M, Bendella Z, Bode FJ, Zimmermann H, Radbruch A, Vollmuth P, Dorn F. Llama 3.1 405B Is Comparable to GPT-4 for Extraction of Data from Thrombectomy Reports-A Step Towards Secure Data Extraction. Clin Neuroradiol 2025:10.1007/s00062-025-01500-z. [PMID: 39998651 DOI: 10.1007/s00062-025-01500-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Accepted: 01/12/2025] [Indexed: 02/27/2025]
Abstract
PURPOSE GPT‑4 has been shown to correctly extract procedural details from free-text reports on mechanical thrombectomy. However, GPT may not be suitable for analyzing reports containing personal data. The purpose of this study was to evaluate the ability of the large language models (LLMs) Llama3.1 405B, Llama3 70B, Llama3 8B, and Mixtral 8X7B, which can be operated offline, to extract procedural details from free-text reports on mechanical thrombectomies. METHODS Free-text reports on mechanical thrombectomy from two institutions were included. A detailed prompt was used in German and English. The ability of the LLMs to extract procedural data was compared with that of GPT‑4 using McNemar's test. The manual data entries made by an interventional neuroradiologist served as the reference standard. RESULTS 100 reports from institution 1 (mean age 74.7 ± 13.2 years; 53 females) and 30 reports from institution 2 (mean age 72.7 ± 13.5 years; 18 males) were included. Llama 3.1 405B extracted 2619 of 2800 data points correctly (93.5% [95%CI: 92.6%, 94.4%], p = 0.39 vs. GPT-4). Llama3 70B with the English prompt extracted 2537 data points correctly (90.6% [95%CI: 89.5%, 91.7%], p < 0.001 vs. GPT-4), and 2471 (88.2% [95%CI: 87.0%, 89.4%], p < 0.001 vs. GPT-4) with the German prompt. Llama 3 8B extracted 2314 data points correctly (86.1% [95%CI: 84.8%, 87.4%], p < 0.001 vs. GPT-4), and Mixtral 8X7B extracted 2411 (86.1% [95%CI: 84.8%, 87.4%], p < 0.001 vs. GPT-4) correctly. CONCLUSION Llama 3.1 405B was equal to GPT‑4 for data extraction from free-text reports on mechanical thrombectomies and may represent a data-secure alternative when operated locally.
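A minimal Python sketch of the paired comparison used above (McNemar's test on per-item correctness, plus a confidence interval for one model's accuracy) follows; the correctness vectors are simulated placeholders, not the study's data, and the Wilson CI shown is an assumed choice of interval method.

    import numpy as np
    from statsmodels.stats.contingency_tables import mcnemar
    from statsmodels.stats.proportion import proportion_confint

    rng = np.random.default_rng(1)
    n_items = 200                          # hypothetical number of extracted data points
    model_a = rng.random(n_items) < 0.93   # simulated per-item correctness of model A
    model_b = rng.random(n_items) < 0.90   # simulated per-item correctness of model B

    # 2x2 table of agreement/disagreement between the paired correctness vectors
    table = np.array([
        [np.sum(model_a & model_b),  np.sum(model_a & ~model_b)],
        [np.sum(~model_a & model_b), np.sum(~model_a & ~model_b)],
    ])
    result = mcnemar(table, exact=True)
    low, high = proportion_confint(count=model_a.sum(), nobs=n_items, method="wilson")
    print(f"McNemar p={result.pvalue:.3f}; model A accuracy 95% CI=({low:.3f}, {high:.3f})")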
Collapse
Affiliation(s)
- Nils C Lehnen
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany.
| | - Johannes Kürsch
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| | - Barbara D Wichtmann
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| | - Moritz Wolter
- High Performance Computing & Analytics Lab, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Zeynep Bendella
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| | - Felix J Bode
- Department of Vascular Neurology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, 53127, Bonn, Germany
| | - Hanna Zimmermann
- Institute of Neuroradiology, University Hospital, LMU Munich, Munich, Germany
| | - Alexander Radbruch
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| | - Philipp Vollmuth
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| | - Franziska Dorn
- Department of Neuroradiology, University Hospital Bonn, Rheinische Friedrich-Wilhelms-Universität Bonn, Venusberg-Campus 1, 53127, Bonn, Germany
| |
Collapse
|
30
|
Lo Bianco G, Cascella M, Li S, Day M, Kapural L, Robinson CL, Sinagra E. Reliability, Accuracy, and Comprehensibility of AI-Based Responses to Common Patient Questions Regarding Spinal Cord Stimulation. J Clin Med 2025; 14:1453. [PMID: 40094896 PMCID: PMC11899866 DOI: 10.3390/jcm14051453] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2025] [Revised: 02/06/2025] [Accepted: 02/19/2025] [Indexed: 03/19/2025] Open
Abstract
Background: Although spinal cord stimulation (SCS) is an effective treatment for managing chronic pain, many patients have understandable questions and concerns regarding this therapy. Artificial intelligence (AI) has shown promise in delivering patient education in healthcare. This study evaluates the reliability, accuracy, and comprehensibility of ChatGPT's responses to common patient inquiries about SCS. Methods: Thirteen commonly asked questions regarding SCS were selected based on the authors' clinical experience managing chronic pain patients and a targeted review of patient education materials and relevant medical literature. The questions were prioritized based on their frequency in patient consultations, relevance to decision-making about SCS, and the complexity of the information typically required to comprehensively address the questions. These questions spanned three domains: pre-procedural, intra-procedural, and post-procedural concerns. Responses were generated using GPT-4.0 with the prompt "If you were a physician, how would you answer a patient asking…". Responses were independently assessed by 10 pain physicians and two non-healthcare professionals using a Likert scale for reliability (1-6 points), accuracy (1-3 points), and comprehensibility (1-3 points). Results: ChatGPT's responses demonstrated strong reliability (5.1 ± 0.7) and comprehensibility (2.8 ± 0.2), with 92% and 98% of responses, respectively, meeting or exceeding our predefined thresholds. Accuracy was 2.7 ± 0.3, with 95% of responses rated sufficiently accurate. General queries, such as "What is spinal cord stimulation?" and "What are the risks and benefits?", received higher scores compared to technical questions like "What are the different types of waveforms used in SCS?". Conclusions: ChatGPT can be implemented as a supplementary tool for patient education, particularly in addressing general and procedural queries about SCS. However, the AI's performance was less robust in addressing highly technical or nuanced questions.
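A minimal Python sketch of this rating-aggregation step follows; the ratings and threshold values are hypothetical, chosen only to show the mean-and-threshold computation described above.

    import pandas as pd

    # Hypothetical long-format ratings: one row per (question, rater) pair
    ratings = pd.DataFrame({
        "question": [1, 1, 2, 2, 3, 3],
        "rater": ["r1", "r2", "r1", "r2", "r1", "r2"],
        "reliability": [5, 6, 4, 5, 6, 5],        # 1-6 scale
        "accuracy": [3, 2, 3, 3, 2, 3],           # 1-3 scale
        "comprehensibility": [3, 3, 2, 3, 3, 3],  # 1-3 scale
    })

    per_question = ratings.groupby("question")[["reliability", "accuracy", "comprehensibility"]].mean()
    print(per_question.agg(["mean", "std"]).round(2))  # overall mean and SD across questions

    thresholds = {"reliability": 4, "accuracy": 2, "comprehensibility": 2}  # hypothetical cut-offs
    for metric, cutoff in thresholds.items():
        pct = (per_question[metric] >= cutoff).mean() * 100
        print(f"{metric}: {pct:.0f}% of questions at or above threshold {cutoff}")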
Collapse
Affiliation(s)
- Giuliano Lo Bianco
- Anesthesiology and Pain Department, Foundation G. Giglio Cefalù, 90015 Palermo, Italy;
| | - Marco Cascella
- Anesthesia and Pain Medicine, Department of Medicine, Surgery and Dentistry “Scuola Medica Salernitana”, University of Salerno, 84081 Baronissi, Italy
| | - Sean Li
- National Spine and Pain Centers, Shrewsbury, NJ 07702, USA;
| | - Miles Day
- Department of Anesthesiology, Texas Tech University Health Sciences Center, Lubbock, TX 79430, USA;
| | | | - Christopher L. Robinson
- Anesthesiology, Perioperative, and Pain Medicine, Harvard Medical School, Brigham and Women’s Hospital, Boston, MA 02115, USA
| | - Emanuele Sinagra
- Gastroenterology and Endoscopy Unit, Fondazione Istituto San Raffaele Giglio, 90015 Cefalù, Italy;
| |
Collapse
|
31
|
On SW, Cho SW, Park SY, Ha JW, Yi SM, Park IY, Byun SH, Yang BE. Chat Generative Pre-Trained Transformer (ChatGPT) in Oral and Maxillofacial Surgery: A Narrative Review on Its Research Applications and Limitations. J Clin Med 2025; 14:1363. [PMID: 40004892 PMCID: PMC11856154 DOI: 10.3390/jcm14041363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/03/2025] [Revised: 02/17/2025] [Accepted: 02/17/2025] [Indexed: 02/27/2025] Open
Abstract
Objectives: This review aimed to evaluate the role of ChatGPT in original research articles within the field of oral and maxillofacial surgery (OMS), focusing on its applications, limitations, and future directions. Methods: A literature search was conducted in PubMed using predefined search terms and Boolean operators to identify original research articles utilizing ChatGPT published up to October 2024. The selection process involved screening studies based on their relevance to OMS and ChatGPT applications, with 26 articles meeting the final inclusion criteria. Results: ChatGPT has been applied in various OMS-related domains, including clinical decision support in real and virtual scenarios, patient and practitioner education, scientific writing and referencing, and its ability to answer licensing exam questions. As a clinical decision support tool, ChatGPT demonstrated moderate accuracy (approximately 70-80%). It showed moderate to high accuracy (up to 90%) in providing patient guidance and information. However, its reliability remains inconsistent across different applications, necessitating further evaluation. Conclusions: While ChatGPT presents potential benefits in OMS, particularly in supporting clinical decisions and improving access to medical information, it should not be regarded as a substitute for clinicians and must be used as an adjunct tool. Further validation studies and technological refinements are required to enhance its reliability and effectiveness in clinical and research settings.
Collapse
Affiliation(s)
- Sung-Woon On
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
| | - Seoung-Won Cho
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
| | - Sang-Yoon Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
| | - Ji-Won Ha
- Division of Oral and Maxillofacial Surgery, Department of Dentistry, Dongtan Sacred Heart Hospital, Hallym University College of Medicine, Hwaseong 18450, Republic of Korea; (S.-W.O.); (J.-W.H.)
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
| | - Sang-Min Yi
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
| | - In-Young Park
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
- Department of Orthodontics, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
| | - Soo-Hwan Byun
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
| | - Byoung-Eun Yang
- Department of Artificial Intelligence and Robotics in Dentistry, Graduated School of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea; (S.-W.C.); (S.-Y.P.); (S.-M.Y.); (I.-Y.P.); (S.-H.B.)
- Institute of Clinical Dentistry, Hallym University, Chuncheon 24252, Republic of Korea
- Department of Oral and Maxillofacial Surgery, Hallym University Sacred Heart Hospital, Anyang 14066, Republic of Korea
- Dental Artificial Intelligence and Robotics R&D Center, Hallym University Medical Center, Anyang 14066, Republic of Korea
| |
Collapse
|
32
|
Wang Y, Yang S, Zeng C, Xie Y, Shen Y, Li J, Huang X, Wei R, Chen Y. Evaluating the performance of ChatGPT in patient consultation and image-based preliminary diagnosis in thyroid eye disease. Front Med (Lausanne) 2025; 12:1546706. [PMID: 40041459 PMCID: PMC11876178 DOI: 10.3389/fmed.2025.1546706] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2024] [Accepted: 01/27/2025] [Indexed: 03/06/2025] Open
Abstract
Background The emergence of Large Language Model (LLM) chatbots, such as ChatGPT, holds great promise for enhancing healthcare practice. Online consultation, accurate preliminary diagnosis, and clinical efficiency are of fundamental importance for a patient-oriented management system. Objective This cross-sectional study aims to evaluate the performance of ChatGPT in inquiries across ophthalmic domains and to focus on Thyroid Eye Disease (TED) consultation and image-based preliminary diagnosis in a non-English language. Methods We obtained frequently consulted clinical inquiries from a published reference based on patient consultation data, titled A Comprehensive Collection of Thyroid Eye Disease Knowledge. Additionally, we collected facial and Computed Tomography (CT) images from 16 patients with a definitive diagnosis of TED. From 18 to 30 May 2024, inquiries about TED consultation and preliminary diagnosis were posed to ChatGPT using a new chat for each question. Responses to questions from ChatGPT-4, 4o, and an experienced ocular professor were compiled into three questionnaires, which were evaluated by patients and ophthalmologists on four dimensions: accuracy, comprehensiveness, conciseness, and satisfaction. The accuracy of the preliminary diagnoses of TED was assessed, and differences in accuracy rates were calculated. Results For common TED consultation questions, ChatGPT-4o delivered more accurate information with logical consistency, adhering to a structured format of disease definition, detailed sections, and summarized conclusions. Notably, the answers generated by ChatGPT-4o were rated higher than those of ChatGPT-4 and the professor, with scores for accuracy (4.33 [0.69]), comprehensiveness (4.17 [0.75]), conciseness (4.12 [0.77]), and satisfaction (4.28 [0.70]). The characteristics of the evaluators, the response variables, and other quality scores were all correlated with overall satisfaction levels. Based on several facial images, ChatGPT-4 twice failed to make diagnoses because characteristic symptoms or a complete medical history were lacking, whereas ChatGPT-4o accurately identified the pathologic conditions in 31.25% of cases (95% confidence interval, CI: 11.02-58.66%). Furthermore, in combination with CT images, ChatGPT-4o performed comparably to the professor in terms of diagnostic accuracy (87.5%, 95% CI: 61.65-98.45%). Conclusion ChatGPT-4o excelled in comprehensive and satisfactory patient consultation and imaging interpretation, indicating the potential to improve clinical practice efficiency. However, limitations in disinformation management and legal permissions remain major concerns, which require further investigation in clinical practice.
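The interval reported for the image-based diagnoses can be illustrated with a minimal Python sketch of an exact (Clopper-Pearson) binomial confidence interval; treating 31.25% of the 16 cases as 5 correct diagnoses is an inference from the reported percentages, and the authors' exact CI method is assumed rather than stated here.

    from statsmodels.stats.proportion import proportion_confint

    correct, total = 5, 16  # inferred from the reported 31.25% of 16 image-based cases
    low, high = proportion_confint(correct, total, alpha=0.05, method="beta")  # "beta" = Clopper-Pearson exact interval
    print(f"{correct}/{total} correct -> 95% CI: {low:.4f} to {high:.4f}")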
Collapse
Affiliation(s)
- Yue Wang
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Shuo Yang
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Chengcheng Zeng
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Yingwei Xie
- Department of Urology, Beijing Tongren Hospital of Capital Medical University, Beijing, China
| | - Ya Shen
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Jian Li
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Xiao Huang
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Ruili Wei
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| | - Yuqing Chen
- Department of Ophthalmology, Changzheng Hospital of Naval Medical University, Shanghai, China
| |
Collapse
|
33
|
Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. [PMID: 39903463 PMCID: PMC11795331 DOI: 10.1001/jamanetworkopen.2024.57879] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/30/2024] [Accepted: 11/26/2024] [Indexed: 02/06/2025] Open
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian to yield 7752 articles. Two reviewers screened articles by title and abstract followed by full-text review to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase in their study. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while less than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Collapse
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
| | - Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
| | - Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
| | - Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
| | - Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
| | | | | | - Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
| | - Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
| |
Collapse
|
34
|
Kahlon S, Sleet M, Sujka J, Docimo S, DuCoin C, Dimou F, Mhaskar R. Evaluating the concordance of ChatGPT and physician recommendations for bariatric surgery. Can J Physiol Pharmacol 2025; 103:70-74. [PMID: 39561352 DOI: 10.1139/cjpp-2024-0026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2024]
Abstract
Integrating artificial intelligence (AI) into healthcare prompts the need to measure its proficiency relative to human experts. This study evaluates the proficiency of ChatGPT, an OpenAI language model, in offering guidance concerning bariatric surgery compared with that of bariatric surgeons. Five clinical scenarios representative of diverse bariatric surgery situations were given to American Society for Metabolic and Bariatric Surgery (ASMBS)-accredited bariatric surgeons and ChatGPT. Both groups proposed medical or surgical management for the patients depicted in each scenario. The outcomes from both the surgeons and ChatGPT were examined and matched with the clinical benchmarks set by the ASMBS. There was a high degree of agreement between ChatGPT and physicians on the three simpler clinical scenarios. There was a positive correlation between physicians' and ChatGPT's answers regarding not recommending surgery. ChatGPT's advice aligned with ASMBS guidelines 60% of the time, in contrast to bariatric surgeons, who consistently aligned with the guidelines 100% of the time. ChatGPT showcases potential in offering guidance on bariatric surgery, but it does not have the comprehensive and personalized perspective that doctors exhibit consistently. Enhancing AI's training on intricate patient situations will bolster its role in the medical field.
Collapse
Affiliation(s)
- Sunny Kahlon
- University of South Florida Health Morsani College of Medicine, Tampa, FL, USA
| | - Mary Sleet
- University of South Florida Health Morsani College of Medicine, Tampa, FL, USA
| | - Joseph Sujka
- Department of Surgery, University of South Florida Morsani College of Medicine, Tampa, FL, USA
| | - Salvatore Docimo
- Department of Surgery, University of South Florida Morsani College of Medicine, Tampa, FL, USA
| | - Christopher DuCoin
- Department of Surgery, University of South Florida Morsani College of Medicine, Tampa, FL, USA
| | - Francesca Dimou
- Department of Surgery, University of South Florida Morsani College of Medicine, Tampa, FL, USA
| | - Rahul Mhaskar
- Department of Internal Medicine and Medical Education, University of South Florida Morsani College of Medicine, Tampa, FL, USA
| |
Collapse
|
35
|
Mehta R, Reitz JG, Venna A, Selcuk A, Dhamala B, Klein J, Sawda C, Haverty M, Yerebakan C, Tongut A, Desai M, d'Udekem Y. Navigating the future of pediatric cardiovascular surgery: Insights and innovation powered by Chat Generative Pre-Trained Transformer (ChatGPT). J Thorac Cardiovasc Surg 2025:S0022-5223(25)00093-5. [PMID: 39894069 DOI: 10.1016/j.jtcvs.2025.01.022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 07/22/2024] [Revised: 12/16/2024] [Accepted: 01/10/2025] [Indexed: 02/04/2025]
Abstract
INTRODUCTION Interdisciplinary consultations are essential to decision-making for patients with congenital heart disease. The integration of artificial intelligence (AI) and natural language processing into medical practice is rapidly accelerating, opening new avenues to diagnosis and treatment. The main objective of this study was to consult the AI-trained model Chat Generative Pre-Trained Transformer (ChatGPT) regarding cases discussed during a cardiovascular surgery conference (CSC) at a single tertiary center and compare the ChatGPT suggestions with CSC expert consensus results. METHODS In total, 37 cases discussed at a single CSC were retrospectively identified. Clinical information comprised deidentified data from the last electrocardiogram, echocardiogram, intensive care unit progress note (or cardiology clinic note if outpatient), as well as a patient summary. The diagnosis was removed from the summary and possible treatment options were deleted from all notes. ChatGPT (version 4.0) was asked to summarize the case, identify diagnoses, and recommend surgical procedures and timing of surgery. The responses of ChatGPT were compared with the results of the CSC. RESULTS Of the 37 cases uploaded to ChatGPT, 45.9% (n = 17) were considered to be less complex cases, with only 1 treatment option, and 54.1% (n = 20) were considered more complex, with several treatment options. ChatGPT provided a detailed and systematically written summary for each case within 10 to 15 seconds. ChatGPT correctly identified diagnoses in approximately 94.5% (n = 35) of cases. The surgical intervention plan matched the group decision in approximately 40.5% (n = 15) of cases, whereas it differed in 27% of cases. In 23 of 37 cases, timing of surgery was the same between the CSC group and ChatGPT. Overall, the match between ChatGPT responses and CSC decisions was 94.5% for diagnosis, 40.5% for surgical intervention, and 62.2% for timing of surgery. Within the complex cases, however, agreement was 25% for surgical intervention and 67% for timing of surgery. CONCLUSIONS ChatGPT can be used as an augmentative tool for surgical conferences to systematically summarize large amounts of patient data from electronic health records and clinical notes in seconds. In addition, our study points out the potential of ChatGPT as an AI-based decision support tool in surgery, particularly for less-complex cases. The discrepancy, particularly in complex cases, emphasizes the need for caution when using ChatGPT in decision-making for complex cases in pediatric cardiovascular surgery. There is little doubt that the public will soon use this comparative tool.
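A minimal sketch of the kind of per-dimension agreement tally reported above (diagnosis, surgical intervention, timing of surgery), stratified by case complexity. The case records and values are invented for illustration and do not reproduce the study's data.

```python
# Hypothetical sketch of the per-dimension agreement tally described above.
# Each record marks whether ChatGPT matched the conference consensus; values are illustrative.
cases = [
    {"complex": False, "diagnosis": True, "intervention": True,  "timing": True},
    {"complex": True,  "diagnosis": True, "intervention": False, "timing": True},
    {"complex": True,  "diagnosis": True, "intervention": False, "timing": False},
    # ... one entry per case discussed at the conference
]

def agreement(records, key):
    """Percentage of records where ChatGPT matched the conference decision for `key`."""
    return 100 * sum(r[key] for r in records) / len(records)

for key in ("diagnosis", "intervention", "timing"):
    overall = agreement(cases, key)
    complex_only = agreement([r for r in cases if r["complex"]], key)
    print(f"{key:>12}: overall {overall:.1f}%, complex cases {complex_only:.1f}%")
```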
Collapse
Affiliation(s)
- Rittal Mehta
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Justus G Reitz
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Alyssia Venna
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Arif Selcuk
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Bishakha Dhamala
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Jennifer Klein
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Christine Sawda
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Mitchell Haverty
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Can Yerebakan
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Aybala Tongut
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Manan Desai
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC
| | - Yves d'Udekem
- Department of Cardiac Surgery, Children's National Heart Institute, Children's National Hospital, Washington, DC.
| |
Collapse
|
36
|
Ma Y, Achiche S, Tu G, Vicente S, Lessard D, Engler K, Lemire B, Laymouna M, de Pokomandy A, Cox J, Lebouché B. The first AI-based Chatbot to promote HIV self-management: A mixed methods usability study. HIV Med 2025; 26:184-206. [PMID: 39390632 PMCID: PMC11786622 DOI: 10.1111/hiv.13720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2024] [Accepted: 09/20/2024] [Indexed: 10/12/2024]
Abstract
BACKGROUND We developed MARVIN, an artificial intelligence (AI)-based chatbot that provides 24/7 expert-validated information on self-management-related topics for people with HIV. This study assessed (1) the feasibility of using MARVIN, (2) its usability and acceptability, and (3) four usability subconstructs (perceived ease of use, perceived usefulness, attitude towards use, and behavioural intention to use). METHODS In a mixed-methods study conducted at the McGill University Health Centre, enrolled participants were asked to have 20 conversations within 3 weeks with MARVIN on predetermined topics and to complete a usability questionnaire. Feasibility, usability, acceptability, and usability subconstructs were examined against predetermined success thresholds. Qualitatively, randomly selected participants were invited to semi-structured focus groups/interviews to discuss their experiences with MARVIN. Barriers and facilitators were identified according to the four usability subconstructs. RESULTS From March 2021 to April 2022, 28 participants were surveyed after a 3-week testing period, and nine were interviewed. Study retention was 70% (28/40). Mean usability exceeded the threshold (69.9/68), whereas mean acceptability was very close to target (23.8/24). Ratings of attitude towards MARVIN's use were positive (+14%), with the remaining subconstructs exceeding the target (5/7). Facilitators included MARVIN's reliable and useful real-time information support, its easy accessibility, provision of convivial conversations, confidentiality, and perception as being emotionally safe. However, MARVIN's limited comprehension and the use of Facebook as an implementation platform were identified as barriers, along with the need for more conversation topics and new features (e.g., memorization). CONCLUSIONS The study demonstrated MARVIN's global usability. Our findings show its potential for HIV self-management and provide direction for further development.
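The 68-point usability threshold cited above corresponds to the conventional System Usability Scale (SUS) benchmark; whether the study used the SUS instrument itself is an assumption here. The sketch below shows how such a mean usability score could be computed from 10-item questionnaire responses (invented).

```python
# Hedged sketch: scoring a 10-item System Usability Scale (SUS) questionnaire and
# checking the mean usability score against the conventional 68-point benchmark.
# Whether the study used the SUS instrument itself is an assumption; responses are invented.
def sus_score(responses):
    """responses: 10 Likert ratings (1-5). Standard SUS scoring yields a 0-100 score."""
    odd  = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])   # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5

participants = [
    [4, 2, 5, 1, 4, 2, 5, 2, 4, 2],
    [5, 1, 4, 2, 4, 1, 5, 2, 5, 1],
    [3, 2, 4, 2, 3, 3, 4, 2, 4, 3],
]

scores = [sus_score(r) for r in participants]
mean_score = sum(scores) / len(scores)
print(f"Mean usability: {mean_score:.1f} (success threshold: 68)")
```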
Collapse
Affiliation(s)
- Yuanchao Ma
- Department of Biomedical Engineering, Polytechnique MontréalMontrealQuebecCanada
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
| | - Sofiane Achiche
- Department of Biomedical Engineering, Polytechnique MontréalMontrealQuebecCanada
| | - Gavin Tu
- Faculty of MedicineUniversité LavalQuebecQuebecCanada
| | - Serge Vicente
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Department of Family Medicine, Faculty of Medicine and Health SciencesMcGill UniversityMontrealQuebecCanada
- Department of Mathematics and StatisticsUniversity of MontrealMontrealQuebecCanada
| | - David Lessard
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
| | - Kim Engler
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
| | - Benoît Lemire
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
- Department of PharmacyMcGill University Health CentreMontrealQuebecCanada
| | | | - Moustafa Laymouna
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Department of Family Medicine, Faculty of Medicine and Health SciencesMcGill UniversityMontrealQuebecCanada
| | - Alexandra de Pokomandy
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
- Department of Family Medicine, Faculty of Medicine and Health SciencesMcGill UniversityMontrealQuebecCanada
| | - Joseph Cox
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
- Department of Epidemiology, Biostatistics, and Occupational Health, Faculty of Medicine and Health SciencesMcGill UniversityMontrealQuebecCanada
| | - Bertrand Lebouché
- Centre for Outcomes Research & EvaluationResearch Institute of the McGill University Health CentreMontrealQuebecCanada
- Infectious Diseases and Immunity in Global Health ProgramResearch Institute of McGill University Health CentreMontrealQuebecCanada
- Chronic Viral Illness Service, Division of Infectious Disease, Department of MedicineMcGill University Health CentreMontrealQuebecCanada
- Department of Family Medicine, Faculty of Medicine and Health SciencesMcGill UniversityMontrealQuebecCanada
| |
Collapse
|
37
|
Koirala P, Thongprayoon C, Miao J, Garcia Valencia OA, Sheikh MS, Suppadungsuk S, Mao MA, Pham JH, Craici IM, Cheungpasitporn W. Evaluating AI performance in nephrology triage and subspecialty referrals. Sci Rep 2025; 15:3455. [PMID: 39870788 PMCID: PMC11772766 DOI: 10.1038/s41598-025-88074-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2024] [Accepted: 01/23/2025] [Indexed: 01/29/2025] Open
Abstract
Artificial intelligence (AI) has shown promise in revolutionizing medical triage, particularly in the context of the rising prevalence of kidney-related conditions with the aging global population. This study evaluates the utility of ChatGPT, a large language model, in triaging nephrology cases through simulated real-world scenarios. Two nephrologists created 100 patient cases that encompassed various aspects of nephrology. ChatGPT's performance in determining the appropriateness of nephrology consultations and identifying suitable nephrology subspecialties was assessed. The results demonstrated high accuracy; ChatGPT correctly determined the need for nephrology in 99-100% of cases, and it accurately identified the most suitable nephrology subspecialty triage in 96-99% of cases across two evaluation rounds. The agreement between the two rounds was 97%. While ChatGPT showed promise in improving medical triage efficiency and accuracy, the study also identified areas for refinement. This included the need for better integration of multidisciplinary care for patients with complex, intersecting medical conditions. This study's findings highlight the potential of AI in enhancing decision-making processes in clinical workflow, and it can inform the development of AI-assisted triage systems tailored to institution-specific practices including multidisciplinary approaches.
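As a rough illustration of the accuracy and between-round agreement figures reported above, the sketch below scores two rounds of model answers against a nephrologist-defined reference. The cases and answers are invented; the study's 100 cases and scoring details are not reproduced.

```python
# Hedged sketch of the accuracy and between-round agreement calculations described above.
# Case labels and answers are invented and do not reproduce the study's data.
reference = ["transplant", "glomerular", "stones", "dialysis"]   # nephrologists' intended triage
round_1   = ["transplant", "glomerular", "stones", "dialysis"]
round_2   = ["transplant", "glomerular", "stones", "onconephrology"]

def accuracy(predictions, truth):
    """Fraction of cases where the model's subspecialty triage matches the reference."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)

between_round = sum(a == b for a, b in zip(round_1, round_2)) / len(round_1)

print(f"Round 1 subspecialty accuracy: {accuracy(round_1, reference):.0%}")
print(f"Round 2 subspecialty accuracy: {accuracy(round_2, reference):.0%}")
print(f"Between-round agreement:       {between_round:.0%}")
```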
Collapse
Affiliation(s)
| | - Charat Thongprayoon
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
| | - Jing Miao
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
| | - Oscar A Garcia Valencia
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
| | - Mohammad S Sheikh
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
| | - Supawadee Suppadungsuk
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
- Faculty of Medicine Ramathibodi Hospital, Chakri Naruebodindra Medical Institute, Mahidol University, Samut Prakan, 10540, Thailand
| | - Michael A Mao
- Division of Nephrology and Hypertension, Department of Medicine, Mayo Clinic, Jacksonville, FL, 32224, USA
| | - Justin H Pham
- Internal Medicine, Mayo Clinic, Rochester, MN, 55905, USA
| | - Iasmina M Craici
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA
| | - Wisit Cheungpasitporn
- Division of Nephrology and Hypertension, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, USA.
| |
Collapse
|
38
|
García-Rudolph A, Sanchez-Pinsach D, Opisso E. Evaluating AI Models: Performance Validation Using Formal Multiple-Choice Questions in Neuropsychology. Arch Clin Neuropsychol 2025; 40:150-155. [PMID: 39231527 DOI: 10.1093/arclin/acae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2024] [Revised: 08/13/2024] [Accepted: 08/19/2024] [Indexed: 09/06/2024] Open
Abstract
High-quality and accessible education is crucial for advancing neuropsychology. A recent study identified key barriers to board certification in clinical neuropsychology, such as time constraints and insufficient specialized knowledge. To address these challenges, this study explored the capabilities of advanced Artificial Intelligence (AI) language models, GPT-3.5 (free version) and GPT-4.0 (subscription version), by evaluating their performance on 300 questions modeled on the American Board of Professional Psychology in Clinical Neuropsychology examination. The results indicate that GPT-4.0 achieved a higher accuracy rate of 80.0% compared with GPT-3.5's 65.7%. In the "Assessment" category, GPT-4.0 demonstrated a notable improvement, with an accuracy rate of 73.4% compared with GPT-3.5's 58.6% (p = 0.012). The "Assessment" category, which comprised 128 questions and exhibited the highest error rate for both AI models, was analyzed further. A thematic analysis of the 26 incorrectly answered questions revealed 8 main themes and 17 specific codes, highlighting significant gaps in areas such as "Neurodegenerative Diseases" and "Neuropsychological Testing and Interpretation."
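The Assessment-category comparison (73.4% vs. 58.6% over 128 questions, p = 0.012) is consistent with a standard two-proportion z-test, as the sketch below shows. The correct-answer counts (94 and 75 of 128) are back-calculated from the reported percentages, and the choice of test is an assumption rather than the authors' stated method.

```python
# Hedged sketch: a two-proportion z-test on the Assessment-category results reported above.
# Counts (94/128 and 75/128) are inferred from the stated percentages; the study's exact
# statistical procedure is an assumption.
from statsmodels.stats.proportion import proportions_ztest

correct = [94, 75]     # GPT-4.0 vs GPT-3.5 correct answers (back-calculated)
total   = [128, 128]   # Assessment-category questions answered by each model

z_stat, p_value = proportions_ztest(correct, total)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")   # roughly p = 0.012, in line with the abstract
```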
Collapse
Affiliation(s)
- Alejandro García-Rudolph
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| | - David Sanchez-Pinsach
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| | - Eloy Opisso
- Departmento de Investigación e Innovación, Institut Guttmann, Institut Universitari de Neurorehabilitació adscrit a la UAB, Badalona, Barcelona, Spain
- Universitat Autònoma de Barcelona, Bellaterra (Cerdanyola del Vallès), Spain
- Fundació Institut d'Investigació en Ciències de la Salut Germans Trias i Pujol, Badalona, Barcelona, Spain
| |
Collapse
|
39
|
Kim J, Vajravelu BN. Assessing the Current Limitations of Large Language Models in Advancing Health Care Education. JMIR Form Res 2025; 9:e51319. [PMID: 39819585 PMCID: PMC11756841 DOI: 10.2196/51319] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Revised: 08/31/2024] [Accepted: 09/03/2024] [Indexed: 01/19/2025] Open
Abstract
The integration of large language models (LLMs), as seen with the generative pretrained transformers series, into health care education and clinical management represents a transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet its embracement also elicits considerable concerns that necessitate careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvements: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) risks of algorithmic bias, (6) exhibition of moral instability, (7) technological limitations in plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges. Future research should focus on strategically addressing the persistent challenges of LLMs highlighted in this paper, opening the door for effective measures that can improve their application in health care education.
Collapse
Affiliation(s)
- JaeYong Kim
- School of Pharmacy, Massachusetts College of Pharmacy and Health Sciences, Boston, MA, United States
| | - Bathri Narayan Vajravelu
- Department of Physician Assistant Studies, Massachusetts College of Pharmacy and Health Sciences, 179 Longwood Avenue, Boston, MA, 02115, United States, 1 6177322961
| |
Collapse
|
40
|
Jaber SA, Hasan HE, Alzoubi KH, Khabour OF. Knowledge, attitude, and perceptions of MENA researchers towards the use of ChatGPT in research: A cross-sectional study. Heliyon 2025; 11:e41331. [PMID: 39811375 PMCID: PMC11731567 DOI: 10.1016/j.heliyon.2024.e41331] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/06/2024] [Revised: 12/03/2024] [Accepted: 12/17/2024] [Indexed: 01/16/2025] Open
Abstract
Background Artificial intelligence (AI) technologies are increasingly recognized for their potential to revolutionize research practices. However, there is a gap in understanding the perspectives of MENA researchers on ChatGPT. This study explores the knowledge, attitudes, and perceptions of ChatGPT utilization in research. Methods A cross-sectional survey was conducted among 369 MENA researchers. Participants provided demographic information and responded to questions about their knowledge of AI, their experience with ChatGPT, their attitudes toward technology, and their perceptions of the potential roles and benefits of ChatGPT in research. Results The results indicate a moderate level of knowledge about ChatGPT, with a total score of 58.3 ± 19.6. Attitudes towards its use were generally positive, with a total score of 68.1 ± 8.1 expressing enthusiasm for integrating ChatGPT into their research workflow. About 56% of the sample reported using ChatGPT for various applications. In addition, 27.6% expressed their intention to use it in their research, while 17.3% have already started using it in their research. However, perceptions varied, with concerns about accuracy, bias, and ethical implications highlighted. The results showed significant differences in knowledge scores based on gender (p < 0.001), working country (p < 0.05), and work field (p < 0.01). Regarding attitude scores, there were significant differences based on the highest qualification and the employment field (p < 0.05). These findings underscore the need for targeted training programs and ethical guidelines to support the effective use of ChatGPT in research. Conclusion MENA researchers demonstrate significant awareness and interest in integrating ChatGPT into their research workflow. Addressing concerns about reliability and ethical implications is essential for advancing scientific innovation in the MENA region.
Collapse
Affiliation(s)
- Sana'a A. Jaber
- Department of Clinical Pharmacy, Faculty of Pharmacy, Jordan University of Science and Technology, Irbid, 22110, Jordan
| | - Hisham E. Hasan
- Department of Clinical Pharmacy, Faculty of Pharmacy, Jordan University of Science and Technology, Irbid, 22110, Jordan
| | - Karem H. Alzoubi
- Department of Clinical Pharmacy, Faculty of Pharmacy, Jordan University of Science and Technology, Irbid, 22110, Jordan
| | - Omar F. Khabour
- Department of Medical Laboratory Sciences, Faculty of Applied Medical Sciences, Jordan University of Science and Technology, Irbid, 22110, Jordan
| |
Collapse
|
41
|
Andriollo L, Picchi A, Iademarco G, Fidanza A, Perticarini L, Rossi SMP, Logroscino G, Benazzo F. The Role of Artificial Intelligence and Emerging Technologies in Advancing Total Hip Arthroplasty. J Pers Med 2025; 15:21. [PMID: 39852213 PMCID: PMC11767033 DOI: 10.3390/jpm15010021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2024] [Revised: 01/05/2025] [Accepted: 01/07/2025] [Indexed: 01/26/2025] Open
Abstract
Total hip arthroplasty (THA) is a widely performed surgical procedure that has evolved significantly due to advancements in artificial intelligence (AI) and robotics. As demand for THA grows, reliable tools are essential to enhance diagnosis, preoperative planning, surgical precision, and postoperative rehabilitation. AI applications in orthopedic surgery offer innovative solutions, including automated hip osteoarthritis (OA) diagnosis, precise implant positioning, and personalized risk stratification, thereby improving patient outcomes. Deep learning models have transformed OA severity grading and implant identification by automating traditionally manual processes with high accuracy. Additionally, AI-powered systems optimize preoperative planning by predicting the hip joint center and identifying complications using multimodal data. Robotic-assisted THA enhances surgical precision with real-time feedback, reducing complications such as dislocations and leg length discrepancies while accelerating recovery. Despite these advancements, barriers such as cost, accessibility, and the steep learning curve for surgeons hinder widespread adoption. Postoperative rehabilitation benefits from technologies like virtual and augmented reality and telemedicine, which enhance patient engagement and adherence. However, limitations, particularly among elderly populations with lower adaptability to technology, underscore the need for user-friendly platforms. To ensure comprehensiveness, a structured literature search was conducted using PubMed, Scopus, and Web of Science. Keywords included "artificial intelligence", "machine learning", "robotics", and "total hip arthroplasty". Inclusion criteria emphasized peer-reviewed studies published in English within the last decade focusing on technological advancements and clinical outcomes. This review evaluates AI and robotics' role in THA, highlighting opportunities and challenges and emphasizing further research and real-world validation to integrate these technologies into clinical practice effectively.
Collapse
Affiliation(s)
- Luca Andriollo
- Sezione di Chirurgia Protesica ad Indirizzo Robotico—Unità di Traumatologia dello Sport, Ortopedia e Traumatologia, Fondazione Poliambulanza, 25124 Brescia, Italy
- Ortopedia e Traumatologia, Università Cattolica del Sacro Cuore, 00168 Rome, Italy
- Artificial Intelligence Center, Alma Mater Europaea University, 1010 Vienna, Austria
| | - Aurelio Picchi
- Unit of Orthopedics, Department of Life, Health and Environmental Sciences, University of L’Aquila, 67100 L’Aquila, Italy
| | - Giulio Iademarco
- Unit of Orthopedics, Department of Life, Health and Environmental Sciences, University of L’Aquila, 67100 L’Aquila, Italy
| | - Andrea Fidanza
- Unit of Orthopedics, Department of Life, Health and Environmental Sciences, University of L’Aquila, 67100 L’Aquila, Italy
| | - Loris Perticarini
- Sezione di Chirurgia Protesica ad Indirizzo Robotico—Unità di Traumatologia dello Sport, Ortopedia e Traumatologia, Fondazione Poliambulanza, 25124 Brescia, Italy
| | - Stefano Marco Paolo Rossi
- Sezione di Chirurgia Protesica ad Indirizzo Robotico—Unità di Traumatologia dello Sport, Ortopedia e Traumatologia, Fondazione Poliambulanza, 25124 Brescia, Italy
- Department of Life Science, Health, and Health Professions, Università degli Studi Link, Link Campus University, 00165 Rome, Italy
- Biomedical Sciences Area, IUSS University School for Advanced Studies, 27100 Pavia, Italy
| | - Giandomenico Logroscino
- Unit of Orthopedics, Department of Life, Health and Environmental Sciences, University of L’Aquila, 67100 L’Aquila, Italy
| | - Francesco Benazzo
- Sezione di Chirurgia Protesica ad Indirizzo Robotico—Unità di Traumatologia dello Sport, Ortopedia e Traumatologia, Fondazione Poliambulanza, 25124 Brescia, Italy
- Biomedical Sciences Area, IUSS University School for Advanced Studies, 27100 Pavia, Italy
| |
Collapse
|
42
|
Antonie NI, Gheorghe G, Ionescu VA, Tiucă LC, Diaconu CC. The Role of ChatGPT and AI Chatbots in Optimizing Antibiotic Therapy: A Comprehensive Narrative Review. Antibiotics (Basel) 2025; 14:60. [PMID: 39858346 PMCID: PMC11761957 DOI: 10.3390/antibiotics14010060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/14/2024] [Revised: 01/03/2025] [Accepted: 01/07/2025] [Indexed: 01/27/2025] Open
Abstract
Background/Objectives: Antimicrobial resistance represents a growing global health crisis, demanding innovative approaches to improve antibiotic stewardship. Artificial intelligence (AI) chatbots based on large language models have shown potential as tools to support clinicians, especially non-specialists, in optimizing antibiotic therapy. This review aims to synthesize current evidence on the capabilities, limitations, and future directions for AI chatbots in enhancing antibiotic selection and patient outcomes. Methods: A narrative review was conducted by analyzing studies published in the last five years across databases such as PubMed, SCOPUS, Web of Science, and Google Scholar. The review focused on research discussing AI-based chatbots, antibiotic stewardship, and clinical decision support systems. Studies were evaluated for methodological soundness and significance, and the findings were synthesized narratively. Results: Current evidence highlights the ability of AI chatbots to assist in guideline-based antibiotic recommendations, improve medical education, and enhance clinical decision-making. Promising results include satisfactory accuracy in preliminary diagnostic and prescriptive tasks. However, challenges such as inconsistent handling of clinical nuances, susceptibility to unsafe advice, algorithmic biases, data privacy concerns, and limited clinical validation underscore the importance of human oversight and refinement. Conclusions: AI chatbots have the potential to complement antibiotic stewardship efforts by promoting appropriate antibiotic use and improving patient outcomes. Realizing this potential will require rigorous clinical trials, interdisciplinary collaboration, regulatory clarity, and tailored algorithmic improvements to ensure their safe and effective integration into clinical practice.
Collapse
Affiliation(s)
- Ninel Iacobus Antonie
- Faculty of Medicine, University of Medicine and Pharmacy Carol Davila Bucharest, 050474 Bucharest, Romania; (N.I.A.); (V.A.I.); (C.C.D.)
- Internal Medicine Department, Clinical Emergency Hospital of Bucharest, 105402 Bucharest, Romania
| | - Gina Gheorghe
- Faculty of Medicine, University of Medicine and Pharmacy Carol Davila Bucharest, 050474 Bucharest, Romania; (N.I.A.); (V.A.I.); (C.C.D.)
- Internal Medicine Department, Clinical Emergency Hospital of Bucharest, 105402 Bucharest, Romania
| | - Vlad Alexandru Ionescu
- Faculty of Medicine, University of Medicine and Pharmacy Carol Davila Bucharest, 050474 Bucharest, Romania; (N.I.A.); (V.A.I.); (C.C.D.)
- Internal Medicine Department, Clinical Emergency Hospital of Bucharest, 105402 Bucharest, Romania
| | - Loredana-Crista Tiucă
- Faculty of Medicine, University of Medicine and Pharmacy Carol Davila Bucharest, 050474 Bucharest, Romania; (N.I.A.); (V.A.I.); (C.C.D.)
- Internal Medicine Department, Clinical Emergency Hospital of Bucharest, 105402 Bucharest, Romania
| | - Camelia Cristina Diaconu
- Faculty of Medicine, University of Medicine and Pharmacy Carol Davila Bucharest, 050474 Bucharest, Romania; (N.I.A.); (V.A.I.); (C.C.D.)
- Internal Medicine Department, Clinical Emergency Hospital of Bucharest, 105402 Bucharest, Romania
- Academy of Romanian Scientists, 050045 Bucharest, Romania
| |
Collapse
|
43
|
Austin J, Benas K, Caicedo S, Imiolek E, Piekutowski A, Ghanim I. Perceptions of Artificial Intelligence and ChatGPT by Speech-Language Pathologists and Students. AMERICAN JOURNAL OF SPEECH-LANGUAGE PATHOLOGY 2025; 34:174-200. [PMID: 39496075 DOI: 10.1044/2024_ajslp-24-00218] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/06/2024]
Abstract
PURPOSE This project explores the perceived implications of artificial intelligence (AI) tools and generative language tools, like ChatGPT, on practice in speech-language pathology. METHOD A total of 107 clinician (n = 60) and student (n = 47) participants completed an 87-item survey that included Likert-style questions and open-ended qualitative responses. The survey explored participants' current frequency of use, experience with AI tools, ethical concerns, and concern with replacing clinicians, as well as likelihood to use in particular professional and clinical areas. Results were analyzed in the context of qualitative responses to typed-response open-ended questions. RESULTS A series of analyses indicated participants are somewhat knowledgeable and experienced with GPT software and other AI tools. Despite a positive outlook and the belief that AI tools are helpful for practice, programs like ChatGPT and other AI tools are infrequently used by speech-language pathologists and students for clinical purposes, mostly restricted to administrative tasks. CONCLUSION While impressions of GPT and other AI tools cite the beneficial ways that such tools can ease clinicians' workloads, participants indicate a hesitancy to use AI tools and call for institutional guidelines and training for their adoption.
Collapse
Affiliation(s)
- Julianna Austin
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| | - Keith Benas
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| | - Sara Caicedo
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| | - Emily Imiolek
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| | - Anna Piekutowski
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| | - Iyad Ghanim
- Department of Communication Sciences and Disorders, Kean University, Union, NJ
| |
Collapse
|
44
|
Foltyn-Dumitru M, Rastogi A, Cho J, Schell M, Mahmutoglu MA, Kessler T, Sahm F, Wick W, Bendszus M, Brugnara G, Vollmuth P. The potential of GPT-4 advanced data analysis for radiomics-based machine learning models. Neurooncol Adv 2025; 7:vdae230. [PMID: 39780768 PMCID: PMC11707530 DOI: 10.1093/noajnl/vdae230] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/11/2025] Open
Abstract
Background This study aimed to explore the potential of the Advanced Data Analytics (ADA) package of GPT-4 to autonomously develop machine learning models (MLMs) for predicting glioma molecular types using radiomics from MRI. Methods Radiomic features were extracted from preoperative MRI of n = 615 newly diagnosed glioma patients to predict glioma molecular types (IDH-wildtype vs IDH-mutant 1p19q-codeleted vs IDH-mutant 1p19q-non-codeleted) with a multiclass ML approach. Specifically, ADA was used to autonomously develop an ML pipeline and benchmark performance against an established handcrafted model using various MRI normalization methods (N4, Zscore, and WhiteStripe). External validation was performed on 2 public glioma datasets D2 (n = 160) and D3 (n = 410). Results GPT-4 achieved the highest accuracy of 0.820 (95% CI = 0.819-0.821) on the D3 dataset with N4/WS normalization, significantly outperforming the benchmark model's accuracy of 0.678 (95% CI = 0.677-0.680) (P < .001). Class-wise analysis showed performance variations across different glioma types. In the IDH-wildtype group, GPT-4 had a recall of 0.997 (95% CI = 0.997-0.997), surpassing the benchmark's 0.742 (95% CI = 0.740-0.743). For the IDH-mut 1p/19q-non-codel group, GPT-4's recall was 0.275 (95% CI = 0.272-0.279), lower than the benchmark's 0.426 (95% CI = 0.423-0.430). In the IDH-mut 1p/19q-codel group, GPT-4's recall was 0.199 (95% CI = 0.191-0.206), below the benchmark's 0.730 (95% CI = 0.721-0.738). On the D2 dataset, GPT-4's accuracy was significantly lower (P < .001) than the benchmark's, with N4/WS achieving 0.668 (95% CI = 0.666-0.671) compared with 0.719 (95% CI = 0.717-0.722) (P < .001). Class-wise analysis revealed the same pattern as observed in D3. Conclusions GPT-4 can autonomously develop radiomics-based MLMs, achieving performance comparable to handcrafted MLMs. However, its poorer class-wise performance due to unbalanced datasets shows limitations in handling complete end-to-end ML pipelines.
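For orientation, the sketch below shows the shape of a multiclass radiomics pipeline with class-wise recall of the kind benchmarked in this study. The data are synthetic and the model choice is an assumption; it does not reproduce either the GPT-4-generated pipeline or the handcrafted benchmark.

```python
# Hedged sketch of a multiclass radiomics classification pipeline of the kind GPT-4's ADA
# tool was asked to build: X holds radiomic features, y the three glioma molecular groups.
# Data here are synthetic; image-level normalization (N4, Zscore, WhiteStripe) is upstream
# of this snippet and not shown.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
X = rng.normal(size=(615, 100))   # 615 patients x 100 radiomic features (synthetic)
labels = ["IDH-wt", "IDH-mut codel", "IDH-mut non-codel"]
y = rng.choice(labels, size=615, p=[0.60, 0.15, 0.25])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

print("accuracy:", round(accuracy_score(y_test, pred), 3))
for label, rec in zip(labels, recall_score(y_test, pred, average=None, labels=labels)):
    print(f"recall[{label}]: {rec:.3f}")   # class-wise recall, as reported per molecular group
```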
Collapse
Affiliation(s)
- Martha Foltyn-Dumitru
- Division for Computational Radiology & Clinical AI (CCIBonn.ai), Department of Neuroradiology, Bonn University Hospital, Bonn, Germany
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Aditya Rastogi
- Division for Computational Radiology & Clinical AI (CCIBonn.ai), Department of Neuroradiology, Bonn University Hospital, Bonn, Germany
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Jaeyoung Cho
- Division for Computational Radiology & Clinical AI (CCIBonn.ai), Department of Neuroradiology, Bonn University Hospital, Bonn, Germany
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Marianne Schell
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Mustafa Ahmed Mahmutoglu
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Tobias Kessler
- Clinical Cooperation Unit Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Neurology and Neurooncology Program, Heidelberg University Hospital, Heidelberg University, Heidelberg, Germany
| | - Felix Sahm
- Clinical Cooperation Unit Neuropathology, German Cancer Consortium (DKTK), German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Neuropathology, Heidelberg University Hospital, Heidelberg, Germany
| | - Wolfgang Wick
- Clinical Cooperation Unit Neurooncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Neurology and Neurooncology Program, Heidelberg University Hospital, Heidelberg University, Heidelberg, Germany
| | - Martin Bendszus
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Gianluca Brugnara
- Division for Medical Image Computing (MIC), German Cancer Research Center (DKFZ), Heidelberg, Germany
- Division for Computational Radiology & Clinical AI (CCIBonn.ai), Department of Neuroradiology, Bonn University Hospital, Bonn, Germany
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| | - Philipp Vollmuth
- Division for Medical Image Computing (MIC), German Cancer Research Center (DKFZ), Heidelberg, Germany
- Division for Computational Radiology & Clinical AI (CCIBonn.ai), Department of Neuroradiology, Bonn University Hospital, Bonn, Germany
- Division for Computational Neuroimaging, Heidelberg University Hospital, Heidelberg, Germany
- Department of Neuroradiology, Heidelberg University Hospital, Heidelberg, Germany
| |
Collapse
|
45
|
Doreswamy N, Horstmanshof L. Generative AI Decision-Making Attributes in Complex Health Services: A Rapid Review. Cureus 2025; 17:e78257. [PMID: 40026934 PMCID: PMC11871968 DOI: 10.7759/cureus.78257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/29/2025] [Indexed: 03/05/2025] Open
Abstract
The advent of Generative Artificial Intelligence (Generative AI or GAI) marks a significant inflection point in AI development. Long viewed as the epitome of reasoning and logic, Generative AI incorporates programming rules that are normative. However, it also has a descriptive component based on its programmers' subjective preferences and any discrepancies in the underlying data. Generative AI generates both truth and falsehood, supports both ethical and unethical decisions, and is neither transparent nor accountable. These factors pose clear risks to optimal decision-making in complex health services such as health policy and health regulation. It is important to examine how Generative AI makes decisions both from a rational, normative perspective and from a descriptive point of view to ensure an ethical approach to Generative AI design, engineering, and use. The objective is to provide a rapid review that identifies and maps attributes reported in the literature that influence Generative AI decision-making in complex health services. This review provides a clear, reproducible methodology that is reported in accordance with a recognised framework and Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) 2020 standards adapted for a rapid review. Inclusion and exclusion criteria were developed, and a database search was undertaken within four search systems: ProQuest, Scopus, Web of Science, and Google Scholar. The results include articles published in 2023 and early 2024. A total of 1,550 articles were identified. After removing duplicates, 1,532 articles remained. Of these, 1,511 articles were excluded based on the selection criteria and a total of 21 articles were selected for analysis. Learning, understanding, and bias were the most frequently mentioned Generative AI attributes. Generative AI brings the promise of advanced automation, but carries significant risk. Learning and pattern recognition are helpful, but the lack of a moral compass, empathy, consideration for privacy, and a propensity for bias and hallucination are detrimental to good decision-making. The results suggest that there is, perhaps, more work to be done before Generative AI can be applied to complex health services.
Collapse
Affiliation(s)
- Nandini Doreswamy
- Faculty of Health Sciences, Southern Cross University, Lismore, AUS
- Health Sciences, National Coalition of Independent Scholars, Canberra, AUS
| | | |
Collapse
|
46
|
Liu J, Koopman B, Brown NJ, Chu K, Nguyen A. Generating synthetic clinical text with local large language models to identify misdiagnosed limb fractures in radiology reports. Artif Intell Med 2025; 159:103027. [PMID: 39580897 DOI: 10.1016/j.artmed.2024.103027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2024] [Revised: 09/26/2024] [Accepted: 11/15/2024] [Indexed: 11/26/2024]
Abstract
Large language models (LLMs) demonstrate impressive capabilities in generating human-like content and have much potential to improve the performance and efficiency of healthcare. An important application of LLMs is to generate synthetic clinical reports that could alleviate the burden of annotating and collecting real-world data in training AI models. Meanwhile, there could be concerns and limitations in using commercial LLMs to handle sensitive clinical data. In this study, we examined the use of open-source LLMs as an alternative to generate synthetic radiology reports to supplement real-world annotated data. We found that locally hosted LLMs can achieve performance similar to ChatGPT and GPT-4 in augmenting training data for the downstream report classification task of identifying misdiagnosed fractures. We also examined the predictive value of using synthetic reports alone for training downstream models, where our best setting achieved more than 90% of the performance obtained with real-world data. Overall, our findings show that open-source, local LLMs can be a favourable option for creating synthetic clinical reports for downstream tasks.
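A minimal sketch of the augmentation idea described above, assuming a locally hosted instruction-tuned model served through the Hugging Face transformers pipeline; the model name, prompt, and generation settings are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch of the data-augmentation idea described above: a locally hosted LLM drafts
# synthetic limb-fracture radiology reports that are pooled with real annotated reports
# before training a downstream classifier. Model name and prompt are assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")  # any local model

prompt = (
    "Write a short, de-identified emergency radiology report of a wrist X-ray in which "
    "a subtle fracture was initially missed. Use realistic clinical language."
)

synthetic_reports = [
    out["generated_text"]
    for out in generator(prompt, max_new_tokens=200, num_return_sequences=5, do_sample=True)
]

# The synthetic reports (labelled "missed fracture") would then be added to the
# real-world annotated set used to train the report classifier.
for report in synthetic_reports:
    print(report[:120], "...")
```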
Collapse
Affiliation(s)
- Jinghui Liu
- Australian e-Health Research Centre, CSIRO, Brisbane, Queensland, Australia.
| | - Bevan Koopman
- Australian e-Health Research Centre, CSIRO, Brisbane, Queensland, Australia
| | - Nathan J Brown
- Emergency and Trauma Centre, Royal Brisbane and Women's Hospital, Brisbane, Queensland, Australia
| | - Kevin Chu
- Emergency and Trauma Centre, Royal Brisbane and Women's Hospital, Brisbane, Queensland, Australia
| | - Anthony Nguyen
- Australian e-Health Research Centre, CSIRO, Brisbane, Queensland, Australia
| |
Collapse
|
47
|
Heisinger S, Salzmann SN, Senker W, Aspalter S, Oberndorfer J, Matzner MP, Stienen MN, Motov S, Huber D, Grohs JG. ChatGPT's Performance in Spinal Metastasis Cases-Can We Discuss Our Complex Cases with ChatGPT? J Clin Med 2024; 13:7864. [PMID: 39768787 PMCID: PMC11727723 DOI: 10.3390/jcm13247864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2024] [Revised: 12/11/2024] [Accepted: 12/19/2024] [Indexed: 01/06/2025] Open
Abstract
Background: The integration of artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT-4, is transforming healthcare. ChatGPT's potential to assist in decision-making for complex cases, such as spinal metastasis treatment, is promising but largely untested. Especially in cancer patients who develop spinal metastases, precise and personalized treatment is essential. This study examines ChatGPT-4's performance in treatment planning for spinal metastasis cases compared to experienced spine surgeons. Materials and Methods: Five spine metastasis cases were randomly selected from recent literature. Subsequently, five spine surgeons and ChatGPT-4 were tasked with providing treatment recommendations for each case in a standardized manner. Responses were analyzed for frequency distribution, agreement, and subjective rater opinions. Results: ChatGPT's treatment recommendations aligned with the majority of human raters in 73% of treatment choices, with moderate to substantial agreement on systemic therapy, pain management, and supportive care. However, ChatGPT's recommendations tended towards generalized statements, a tendency the raters also noted. Agreement among raters improved in sensitivity analyses excluding ChatGPT, particularly for controversial areas like surgical intervention and palliative care. Conclusions: ChatGPT shows potential in aligning with experienced surgeons on certain treatment aspects of spinal metastasis. However, its generalized approach highlights limitations, suggesting that training with specific clinical guidelines could enhance its utility in complex case management. Further studies are necessary to refine AI applications in personalized healthcare decision-making.
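The "moderate to substantial agreement" reported above refers to chance-corrected inter-rater statistics; the sketch below shows one common way to compute such a statistic (Fleiss' kappa) over a panel of raters. The rating matrix is invented, and whether the authors used this exact statistic is an assumption.

```python
# Hedged sketch of an inter-rater agreement calculation of the kind summarized above
# (e.g., surgeons plus ChatGPT rating whether systemic therapy is indicated).
# The ratings are invented and do not reproduce the study's data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# rows = cases, columns = raters (5 surgeons + ChatGPT); 1 = recommends, 0 = does not
ratings = np.array([
    [1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 0, 1],
])

table, _ = aggregate_raters(ratings)        # per-case counts of each rating category
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa: {kappa:.2f}")        # values of 0.41-0.60 are conventionally 'moderate'
```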
Collapse
Affiliation(s)
- Stephan Heisinger
- Department of Orthopedics and Trauma Surgery, Medical University of Vienna, 1090 Vienna, Austria; (S.H.)
| | - Stephan N. Salzmann
- Department of Orthopedics and Trauma Surgery, Medical University of Vienna, 1090 Vienna, Austria; (S.H.)
| | - Wolfgang Senker
- Department of Neurosurgery, Kepler University Hospital, 4020 Linz, Austria (S.A.)
| | - Stefan Aspalter
- Department of Neurosurgery, Kepler University Hospital, 4020 Linz, Austria (S.A.)
| | - Johannes Oberndorfer
- Department of Neurosurgery, Kepler University Hospital, 4020 Linz, Austria (S.A.)
| | - Michael P. Matzner
- Department of Orthopedics and Trauma Surgery, Medical University of Vienna, 1090 Vienna, Austria; (S.H.)
| | - Martin N. Stienen
- Spine Center of Eastern Switzerland & Department of Neurosurgery, Kantonsspital St. Gallen, Medical School of St. Gallen, University of St.Gallen, 9000 St. Gallen, Switzerland
| | - Stefan Motov
- Spine Center of Eastern Switzerland & Department of Neurosurgery, Kantonsspital St. Gallen, Medical School of St. Gallen, University of St.Gallen, 9000 St. Gallen, Switzerland
| | - Dominikus Huber
- Division of Oncology, Department of Medicine I, Medical University of Vienna, 1090 Vienna, Austria
| | - Josef Georg Grohs
- Department of Orthopedics and Trauma Surgery, Medical University of Vienna, 1090 Vienna, Austria; (S.H.)
| |
Collapse
|
48
|
Başaran M, Duman C. Dialogues with artificial intelligence: Exploring medical students' perspectives on ChatGPT. MEDICAL TEACHER 2024:1-10. [PMID: 39692300 DOI: 10.1080/0142159x.2024.2438766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Accepted: 12/03/2024] [Indexed: 12/19/2024]
Abstract
ChatGPT has initiated a new era of inquiry into sources of information within the scientific community. Studies leveraging ChatGPT in the medical field have demonstrated notable performance in academic processes and healthcare applications. This research presents how medical students have benefited from ChatGPT during their educational journey and the challenges they encountered, as reported through their personal experiences. The methodological framework of this study adheres to the stages of qualitative research. An explanatory case study, a qualitative research method, was adopted to determine user experiences with ChatGPT. Content analysis based on student experiences with ChatGPT indicates that it may offer advantages in health education as a resource for scientific research activities. However, adverse reports were also identified, including ethical issues, lack of personal data protection, and potential misuse in scientific research. This study emphasizes the need for comprehensive steps in effectively integrating AI tools like ChatGPT into medical education as a new technology.
Collapse
Affiliation(s)
- Mehmet Başaran
- Curriculum and Instruction, Gaziantep University, Gaziantep, Turkey
| | - Cevahir Duman
- Curriculum and Instruction, Gaziantep University, Gaziantep, Turkey
| |
Collapse
|
49
|
Chung D, Sidhom K, Dhillon H, Bal DS, Fidel MG, Jawanda G, Patel P. Real-world utility of ChatGPT in pre-vasectomy counselling, a safe and efficient practice: a prospective single-centre clinical study. World J Urol 2024; 43:32. [PMID: 39673635 DOI: 10.1007/s00345-024-05385-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2024] [Accepted: 11/15/2024] [Indexed: 12/16/2024] Open
Abstract
PURPOSE This study sought to assess whether pre-vasectomy counselling with ChatGPT can safely streamline the consultation process by reducing visit times and increasing patient satisfaction. METHODS A single-institution randomized pilot study was conducted to evaluate the safety and efficacy of ChatGPT for pre-vasectomy counselling. All adult patients interested in undergoing a vasectomy were included. Unwillingness to provide consent or lack of internet access constituted exclusion criteria. Patients were randomized 1:1 to ChatGPT counselling plus standard in-person consultation or to in-person consultation without ChatGPT. Length of visit, number of questions asked, and responses to a Likert scale questionnaire (on a scale of 0 to 10, with 10 defined as great and 0 as poor) were collected. Descriptive statistics and a comparative analysis were performed. RESULTS Eighteen patients were included, with a mean age of 35.8 ± 5.4 (n = 9) in the intervention arm and 36.9 ± 7.4 (n = 9) in the control arm. Pre-vasectomy counselling with ChatGPT was associated with a higher provider perception of patient understanding of the procedure (8.8 ± 1.0 vs. 6.7 ± 2.8; p = 0.047) and a decreased length of in-person consultation (7.7 ± 2.3 min vs. 10.6 ± 3.4 min; p = 0.05). Quality of information provided by ChatGPT, ease of use, and overall experience were rated highly at 8.3 ± 1.9, 9.1 ± 1.5, and 8.6 ± 1.7, respectively. CONCLUSIONS ChatGPT for pre-vasectomy counselling improved the efficiency of consultations and the provider's perception of the patient's understanding of the procedure.
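The consultation-length comparison above (7.7 ± 2.3 vs. 10.6 ± 3.4 minutes, n = 9 per arm, p = 0.05) can be approximately reproduced from the summary statistics with an independent-samples t-test, as sketched below; whether the authors used exactly this test is an assumption.

```python
# Hedged sketch: reproducing the consultation-length comparison from the reported summary
# statistics with an independent-samples t-test. The choice of test is an assumption.
from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(mean1=7.7, std1=2.3, nobs1=9,
                              mean2=10.6, std2=3.4, nobs2=9)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.3f}")   # p is approximately 0.05
```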
Collapse
Affiliation(s)
- David Chung
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada.
| | - Karim Sidhom
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada
| | | | - Dhiraj S Bal
- Max Rady College of Medicine, University of Manitoba, Winnipeg, MB, Canada
| | - Maximilian G Fidel
- Max Rady College of Medicine, University of Manitoba, Winnipeg, MB, Canada
| | - Gary Jawanda
- Manitoba Men's Health Clinic, Winnipeg, MB, Canada
| | - Premal Patel
- Section of Urology, Department of Surgery, University of Manitoba, AD203-720 McDermot Avenue, Winnipeg, Manitoba, R3N 1B1, Canada
- Manitoba Men's Health Clinic, Winnipeg, MB, Canada
| |
Collapse
|
50
|
Koyama H, Kashio A, Yamasoba T. Application of Artificial Intelligence in Otology: Past, Present, and Future. J Clin Med 2024; 13:7577. [PMID: 39768500 PMCID: PMC11727971 DOI: 10.3390/jcm13247577] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2024] [Revised: 12/11/2024] [Accepted: 12/11/2024] [Indexed: 01/16/2025] Open
Abstract
Artificial Intelligence (AI) is a concept whose goal is to imitate human intellectual activity in computers. The field emerged in the 1950s and has gone through three booms; we are currently in the third, which is expected to continue. Medical applications of AI include diagnosing otitis media from images of the eardrum, often outperforming human doctors. Temporal bone CT and MRI analyses also benefit from AI, with improved segmentation accuracy for anatomically significant structures and improved diagnostic accuracy for conditions such as otosclerosis and vestibular schwannoma. In treatment, AI predicts hearing outcomes for sudden sensorineural hearing loss and post-operative hearing outcomes for patients who have undergone tympanoplasty. AI helps patients with hearing aids hear in challenging situations, such as in noisy environments or when multiple people are speaking, and provides fitting information to help improve hearing with hearing aids. AI also improves cochlear implant mapping and outcome prediction, even in cases of cochlear malformation. Future trends include generative AI, such as ChatGPT, which can provide medical advice and information, although its reliability and application in clinical settings require further investigation.
Collapse
Affiliation(s)
- Hajime Koyama
- Department of Otolaryngology and Head and Neck Surgery, Faculty of Medicine, University of Tokyo, Tokyo 113-8655, Japan (A.K.)
| | - Akinori Kashio
- Department of Otolaryngology and Head and Neck Surgery, Faculty of Medicine, University of Tokyo, Tokyo 113-8655, Japan (A.K.)
| | - Tatsuya Yamasoba
- Department of Otolaryngology and Head and Neck Surgery, Faculty of Medicine, University of Tokyo, Tokyo 113-8655, Japan (A.K.)
- Department of Otolaryngology, Tokyo Teishin Hospital, Tokyo 102-8798, Japan
| |
Collapse
|