1. Norris SA, Kron T, Masterson M, Badawy MK. An Australasian survey on the use of ChatGPT and other large language models in medical physics. Phys Eng Sci Med 2025. PMID: 40392469; DOI: 10.1007/s13246-025-01571-9.
Abstract
This study surveyed medical physicists in Australia and New Zealand on their use of large language models (LLMs), particularly ChatGPT. There is currently no literature on the application of ChatGPT and other LLMs by medical physicists. The survey targeted a mixed group of professionals, including clinical medical physicists, registrars, students, and other specialised roles. It reveals that many respondents integrate LLM platforms into their work for a broad range of tasks. Most participants reported efficiency gains, although fewer perceived improvements in the overall quality of their work. Despite these benefits, substantial concerns remain regarding data security, patient confidentiality, and the lack of established guidelines or professional training for using these tools in a clinical context. Furthermore, the potential for sudden changes in accessibility and pricing, which could disproportionately affect developing countries and under-resourced departments, points to additional vulnerabilities in relying on these platforms. These findings suggest that the medical physics community needs to debate, collectively, how to balance the exploitation of LLM platforms against the development of clear best practices and robust risk management strategies.
Affiliation(s)
- Stanley A Norris
- Monash Health, Monash Imaging, Melbourne, Australia
- Peter MacCallum Cancer Centre, Department of Physical Sciences, Melbourne, Australia
- Tomas Kron
- Peter MacCallum Cancer Centre, Department of Physical Sciences, Melbourne, Australia
- Sir Peter MacCallum Department of Oncology, University of Melbourne, Melbourne, Australia
- Mohamed K Badawy
- Monash Health, Monash Imaging, Melbourne, Australia
- Department of Medical Imaging and Radiation Sciences, Monash University, Melbourne, Australia
2. Chuang WK, Kao YS, Liu YT, Lee CY. Assessing ChatGPT for clinical decision-making in radiation oncology, with open-ended questions and images. Pract Radiat Oncol 2025. PMID: 40311921; DOI: 10.1016/j.prro.2025.04.009.
Abstract
PURPOSE This study assesses the practicality and correctness of ChatGPT-4's and GPT-4o's answers to clinical inquiries in radiation oncology, and evaluates GPT-4o for staging nasopharyngeal carcinoma (NPC) cases with MR images. METHODS 164 open-ended questions covering representative professional domains (Clinical_G: knowledge of standardized guidelines; Clinical_C: complex clinical scenarios; Nursing: nursing and health education; Technology: radiation technology and dosimetry) were prospectively formulated by experts and presented to ChatGPT-4 and GPT-4o. Each answer was graded as 1 (directly practical for clinical decision-making), 2 (correct but inadequate), 3 (mixed correct and incorrect information), or 4 (completely incorrect). GPT-4o was presented with representative diagnostic MR images of 20 NPC patients across different T stages and asked to determine the T stage of each case. RESULTS The proportions of answers that were practical (Grade 1) varied across professional domains (p<0.01), higher in the Nursing (GPT-4: 91.9%; GPT-4o: 94.6%) and Clinical_G (GPT-4: 82.2%; GPT-4o: 88.9%) domains than in the Clinical_C (GPT-4: 54.1%; GPT-4o: 62.2%) and Technology (GPT-4: 64.4%; GPT-4o: 77.8%) domains. The proportions of correct (Grade 1+2) answers (GPT-4: 89.6%; GPT-4o: 98.8%; p<0.01) were universally high across all professional domains. However, GPT-4o failed to stage NPC cases from MR images, indiscriminately assigning T4 to all non-T4 cases (κ = 0; 95% CI: -0.253 to 0.253). CONCLUSIONS ChatGPT could be a safe clinical decision-support tool in radiation oncology, as it correctly answered the vast majority of clinical inquiries across professional domains. However, its clinical practicality should be weighed cautiously, particularly in the Clinical_C and Technology domains. GPT-4o is not yet mature enough to interpret diagnostic images for cancer staging.
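The staging result above is reported as Cohen's κ with a confidence interval. As a minimal illustration of how such an agreement statistic behaves when a model assigns the same stage to every case, the hedged sketch below uses scikit-learn with invented stage labels, not the study's patient data:

```python
# Illustrative only: invented T stages, not the study's data.
from sklearn.metrics import cohen_kappa_score

true_stages  = ["T1", "T2", "T2", "T3", "T4", "T4", "T1", "T3"]  # hypothetical ground truth
model_stages = ["T4"] * len(true_stages)                          # model labels every case T4

kappa = cohen_kappa_score(true_stages, model_stages)
print(f"Cohen's kappa: {kappa:.3f}")  # 0.0 -> agreement no better than chance
```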
Affiliation(s)
- Wei-Kai Chuang
- Department of Radiation Oncology, Shuang Ho Hospital, Taipei Medical University, New Taipei City 235, Taiwan; Department of Biomedical Imaging and Radiological Sciences, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
- Yung-Shuo Kao
- Department of Radiation Oncology, Taoyuan General Hospital, Ministry of Health and Welfare, Taoyuan 330, Taiwan
- Yen-Ting Liu
- Division of Radiation Oncology, Department of Oncology, National Taiwan University Hospital Yunlin Branch, Yunlin County 632, Taiwan; Department of Biomedical Engineering, National Taiwan University, Taipei 100, Taiwan; Division of Radiation Oncology, Department of Oncology, National Taiwan University Hospital, Taipei 100, Taiwan
- Cho-Yin Lee
- Department of Radiation Oncology, Taoyuan General Hospital, Ministry of Health and Welfare, Taoyuan 330, Taiwan; Department of Biomedical Engineering, National Yang Ming Chiao Tung University, Taipei 112, Taiwan
3. Grilo A, Marques C, Corte-Real M, Carolino E, Caetano M. Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4. JMIR Cancer 2025; 11:e63677. PMID: 40239208; PMCID: PMC12017613; DOI: 10.2196/63677.
Abstract
Background Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment. Objective This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4. Methods We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used. Results GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public. Conclusions Both GPT-3.5 and GPT-4 demonstrated having the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.
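For readers unfamiliar with the metrics above, the sketch below shows one plausible way to compute a cosine similarity score and the two Flesch readability measures for a pair of chatbot answers; the embedding model and example texts are assumptions for illustration, not the study's exact tooling:

```python
# Sketch: response similarity via sentence-embedding cosine similarity (values near 1
# indicate near-identical content) and readability via Flesch scores.
import textstat
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

answer_a = "Radiotherapy uses high-energy radiation to destroy cancer cells while sparing nearby healthy tissue."
answer_b = "Radiation therapy treats cancer by damaging tumour cell DNA, with techniques that limit dose to healthy organs."

model = SentenceTransformer("all-MiniLM-L6-v2")            # assumed embedding model
emb = model.encode([answer_a, answer_b])
similarity = cosine_similarity(emb[:1], emb[1:])[0][0]
print(f"Cosine similarity: {similarity:.2f}")

print(f"Flesch Reading Ease: {textstat.flesch_reading_ease(answer_b):.1f}")    # 0-100, higher = easier
print(f"Flesch-Kincaid Grade: {textstat.flesch_kincaid_grade(answer_b):.1f}")  # approximate school grade level
```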
Affiliation(s)
- Ana Grilo
- Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal
- Catarina Marques
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Maria Corte-Real
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Elisabete Carolino
- Research Center for Psychological Science of the Faculty of Psychology, University of Lisbon (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal
- Marco Caetano
- Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
4. Chen D, Avison K, Alnassar S, Huang RS, Raman S. Medical accuracy of artificial intelligence chatbots in oncology: a scoping review. Oncologist 2025; 30:oyaf038. PMID: 40285677; PMCID: PMC12032582; DOI: 10.1093/oncolo/oyaf038.
Abstract
BACKGROUND Recent advances in large language models (LLM) have enabled human-like qualities of natural language competency. Applied to oncology, LLMs have been proposed to serve as an information resource and interpret vast amounts of data as a clinical decision-support tool to improve clinical outcomes. OBJECTIVE This review aims to describe the current status of medical accuracy of oncology-related LLM applications and research trends for further areas of investigation. METHODS A scoping literature search was conducted on Ovid Medline for peer-reviewed studies published since 2000. We included primary research studies that evaluated the medical accuracy of a large language model applied in oncology settings. Study characteristics and primary outcomes of included studies were extracted to describe the landscape of oncology-related LLMs. RESULTS Sixty studies were included based on the inclusion and exclusion criteria. The majority of studies evaluated LLMs in oncology as a health information resource in question-answer style examinations (48%), followed by diagnosis (20%) and management (17%). The number of studies that evaluated the utility of fine-tuning and prompt-engineering LLMs increased over time from 2022 to 2024. Studies reported the advantages of LLMs as an accurate information resource, reduction of clinician workload, and improved accessibility and readability of clinical information, while noting disadvantages such as poor reliability, hallucinations, and need for clinician oversight. DISCUSSION There exists significant interest in the application of LLMs in clinical oncology, with a particular focus as a medical information resource and clinical decision support tool. However, further research is needed to validate these tools in external hold-out datasets for generalizability and to improve medical accuracy across diverse clinical scenarios, underscoring the need for clinician supervision of these tools.
Affiliation(s)
- David Chen
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Kate Avison
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Saif Alnassar
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Ryan S Huang
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Srinivas Raman
- Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Department of Radiation Oncology, University of Toronto, Toronto, ON M5T 1P5, Canada
- Department of Radiation Oncology, BC Cancer, Vancouver, BC V5Z 1G1, Canada
- Division of Radiation Oncology, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
5. Zhu S, Ma SJ, Farag A, Huerta T, Gamez ME, Blakaj DM. Artificial Intelligence, Machine Learning and Big Data in Radiation Oncology. Hematol Oncol Clin North Am 2025; 39:453-469. PMID: 39779423; DOI: 10.1016/j.hoc.2024.12.002.
Abstract
This review explores the applications of artificial intelligence and machine learning (AI/ML) in radiation oncology, focusing on computer vision (CV) and natural language processing (NLP) techniques. We examined CV-based AI/ML in digital pathology and radiomics, highlighting the prospective clinical studies demonstrating their utility. We also reviewed NLP-based AI/ML applications in clinical documentation analysis, knowledge assessment, and quality assurance. While acknowledging the challenges for clinical adoption, this review underscores the transformative potential of AI/ML in enhancing precision, efficiency, and quality of care in radiation oncology.
Affiliation(s)
- Simeng Zhu
- Department of Radiation Oncology, The Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, The Ohio State University Comprehensive Cancer Center, 460 West 10th Avenue, Columbus, OH 43210, USA
- Sung Jun Ma
- Department of Radiation Oncology, The Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, The Ohio State University Comprehensive Cancer Center, 460 West 10th Avenue, Columbus, OH 43210, USA
- Alexander Farag
- Department of Radiation Oncology, The Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, The Ohio State University Comprehensive Cancer Center, 460 West 10th Avenue, Columbus, OH 43210, USA; Department of Otolaryngology-Head and Neck Surgery, Jacksonville Sinus and Nasal Institute, 836 Prudential Drive Suite 1601, Jacksonville, FL 32207, USA
- Timothy Huerta
- Department of Biomedical Informatics, The Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, The Ohio State University Comprehensive Cancer Center, 460 West 10th Avenue, Columbus, OH 43210, USA
- Mauricio E Gamez
- Department of Radiation Oncology, Mayo Clinic, 200 First Street Southwest, Rochester, MN 55905, USA
- Dukagjin M Blakaj
- Division of Head and Neck/Skull Base, Department of Radiation Oncology, The Arthur G. James Cancer Hospital and Richard J. Solove Research Institute, The Ohio State University Comprehensive Cancer Center, 460 West 10th Avenue, Columbus, OH 43210, USA
6. Zada T, Tam N, Barnard F, Van Sittert M, Bhat V, Rambhatla S. Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models. JMIR Form Res 2025; 9:e66207. PMID: 40063849; PMCID: PMC11913316; DOI: 10.2196/66207.
Abstract
Background Rapid integration of large language models (LLMs) in health care is sparking global discussion about their potential to revolutionize health care quality and accessibility. At a time when improving health care quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical examinations is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading health care misinformation has not been evaluated. Objective This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual self-diagnosing to better understand the clarity, correctness, and robustness of the models. Methods We propose the comprehensive testing methodology evaluation of LLM prompts (EvalPrompt). This evaluation methodology uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and experiment 2 performs sentence dropout on the correct responses from experiment 1 to mimic self-diagnosis with missing information. Humans then assess the responses returned by ChatGPT for both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. Results In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts, with only 34% (32/94) agreement between the 2 groups. Similarly, in experiment 2, which assessed robustness, 61% (92/152) of the responses continued to be categorized as correct by all assessors. As a result, in comparison to a passing threshold of 60%, ChatGPT-4.0 is considered incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed. Conclusions The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in health care systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields.
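The robustness experiment above hinges on "sentence dropout": deleting sentences from a correct response before re-prompting the model. A hedged sketch of that perturbation step is shown below; the sentence splitter and sample text are assumptions, not the EvalPrompt implementation:

```python
# Sketch of a sentence-dropout perturbation to mimic self-diagnosis with missing information.
import random
import re

def sentence_dropout(text: str, drop_rate: float = 0.3, seed: int = 0) -> str:
    """Randomly remove roughly `drop_rate` of the sentences in `text`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    rng = random.Random(seed)
    kept = [s for s in sentences if rng.random() > drop_rate]
    return " ".join(kept) if kept else sentences[0]  # always keep at least one sentence

symptoms = ("I have had a dry cough for two weeks. I also have a mild fever. "
            "I feel short of breath when climbing stairs. I have no chest pain.")
print(sentence_dropout(symptoms, drop_rate=0.5))  # re-submit the degraded text to the LLM and re-grade
```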
Affiliation(s)
- Troy Zada
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
- Natalie Tam
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
- Francois Barnard
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
- Venkat Bhat
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Interventional Psychiatry Program, St. Michael's Hospital, Unity Health Toronto, Toronto, ON, Canada
- Sirisha Rambhatla
- Department of Management Sciences and Engineering, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
7. Fernández-Pichel M, Pichel JC, Losada DE. Evaluating search engines and large language models for answering health questions. NPJ Digit Med 2025; 8:153. PMID: 40065094; PMCID: PMC11894092; DOI: 10.1038/s41746-025-01546-w.
Abstract
Search engines (SEs) have traditionally been the primary tools for information seeking, but large language models (LLMs) are emerging as powerful alternatives, particularly for question-answering tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented generation (RAG) variants in answering 150 health-related questions from the TREC Health Misinformation (HM) Track. Results reveal that SEs correctly answer 50-70% of questions, often hindered by retrieved results that do not address the health question. LLMs deliver higher accuracy, correctly answering about 80% of questions, though their performance is sensitive to input prompts. RAG methods significantly enhance the effectiveness of smaller LLMs, improving accuracy by up to 30% by integrating retrieval evidence.
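The RAG gains described above come from prepending retrieved evidence to the question before the model answers. The sketch below shows a minimal version of that idea using TF-IDF retrieval; the corpus, question, and prompt wording are illustrative assumptions rather than the paper's pipeline:

```python
# Minimal retrieval-augmented prompting sketch: retrieve top-k evidence passages
# and prepend them to the health question. Corpus and prompt are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Vitamin C supplements have not been shown to prevent the common cold.",
    "Regular handwashing reduces transmission of respiratory viruses.",
    "Antibiotics are ineffective against viral infections such as colds.",
]
question = "Can vitamin C prevent the common cold?"

vectorizer = TfidfVectorizer().fit(corpus + [question])
scores = cosine_similarity(vectorizer.transform([question]), vectorizer.transform(corpus))[0]
top_passages = [corpus[i] for i in scores.argsort()[::-1][:2]]  # two most similar passages

prompt = ("Answer yes, no, or unsure using only the evidence provided.\n"
          "Evidence:\n- " + "\n- ".join(top_passages) +
          f"\nQuestion: {question}\nAnswer:")
print(prompt)  # pass this prompt to any LLM client; smaller models benefit most from the added evidence
```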
Affiliation(s)
- Marcos Fernández-Pichel
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
- Juan C Pichel
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
- David E Losada
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
8. Wawrzuta D, Napieralska A, Ludwikowska K, Jaruševičius L, Trofimoviča-Krasnorucka A, Rausis G, Szulc A, Pędziwiatr K, Poláchová K, Klejdysz J, Chojnacka M. Large language models for pretreatment education in pediatric radiation oncology: A comparative evaluation study. Clin Transl Radiat Oncol 2025; 51:100914. PMID: 39867725; PMCID: PMC11762905; DOI: 10.1016/j.ctro.2025.100914.
Abstract
Background and purpose Pediatric radiotherapy patients and their parents are usually aware of their need for radiotherapy early on, but they meet with a radiation oncologist later in their treatment. Consequently, they search for information online, often encountering unreliable sources. Large language models (LLMs) have the potential to serve as an educational pretreatment tool, providing reliable answers to their questions. We aimed to evaluate the responses provided by generative pre-trained transformers (GPT), the most popular subgroup of LLMs, to questions about pediatric radiation oncology. Materials and methods We collected pretreatment questions regarding radiotherapy from patients and parents. Responses were generated using GPT-3.5, GPT-4, and fine-tuned GPT-3.5, with fine-tuning based on pediatric radiotherapy guides from various institutions. Additionally, a radiation oncologist prepared answers to these questions. Finally, a multi-institutional group of nine pediatric radiotherapy experts conducted a blind review of responses, assessing reliability, concision, and comprehensibility. Results The radiation oncologist and GPT-4 provided the highest-quality responses, though GPT-4's answers were often excessively verbose. While fine-tuned GPT-3.5 generally outperformed basic GPT-3.5, it often provided overly simplistic answers. Inadequate responses were rare, occurring in 4% of GPT-generated responses across all models, primarily due to GPT-3.5 generating excessively long responses. Conclusions LLMs can be valuable tools for educating patients and their families before treatment in pediatric radiation oncology. Among them, only GPT-4 provides information of a quality comparable to that of a radiation oncologist, although it still occasionally generates poor-quality responses. GPT-3.5 models should be used cautiously, as they are more likely to produce inadequate answers to patient questions.
Affiliation(s)
- Dominik Wawrzuta
- Department of Radiation Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, Wawelska 15B, 02-034 Warsaw, Poland
- Aleksandra Napieralska
- Radiotherapy Department, Maria Sklodowska-Curie National Research Institute of Oncology, Wybrzeże Armii Krajowej 15, 44-100 Gliwice, Poland
- Department of Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, Garncarska 11, 31-115 Cracow, Poland
- Faculty of Medicine & Health Sciences, Andrzej Frycz Modrzewski Krakow University, Gustawa Herlinga-Grudzińskiego 1, 30-705 Cracow, Poland
- Katarzyna Ludwikowska
- Department of Radiation Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, Wawelska 15B, 02-034 Warsaw, Poland
- Laimonas Jaruševičius
- Oncology Institute, Lithuanian University of Health Sciences, A. Mickevičiaus g. 9, LT-44307 Kaunas, Lithuania
- Anastasija Trofimoviča-Krasnorucka
- Department of Radiation Oncology, Riga East University Hospital, Hipokrāta iela 2, LV-1038 Riga, Latvia
- Department of Internal Diseases, Riga Stradiņš University, Dzirciema iela 16, LV-1007 Riga, Latvia
- Gints Rausis
- Department of Radiation Oncology, Riga East University Hospital, Hipokrāta iela 2, LV-1038 Riga, Latvia
- Agata Szulc
- Department of Radiation Oncology, Lower Silesian Center of Oncology, Pulmonology and Hematology, Hirszfelda 12, 53-413 Wroclaw, Poland
- Katarzyna Pędziwiatr
- Department of Radiation Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, Wawelska 15B, 02-034 Warsaw, Poland
- Kateřina Poláchová
- Department of Radiation Oncology, Masaryk Memorial Cancer Institute, Žlutý kopec 7, 656 53 Brno, Czech Republic
- Department of Radiation Oncology, Faculty of Medicine, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- Justyna Klejdysz
- Department of Economics, Ludwig Maximilian University of Munich (LMU), Geschwister-Scholl-Platz 1, 80539 Munich, Germany
- ifo Institute, Poschinger Straße 5, 81679 Munich, Germany
- Marzanna Chojnacka
- Department of Radiation Oncology, Maria Sklodowska-Curie National Research Institute of Oncology, Wawelska 15B, 02-034 Warsaw, Poland
9. Hao Y, Holmes J, Hobson J, Bennett A, McKone EL, Ebner DK, Routman DM, Shiraishi S, Patel SH, Yu NY, Hallemeier CL, Ball BE, Waddle M, Liu W. Retrospective Comparative Analysis of Prostate Cancer In-Basket Messages: Responses From Closed-Domain Large Language Models Versus Clinical Teams. Mayo Clin Proc Digit Health 2025; 3:100198. PMID: 40130001; PMCID: PMC11932704; DOI: 10.1016/j.mcpdig.2025.100198.
Abstract
Objective To evaluate the effectiveness of RadOnc-GPT (radiation oncology generative pretrained transformer), a GPT-4-based large language model, in assisting with in-basket message response generation for prostate cancer treatment, with the goal of reducing the workload and time demands on clinical care teams while maintaining response quality. Patients and Methods RadOnc-GPT was integrated with electronic health records from both Mayo Clinic-wide databases and a radiation-oncology-specific database. The model was evaluated on 158 previously recorded in-basket message interactions, selected from 90 patients with nonmetastatic prostate cancer in the Mayo Clinic Department of Radiation Oncology in-basket message database for the calendar years 2022-2024. Quantitative natural language processing analysis and 2 grading studies, conducted by 5 clinicians and 4 nurses, were used to assess RadOnc-GPT's responses. Three primary clinicians independently graded all messages, whereas a fourth senior clinician reviewed 41 responses with relevant discrepancies, and a fifth senior clinician evaluated 2 additional responses. The grading focused on 5 key areas: completeness, correctness, clarity, empathy, and editing time. The grading study was performed from July 20, 2024 to December 15, 2024. Results RadOnc-GPT slightly outperformed the clinical care team in empathy while achieving comparable scores in completeness, correctness, and clarity. The five clinician graders identified key limitations in RadOnc-GPT's responses, such as lack of context, insufficient domain-specific knowledge, inability to perform essential meta-tasks, and hallucination. It was estimated that RadOnc-GPT could save an average of 5.2 minutes per message for nurses and 2.4 minutes for clinicians, from reading the inquiry to sending the response. Conclusion RadOnc-GPT has the potential to considerably reduce the workload of clinical care teams by generating high-quality, timely responses for in-basket message interactions. This could lead to improved efficiency in health care workflows and reduced costs while maintaining or enhancing the quality of communication between patients and health care providers. Abbreviations and acronyms: AI, artificial intelligence; LLM, large language model; NLP, natural language processing; RadOnc-GPT, radiation oncology generative pretrained transformer.
Affiliation(s)
- Yuexing Hao
- Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ
- Cornell University, Ithaca, NY
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA
- Jason Holmes
- Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ
- Jared Hobson
- Department of Radiation Oncology, Mayo Clinic, Rochester, MN
- Daniel K. Ebner
- Department of Radiation Oncology, Mayo Clinic, Rochester, MN
- Samir H. Patel
- Department of Radiation Oncology, Mayo Clinic, Rochester, MN
- Nathan Y. Yu
- Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ
- Brooke E. Ball
- Department of Radiation Oncology, Mayo Clinic, Rochester, MN
- Mark Waddle
- Department of Radiation Oncology, Mayo Clinic, Rochester, MN
- Wei Liu
- Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ
10. Alfonzetti T, Xia J. Transforming the Landscape of Clinical Information Retrieval Using Generative Artificial Intelligence: An Application in Machine Fault Analysis. Pract Radiat Oncol 2025. PMID: 40024439; DOI: 10.1016/j.prro.2025.02.006.
Abstract
In a radiation oncology clinic, machine downtime can be a serious burden to the entire department. This study investigates using increasingly popular generative artificial intelligence (AI) techniques to assist medical physicists in troubleshooting linear accelerator issues. Google's NotebookLM, supplemented with background information on linear accelerator issues/solutions, was used as a machine troubleshooting assistant for this purpose. Two board-certified medical physicists evaluated the large language model's responses based on hallucination, relevancy, correctness, and completeness. Results indicated that responses improved with increasing source data context and more specific prompt construction. Keeping risk mitigation and the inherent limitations of AI in mind, this work offers a viable, low-risk method to improve efficiency in radiation oncology. This work uses a "Machine Troubleshooting Assistance" application to provide an adaptable example of how radiation oncology clinics can begin using generative AI to enhance clinical efficiency.
Affiliation(s)
- Tyler Alfonzetti
- Department of Radiation Oncology, Mount Sinai Hospital, New York, New York
- Junyi Xia
- Department of Radiation Oncology, Mount Sinai Hospital, New York, New York
11. Sabri H, Saleh MHA, Hazrati P, Merchant K, Misch J, Kumar PS, Wang H, Barootchi S. Performance of three artificial intelligence (AI)-based large language models in standardized testing; implications for AI-assisted dental education. J Periodontal Res 2025; 60:121-133. PMID: 39030766; PMCID: PMC11873669; DOI: 10.1111/jre.13323.
Abstract
INTRODUCTION The emerging rise of novel computer technologies and automated data analytics has the potential to change the course of dental education. In line with our long-term goal of harnessing the power of AI to augment didactic teaching, the objective of this study was to quantify and compare the accuracy of responses provided by ChatGPT (GPT-4 and GPT-3.5) and Google Gemini, three leading large language models (LLMs), with that of human graduate students (control group) on the annual in-service examination questions posed by the American Academy of Periodontology (AAP). METHODS Under a comparative cross-sectional study design, a corpus of 1312 questions from the annual in-service examination of the AAP administered between 2020 and 2023 was presented to the LLMs. Their responses were analyzed using chi-square tests, and their performance was juxtaposed with the scores of periodontal residents from the corresponding years, as the human control group. Additionally, two sub-analyses were performed: one on the performance of the LLMs on each section of the exam, and one on their performance in answering the most difficult questions. RESULTS ChatGPT-4 (total average: 79.57%) outperformed all human control groups as well as GPT-3.5 and Google Gemini in all exam years (p < .001). This chatbot showed an accuracy range between 78.80% and 80.98% across the various exam years. Gemini consistently recorded superior performance with scores of 70.65% (p = .01), 73.29% (p = .02), 75.73% (p < .01), and 72.18% (p = .0008) for the exams from 2020 to 2023 compared to ChatGPT-3.5, which achieved 62.5%, 68.24%, 69.83%, and 59.27%, respectively. Google Gemini (72.86%) surpassed the average scores achieved by first- (63.48% ± 31.67) and second-year residents (66.25% ± 31.61) when all exam years were combined. However, it could not surpass that of third-year residents (69.06% ± 30.45). CONCLUSIONS Within the confines of this analysis, ChatGPT-4 exhibited a robust capability in answering AAP in-service exam questions in terms of accuracy and reliability, while Gemini and ChatGPT-3.5 showed weaker performance. These findings underscore the potential of deploying LLMs as an educational tool in the periodontics and oral implantology domains. However, the current limitations of these models, such as the inability to effectively process image-based inquiries, the propensity for generating inconsistent responses to the same prompts, and high (80% by GPT-4) but not absolute accuracy rates, should be considered. An objective comparison of their capability versus their capacity is required to further develop this field of study.
Affiliation(s)
- Hamoun Sabri
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Center for Clinical Research and Evidence Synthesis in Oral Tissue Regeneration (CRITERION), Ann Arbor, Michigan, USA
- Muhammad H. A. Saleh
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Parham Hazrati
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Jonathan Misch
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Private Practice, Ann Arbor, Michigan, USA
- Purnima S. Kumar
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Hom-Lay Wang
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Shayan Barootchi
- Department of Periodontics and Oral Medicine, School of Dentistry, University of Michigan, Ann Arbor, Michigan, USA
- Center for Clinical Research and Evidence Synthesis in Oral Tissue Regeneration (CRITERION), Ann Arbor, Michigan, USA
- Division of Periodontology, Department of Oral Medicine, Infection, and Immunity, Harvard School of Dental Medicine, Boston, Massachusetts, USA
12. Li X, Zhao L, Zhang L, Wu Z, Liu Z, Jiang H, Cao C, Xu S, Li Y, Dai H, Yuan Y, Liu J, Li G, Zhu D, Yan P, Li Q, Liu W, Liu T, Shen D. Artificial General Intelligence for Medical Imaging Analysis. IEEE Rev Biomed Eng 2025; 18:113-129. PMID: 39509310; DOI: 10.1109/rbme.2024.3493775.
Abstract
Large-scale Artificial General Intelligence (AGI) models, including Large Language Models (LLMs) such as ChatGPT/GPT-4, have achieved unprecedented success in a variety of general domain tasks. Yet, when applied directly to specialized domains like medical imaging, which require in-depth expertise, these models face notable challenges arising from the medical field's inherent complexities and unique characteristics. In this review, we delve into the potential applications of AGI models in medical imaging and healthcare, with a primary focus on LLMs, Large Vision Models, and Large Multimodal Models. We provide a thorough overview of the key features and enabling techniques of LLMs and AGI, and further examine the roadmaps guiding the evolution and implementation of AGI models in the medical sector, summarizing their present applications, potentialities, and associated challenges. In addition, we highlight potential future research directions, offering a holistic view on upcoming ventures. This comprehensive review aims to offer insights into the future implications of AGI in medical imaging, healthcare, and beyond.
13. Hou Y, Bert C, Gomaa A, Lahmer G, Höfler D, Weissmann T, Voigt R, Schubert P, Schmitter C, Depardon A, Semrau S, Maier A, Fietkau R, Huang Y, Putz F. Fine-tuning a local LLaMA-3 large language model for automated privacy-preserving physician letter generation in radiation oncology. Front Artif Intell 2025; 7:1493716. PMID: 39877751; PMCID: PMC11772293; DOI: 10.3389/frai.2024.1493716.
Abstract
Introduction Generating physician letters is a time-consuming task in daily clinical practice. Methods This study investigates local fine-tuning of large language models (LLMs), specifically LLaMA models, for physician letter generation in a privacy-preserving manner within the field of radiation oncology. Results Our findings demonstrate that base LLaMA models, without fine-tuning, are inadequate for effectively generating physician letters. The QLoRA algorithm provides an efficient method for local intra-institutional fine-tuning of LLMs with limited computational resources (i.e., a single 48 GB GPU workstation within the hospital). The fine-tuned LLM successfully learns radiation oncology-specific information and generates physician letters in an institution-specific style. ROUGE scores of the generated summary reports highlight the superiority of the 8B LLaMA-3 model over the 13B LLaMA-2 model. Further multidimensional physician evaluations of 10 cases reveal that, although the fine-tuned LLaMA-3 model has limited capacity to generate content beyond the provided input data, it successfully generates salutations, diagnoses and treatment histories, recommendations for further treatment, and planned schedules. Overall, clinical benefit was rated highly by the clinical experts (average score of 3.4 on a 4-point scale). Discussion With careful physician review and correction, automated LLM-based physician letter generation has significant practical value.
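For context, the study's key enabling technique is QLoRA: quantizing the base LLaMA model to 4 bits and training only low-rank adapters, which fits on a single 48 GB GPU. The sketch below outlines a generic setup of that kind with the Hugging Face transformers and peft libraries; the checkpoint name, adapter hyperparameters, and data handling are assumptions, not the authors' configuration:

```python
# Generic QLoRA-style setup (illustrative): 4-bit quantized base model + LoRA adapters.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint (gated; access required)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # NF4 4-bit quantization keeps GPU memory low
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,                  # typical values, not the paper's
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trained on-site
# A supervised fine-tuning loop over de-identified clinical text would follow here.
```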
Affiliation(s)
- Yihao Hou
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Christoph Bert
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Ahmed Gomaa
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Godehard Lahmer
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Daniel Höfler
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Thomas Weissmann
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Raphaela Voigt
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Philipp Schubert
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Charlotte Schmitter
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Alina Depardon
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Sabine Semrau
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Andreas Maier
- Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Rainer Fietkau
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
- Yixing Huang
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Institute of Medical Technology, Health Science Center, Peking University, Beijing, China
- Florian Putz
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN, Erlangen, Germany
- Bavarian Cancer Research Center (BZKF), Erlangen, Germany
14. Yang H, Hu M, Most A, Hawkins WA, Murray B, Smith SE, Li S, Sikora A. Evaluating accuracy and reproducibility of large language model performance on critical care assessments in pharmacy education. Front Artif Intell 2025; 7:1514896. PMID: 39850846; PMCID: PMC11754395; DOI: 10.3389/frai.2024.1514896.
Abstract
Background Large language models (LLMs) have demonstrated impressive performance on medical licensing and diagnosis-related exams. However, comparative evaluations to optimize LLM performance and ability in the domain of comprehensive medication management (CMM) are lacking. The purpose of this evaluation was to test various LLM performance optimization strategies and performance on critical care pharmacotherapy questions used in the assessment of Doctor of Pharmacy students. Methods In a comparative analysis using 219 multiple-choice pharmacotherapy questions, five LLMs (GPT-3.5, GPT-4, Claude 2, Llama2-7b and 2-13b) were evaluated. Each LLM was queried five times to evaluate the primary outcome of accuracy (i.e., correctness). Secondary outcomes included variance, the impact of prompt engineering techniques (e.g., chain-of-thought, CoT) and of training a customized GPT on performance, and comparison to third-year Doctor of Pharmacy students on knowledge recall vs. knowledge application questions. Accuracy and variance were compared using Student's t-test across different model settings. Results ChatGPT-4 exhibited the highest accuracy (71.6%), while Llama2-13b had the lowest variance (0.070). All LLMs performed more accurately on knowledge recall vs. knowledge application questions (e.g., ChatGPT-4: 87% vs. 67%). When applied to ChatGPT-4, few-shot CoT across five runs improved accuracy (77.4% vs. 71.5%) with no effect on variance. Self-consistency and the custom-trained GPT demonstrated similar accuracy to ChatGPT-4 with few-shot CoT. Overall pharmacy student accuracy was 81%, compared to an optimal overall LLM accuracy of 73%. Comparing question types, six of the LLM configurations demonstrated equivalent or higher accuracy than pharmacy students on knowledge recall questions (e.g., self-consistency vs. students: 93% vs. 84%), but pharmacy students achieved higher accuracy than all LLMs on knowledge application questions (e.g., self-consistency vs. students: 68% vs. 80%). Conclusion ChatGPT-4 was the most accurate LLM on critical care pharmacy questions, and few-shot CoT improved accuracy the most. Average student accuracy was similar to the LLMs overall and higher on knowledge application questions. These findings support the need for future assessment of customized training for the type of output needed. Reliance on LLMs is only supported for recall-based questions.
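One of the better-performing strategies above, self-consistency, simply samples several chain-of-thought answers and takes a majority vote. A hedged sketch of that voting step is shown below; the `ask_llm` callable stands in for any chat-completion client and is an assumption, not a named library call:

```python
# Sketch of self-consistency voting over sampled chain-of-thought answers to a
# multiple-choice question. The LLM client is supplied by the caller.
import re
from collections import Counter
from typing import Callable

def self_consistent_answer(question: str, ask_llm: Callable[[str], str], n_samples: int = 5) -> str:
    """Majority vote over several sampled completions (use temperature > 0)."""
    prompt = question + "\nThink step by step, then end with 'Answer: <letter>'."
    votes = []
    for _ in range(n_samples):
        reply = ask_llm(prompt)                          # caller-provided LLM call
        match = re.search(r"Answer:\s*([A-D])", reply)   # crude answer extraction
        if match:
            votes.append(match.group(1))
    return Counter(votes).most_common(1)[0][0] if votes else "no answer"
```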
Affiliation(s)
- Huibo Yang
- Department of Computer Science, University of Virginia, Charlottesville, VA, United States
- Mengxuan Hu
- School of Data Science, University of Virginia, Charlottesville, VA, United States
- Amoreena Most
- University of Georgia College of Pharmacy, Augusta, GA, United States
- W. Anthony Hawkins
- Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy, Albany, GA, United States
- Brian Murray
- University of Colorado Skaggs School of Pharmacy and Pharmaceutical Sciences, Aurora, CO, United States
- Susan E. Smith
- Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy, Athens, GA, United States
- Sheng Li
- School of Data Science, University of Virginia, Charlottesville, VA, United States
- Andrea Sikora
- Department of Clinical and Administrative Pharmacy, University of Georgia College of Pharmacy, Augusta, GA, United States
15. Azimi I, Qi M, Wang L, Rahmani AM, Li Y. Evaluation of LLMs accuracy and consistency in the registered dietitian exam through prompt engineering and knowledge retrieval. Sci Rep 2025; 15:1506. PMID: 39789057; PMCID: PMC11718202; DOI: 10.1038/s41598-024-85003-w.
Abstract
Large language models (LLMs) are fundamentally transforming human-facing applications in the health and well-being domains: boosting patient engagement, accelerating clinical decision-making, and facilitating medical education. Although state-of-the-art LLMs have shown superior performance in several conversational applications, evaluations within nutrition and diet applications are still insufficient. In this paper, we propose to employ the Registered Dietitian (RD) exam to conduct a standard and comprehensive evaluation of state-of-the-art LLMs, GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, assessing both accuracy and consistency in nutrition queries. Our evaluation includes 1050 RD exam questions encompassing several nutrition topics and proficiency levels. In addition, for the first time, we examine the impact of Zero-Shot (ZS), Chain of Thought (CoT), Chain of Thought with Self Consistency (CoT-SC), and Retrieval Augmented Prompting (RAP) on both accuracy and consistency of the responses. Our findings revealed that while these LLMs obtained acceptable overall performance, their results varied considerably with different prompts and question domains. GPT-4o with CoT-SC prompting outperformed the other approaches, whereas Gemini 1.5 Pro with ZS recorded the highest consistency. For GPT-4o and Claude 3.5, CoT improved the accuracy, and CoT-SC improved both accuracy and consistency. RAP was particularly effective for GPT-4o to answer Expert level questions. Consequently, choosing the appropriate LLM and prompting technique, tailored to the proficiency level and specific domain, can mitigate errors and potential risks in diet and nutrition chatbots.
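Because each question was posed repeatedly, two per-question measures matter: accuracy against the answer key and consistency across runs. The sketch below shows one plausible way to compute both; defining consistency as the share of runs matching the modal answer is an assumption, not necessarily the paper's exact formula:

```python
# Sketch: per-question accuracy and run-to-run consistency over repeated queries.
from collections import Counter

def accuracy_and_consistency(runs: list[str], correct: str) -> tuple[float, float]:
    accuracy = sum(r == correct for r in runs) / len(runs)
    consistency = Counter(runs).most_common(1)[0][1] / len(runs)  # 1.0 = identical answer every run
    return accuracy, consistency

print(accuracy_and_consistency(["B", "B", "B", "C", "B"], correct="B"))  # (0.8, 0.8)
```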
Affiliation(s)
- Iman Azimi
- Department of Engineering, iHealth Labs, Sunnyvale, CA, 94085, United States
- Mohan Qi
- Department of Engineering, iHealth Labs, Sunnyvale, CA, 94085, United States
- Li Wang
- Department of Clinical Research, iHealth Labs, Sunnyvale, CA, 94085, United States
- Amir M Rahmani
- School of Nursing and Department of Computer Science, University of California Irvine, Irvine, CA, 92697, United States
- Youlin Li
- Department of Engineering, iHealth Labs, Sunnyvale, CA, 94085, United States
16. Altalla' B, Ahmad A, Bitar L, Al-Bssol M, Al Omari A, Sultan I. Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis. Int J Biomed Imaging 2025; 2025:5019035. PMID: 39968311; PMCID: PMC11835477; DOI: 10.1155/ijbi/5019035.
Abstract
Recent advancements in large language models (LLMs), particularly GPT-3.5 and GPT-4, have sparked significant interest in their application within the medical field. This research offers a detailed comparative analysis of the abilities of GPT-3.5 and GPT-4 in the context of annotating radiology reports and generating impressions from chest computed tomography (CT) scans. The primary objective is to use these models to assist healthcare professionals in handling routine documentation tasks. Employing methods such as in-context learning (ICL) and retrieval-augmented generation (RAG), the study focused on generating impression sections from radiological findings. Comprehensive evaluation was applied using a variety of metrics, including recall-oriented understudy for gisting evaluation (ROUGE) for n-gram analysis, Instructor Similarity for contextual similarity, and BERTScore for semantic similarity, to assess the performance of these models. The study shows distinct performance differences between GPT-3.5 and GPT-4 across both zero-shot and few-shot learning scenarios. It was observed that certain prompts significantly influenced the performance outcomes, with specific prompts leading to more accurate impressions. The RAG method achieved a superior BERTScore of 0.92, showcasing its ability to generate semantically rich and contextually accurate impressions. In contrast, GPT-3.5 and GPT-4 excel in preserving language tone, with Instructor Similarity scores of approximately 0.92 across scenarios, underscoring the importance of prompt design in effective summarization tasks. The findings of this research emphasize the critical role of prompt design in optimizing model efficacy and point to the significant potential for further exploration in prompt engineering. Moreover, the study advocates for the standardized integration of such advanced LLMs in healthcare practices, highlighting their potential to enhance the efficiency and accuracy of medical documentation.
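The metrics named above can be reproduced with standard open-source packages. The sketch below scores an invented generated impression against an invented reference using ROUGE and BERTScore; the example texts and metric settings are assumptions, not the study's protocol:

```python
# Sketch: ROUGE (n-gram overlap) and BERTScore (semantic similarity) for a
# generated impression vs. a reference impression. Texts are invented examples.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "No acute intrathoracic abnormality. Stable 4 mm right upper lobe nodule."
generated = "Stable 4 mm nodule in the right upper lobe; no acute chest abnormality."

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))          # precision/recall/F1 per ROUGE variant

P, R, F1 = bert_score([generated], [reference], lang="en")
print(f"BERTScore F1: {F1.item():.2f}")            # closer to 1 = more semantically similar
```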
Affiliation(s)
- Bayan Altalla'
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
- School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
- Ashraf Ahmad
- School of Computing Sciences, Princess Sumaya University for Technology, Amman, Jordan
- Layla Bitar
- Artificial Intelligence and Data Innovation Office, King Hussein Cancer Center, Amman, Jordan
- Mohammed Al-Bssol
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
- Amal Al Omari
- Office of Scientific Affairs and Research, King Hussein Cancer Center, Amman, Jordan
- Iyad Sultan
- Artificial Intelligence and Data Innovation Office, King Hussein Cancer Center, Amman, Jordan
17. Ramadan S, Mutsaers A, Chen PHC, Bauman G, Velker V, Ahmad B, Arifin AJ, Nguyen TK, Palma D, Goodman CD. Evaluating ChatGPT's competency in radiation oncology: A comprehensive assessment across clinical scenarios. Radiother Oncol 2025; 202:110645. PMID: 39571686; DOI: 10.1016/j.radonc.2024.110645.
Abstract
PURPOSE Artificial intelligence (AI) and machine learning present an opportunity to enhance clinical decision-making in radiation oncology. This study aims to evaluate the competency of ChatGPT, an AI language model, in interpreting clinical scenarios and assessing its oncology knowledge. METHODS AND MATERIALS A series of clinical cases were designed covering 12 disease sites. Questions were grouped into domains: epidemiology, staging and workup, clinical management, treatment planning, cancer biology, physics, and surveillance. Royal College-certified radiation oncologists (ROs) reviewed cases and provided solutions. ROs scored responses on 3 criteria: conciseness (focused answers), completeness (addressing all aspects of the question), and correctness (answer aligns with expert opinion) using a standardized rubric. Scores ranged from 0 to 5 for each criterion for a total possible score of 15. RESULTS Across 12 cases, 182 questions were answered with a total AI score of 2317/2730 (84 %). Scores by criteria were: completeness (79 %, range: 70-99 %), conciseness (92 %, range: 83-99 %), and correctness (81 %, range: 72-92 %). AI performed best in the domains of epidemiology (93 %) and cancer biology (93 %) and reasonably in staging and workup (89 %), physics (86 %) and surveillance (82 %). Weaker domains included treatment planning (78 %) and clinical management (81 %). Statistical differences were driven by variations in the completeness (p < 0.01) and correctness (p = 0.04) criteria, whereas conciseness scored universally high (p = 0.91). These trends were consistent across disease sites. CONCLUSIONS ChatGPT showed potential as a tool in radiation oncology, demonstrating a high degree of accuracy in several oncologic domains. However, this study highlights limitations with incorrect and incomplete answers in complex cases.
Collapse
Affiliation(s)
- Sherif Ramadan
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada.
| | - Adam Mutsaers
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | | | - Glenn Bauman
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | - Vikram Velker
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | - Belal Ahmad
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | - Andrew J Arifin
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada; Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, USA
| | - Timothy K Nguyen
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | - David Palma
- Department of Radiation Oncology, London Health Sciences Centre, London, ON, Canada
| | | |
Collapse
|
18
|
Kim J, Lee S, Jeon H, Lee KJ, Bae HJ, Kim B, Seo J. PhenoFlow: A Human-LLM Driven Visual Analytics System for Exploring Large and Complex Stroke Datasets. IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS 2025; 31:470-480. [PMID: 39316495 DOI: 10.1109/tvcg.2024.3456215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/26/2024]
Abstract
Acute stroke demands prompt diagnosis and treatment to achieve optimal patient outcomes. However, the intricate and irregular nature of clinical data associated with acute stroke, particularly blood pressure (BP) measurements, presents substantial obstacles to effective visual analytics and decision-making. Through a year-long collaboration with experienced neurologists, we developed PhenoFlow, a visual analytics system that leverages the collaboration between human and Large Language Models (LLMs) to analyze the extensive and complex data of acute ischemic stroke patients. PhenoFlow pioneers an innovative workflow, where the LLM serves as a data wrangler while neurologists explore and supervise the output using visualizations and natural language interactions. This approach enables neurologists to focus more on decision-making with reduced cognitive load. To protect sensitive patient information, PhenoFlow only utilizes metadata to make inferences and synthesize executable codes, without accessing raw patient data. This ensures that the results are both reproducible and interpretable while maintaining patient privacy. The system incorporates a slice-and-wrap design that employs temporal folding to create an overlaid circular visualization. Combined with a linear bar graph, this design aids in exploring meaningful patterns within irregularly measured BP data. Through case studies, PhenoFlow has demonstrated its capability to support iterative analysis of extensive clinical datasets, reducing cognitive load and enabling neurologists to make well-informed decisions. Grounded in long-term collaboration with domain experts, our research demonstrates the potential of utilizing LLMs to tackle current challenges in data-driven clinical decision-making for acute ischemic stroke patients.
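The "slice-and-wrap" idea described above (folding irregular blood-pressure timestamps onto a repeating period so that measurements overlay on a circular axis) can be approximated with an ordinary polar plot. The sketch below folds synthetic systolic BP readings onto a 24-hour clock; it only gestures at the published design and uses made-up data, not the PhenoFlow implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, irregularly timed systolic BP readings over ~3 days (hours, mmHg).
rng = np.random.default_rng(0)
times_h = np.sort(rng.uniform(0, 72, 40))
systolic = 140 + 15 * np.sin(2 * np.pi * times_h / 24) + rng.normal(0, 5, times_h.size)

# "Temporal folding": wrap absolute time onto a 24-hour cycle.
theta = 2 * np.pi * (times_h % 24) / 24

ax = plt.subplot(projection="polar")
ax.scatter(theta, systolic, s=15)
ax.set_theta_zero_location("N")   # midnight at the top
ax.set_theta_direction(-1)        # clockwise, like a clock face
ax.set_title("Systolic BP folded onto a 24-hour cycle (synthetic data)")
plt.show()
```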
Collapse
|
19
|
Zitu MM, Le TD, Duong T, Haddadan S, Garcia M, Amorrortu R, Zhao Y, Rollison DE, Thieu T. Large language models in cancer: potentials, risks, and safeguards. BJR ARTIFICIAL INTELLIGENCE 2025; 2:ubae019. [PMID: 39777117 PMCID: PMC11703354 DOI: 10.1093/bjrai/ubae019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 10/26/2024] [Accepted: 12/09/2024] [Indexed: 01/11/2025]
Abstract
This review examines the use of large language models (LLMs) in cancer, analyzing articles sourced from PubMed, Embase, and Ovid Medline, published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. Fifty-nine articles were included in the review, categorized into 3 segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions of LLMs in cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs in cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
Collapse
Affiliation(s)
- Md Muntasir Zitu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Tuan Dung Le
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Duong
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Shohreh Haddadan
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Melany Garcia
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Rossybelle Amorrortu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Yayi Zhao
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Dana E Rollison
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Thieu
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| |
Collapse
|
20
|
Kinikoglu O, Isik D. Evaluating the Performance of ChatGPT-4o Oncology Expert in Comparison to Standard Medical Oncology Knowledge: A Focus on Treatment-Related Clinical Questions. Cureus 2025; 17:e78076. [PMID: 39872919 PMCID: PMC11771770 DOI: 10.7759/cureus.78076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 01/27/2025] [Indexed: 01/30/2025] Open
Abstract
Integrating artificial intelligence (AI) into oncology can revolutionize decision-making by providing accurate information. This study evaluates the performance of ChatGPT-4o (OpenAI, San Francisco, CA) Oncology Expert, in addressing open-ended clinical oncology questions. Thirty-seven treatment-related questions on solid organ tumors were selected from a hematology-oncology textbook. Responses from ChatGPT-4o Oncology Expert and the textbook were anonymized and independently evaluated by two medical oncologists using a structured scoring system focused on accuracy and clinical justification. Statistical analysis, including paired t-tests, was conducted to compare scores, and interrater reliability was assessed using Cohen's Kappa. Oncology Expert achieved a significantly higher average score of 7.83 compared to the textbook's 7.0 (p < 0.01). In 10 cases, Oncology Expert provided more accurate and updated answers, demonstrating its ability to integrate recent medical knowledge. In 26 cases, both sources provided equally relevant answers, but the Oncology Expert's responses were clearer and easier to understand. Cohen's Kappa indicated almost perfect agreement (κ = 0.93). Both sources included outdated information for bladder cancer treatment, underscoring the need for regular updates. ChatGPT-4o Oncology Expert shows significant potential as a clinical tool in oncology by offering precise, up-to-date, and user-friendly responses. It could transform oncology practice by enhancing decision-making efficiency, improving educational tools, and serving as a reliable adjunct to clinical workflows. However, its integration requires regular updates, expert validation, and a collaborative approach to ensure reliability and relevance in the rapidly evolving field of oncology.
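The statistics reported above (a paired comparison of per-question scores plus Cohen's Kappa for interrater agreement) can be reproduced on toy data with standard libraries. The sketch below uses scipy and scikit-learn with invented scores and ratings, not the study's data.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.metrics import cohen_kappa_score

# Invented per-question scores (same questions, two sources) for illustration.
chatgpt_scores = np.array([8, 7, 9, 8, 7, 8, 9, 7, 8, 8])
textbook_scores = np.array([7, 7, 8, 7, 6, 7, 8, 7, 7, 7])

t_stat, p_value = ttest_rel(chatgpt_scores, textbook_scores)
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")

# Interrater agreement between two evaluators' categorical judgments.
rater_a = ["better", "equal", "equal", "better", "equal"]
rater_b = ["better", "equal", "equal", "better", "better"]
print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")
```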
Collapse
Affiliation(s)
- Oguzcan Kinikoglu
- Medical Oncology, Kartal Dr. Lütfi Kirdar City Hospital, Health Science University, Istanbul, TUR
| | - Deniz Isik
- Medical Oncology, Kartal Dr. Lütfi Kirdar City Hospital, Health Science University, Istanbul, TUR
| |
Collapse
|
21
|
Wong J, Kriegler C, Shrivastava A, Duimering A, Le C. Utility of Chatbot Literature Search in Radiation Oncology. JOURNAL OF CANCER EDUCATION : THE OFFICIAL JOURNAL OF THE AMERICAN ASSOCIATION FOR CANCER EDUCATION 2024:10.1007/s13187-024-02547-1. [PMID: 39673022 DOI: 10.1007/s13187-024-02547-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Accepted: 11/23/2024] [Indexed: 12/15/2024]
Abstract
Artificial intelligence and natural language processing tools have shown promise in oncology by assisting with medical literature retrieval and providing patient support. The potential for these technologies to generate inaccurate yet seemingly correct information poses significant challenges. This study evaluates the effectiveness, benefits, and limitations of ChatGPT for clinical use in conducting literature reviews of radiation oncology treatments. This cross-sectional study used ChatGPT version 3.5 to generate literature searches on radiotherapy options for seven tumor sites, with prompts issued five times per site to generate up to 50 publications per tumor type. The publications were verified using the Scopus database and categorized as correct, irrelevant, or non-existent. Statistical analysis with one-way ANOVA compared the impact factors and citation counts across different tumor sites. Among the 350 publications generated, there were 44 correct, 298 non-existent, and 8 irrelevant papers. The average publication year of all generated papers was 2011, compared to 2009 for the correct papers. The average impact factor of all generated papers was 38.8, compared to 113.8 for the correct papers. There were significant differences in the publication year, impact factor, and citation counts between tumor sites for both correct and non-existent papers. Our study highlights both the potential utility and significant limitations of using AI, specifically ChatGPT 3.5, in radiation oncology literature reviews. The findings emphasize the need for verification of AI outputs, development of standardized quality assurance protocols, and continued research into AI biases to ensure reliable integration into clinical practice.
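The one-way ANOVA comparing impact factors (or citation counts) across tumor sites corresponds directly to scipy's f_oneway. The sketch below runs it on fabricated impact-factor lists for three hypothetical tumor sites; the values are illustrative only.

```python
from scipy.stats import f_oneway

# Fabricated journal impact factors of references returned per tumor site.
breast   = [6.5, 12.1, 4.3, 9.8, 7.7]
prostate = [3.2, 5.5, 4.8, 6.1, 2.9]
lung     = [15.4, 8.9, 21.3, 11.0, 9.6]

f_stat, p_value = f_oneway(breast, prostate, lung)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.3f}")
```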
Collapse
Affiliation(s)
- Justina Wong
- Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, Canada
| | - Conley Kriegler
- Division of Radiation Oncology, Department of Oncology, University of Alberta, Cross Cancer Institute, 11560 University Ave, Edmonton, AB, T6G 1Z2, Canada
| | - Ananya Shrivastava
- Faculty of Medicine and Dentistry, University of Alberta, Edmonton, AB, Canada
| | - Adele Duimering
- Division of Radiation Oncology, Department of Oncology, University of Alberta, Cross Cancer Institute, 11560 University Ave, Edmonton, AB, T6G 1Z2, Canada
| | - Connie Le
- Division of Radiation Oncology, Department of Oncology, University of Alberta, Cross Cancer Institute, 11560 University Ave, Edmonton, AB, T6G 1Z2, Canada.
| |
Collapse
|
22
|
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
| |
Collapse
|
23
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton E, Malin B, Yin Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024; 26:e22769. [PMID: 39509695 PMCID: PMC11582494 DOI: 10.2196/22769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2024] [Revised: 09/19/2024] [Accepted: 10/03/2024] [Indexed: 11/15/2024] Open
Abstract
BACKGROUND The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. OBJECTIVE This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. METHODS We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. RESULTS Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with a relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. CONCLUSIONS Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Ellen Clayton
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Law, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Bradley Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| |
Collapse
|
24
|
Piras A, Morelli I, Colciago RR, Boldrini L, D'Aviero A, De Felice F, Grassi R, Iorio GC, Longo S, Mastroleo F, Desideri I, Salvestrini V. The continuous improvement of digital assistance in the radiation oncologist's work: from web-based nomograms to the adoption of large-language models (LLMs). A systematic review by the young group of the Italian association of radiotherapy and clinical oncology (AIRO). LA RADIOLOGIA MEDICA 2024; 129:1720-1735. [PMID: 39397129 DOI: 10.1007/s11547-024-01891-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/23/2024] [Accepted: 09/20/2024] [Indexed: 10/15/2024]
Abstract
PURPOSE Recently, the availability of online medical resources for radiation oncologists and trainees has significantly expanded, alongside the development of numerous artificial intelligence (AI)-based tools. This review evaluates the impact of web-based clinical decision-making tools in the clinical practice of radiation oncology. MATERIAL AND METHODS We searched databases, including PubMed, EMBASE, and Scopus, using keywords related to web-based clinical decision-making tools and radiation oncology, adhering to PRISMA guidelines. RESULTS Out of 2161 identified manuscripts, 70 were ultimately included in our study. These papers all supported the evidence that web-based tools can be integrated across multiple areas of radiation oncology, with online applications available for dose and clinical calculations, staging, and other purposes. Specifically, the possible benefit of web-based nomograms for educational purposes was investigated in 35 of the evaluated manuscripts. Regarding the applications of digital and AI-based tools to treatment planning, diagnosis, treatment strategy selection, and follow-up, a total of 35 articles were selected. More specifically, 19 articles investigated the role of these tools in heterogeneous cancer types, while nine and seven articles were related to breast and head & neck cancers, respectively. CONCLUSIONS Our analysis suggests that employing web-based and AI tools offers promising potential to enhance the personalization of cancer treatment.
Collapse
Affiliation(s)
- Antonio Piras
- UO Radioterapia Oncologica, Villa Santa Teresa, 90011, Bagheria, Palermo, Italy
- Ri.Med Foundation, 90133, Palermo, Italy
- Department of Health Promotion, Mother and Child Care, Internal Medicine and Medical Specialties, Molecular and Clinical Medicine, University of Palermo, 90127, Palermo, Italy
- Radiation Oncology, Mater Olbia Hospital, Olbia, Italy
| | - Ilaria Morelli
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| | - Riccardo Ray Colciago
- Department of Radiation Oncology, Fondazione IRCCS Istituto Nazionale Dei Tumori, 20133, Milan, Italy
| | - Luca Boldrini
- UOC Radioterapia Oncologica, Fondazione Policlinico Universitario IRCCS "A. Gemelli", Rome, Italy
- Università Cattolica del Sacro Cuore, Rome, Italy
| | - Andrea D'Aviero
- Department of Medical, Oral and Biotechnological Sciences, "G. D'Annunzio" University of Chieti, Chieti, Italy
- Department of Radiation Oncology, "S.S. Annunziata" Chieti Hospital, Chieti, Italy
| | - Francesca De Felice
- Radiation Oncology, Policlinico Umberto I, Department of Radiological, Oncological and Pathological Sciences, "Sapienza" University of Rome, Rome, Italy
| | - Roberta Grassi
- Department of Precision Medicine, University of Campania "L. Vanvitelli", Naples, Italy
| | | | - Silvia Longo
- UOC Radioterapia Oncologica, Fondazione Policlinico Universitario IRCCS "A. Gemelli", Rome, Italy
| | - Federico Mastroleo
- Division of Radiation Oncology, European Institute of Oncology IRCCS, Via Ripamonti 435, 20141, Milan, Italy.
- Department of Oncology and Hemato-Oncology, University of Milan, Milan, Italy.
| | - Isacco Desideri
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| | - Viola Salvestrini
- Radiation Oncology Unit, Department of Experimental and Clinical Biomedical Sciences, Azienda Ospedaliero-Universitaria Careggi, University of Florence, Florence, Italy
| |
Collapse
|
25
|
Holmes J, Zhang L, Ding Y, Feng H, Liu Z, Liu T, Wong WW, Vora SA, Ashman JB, Liu W. Benchmarking a Foundation Large Language Model on its Ability to Relabel Structure Names in Accordance With the American Association of Physicists in Medicine Task Group-263 Report. Pract Radiat Oncol 2024; 14:e515-e521. [PMID: 39243241 DOI: 10.1016/j.prro.2024.04.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2023] [Revised: 04/11/2024] [Accepted: 04/16/2024] [Indexed: 09/09/2024]
Abstract
PURPOSE To introduce the concept of using large language models (LLMs) to relabel structure names in accordance with the American Association of Physicists in Medicine Task Group-263 standard and to establish a benchmark for future studies to reference. METHODS AND MATERIALS Generative Pretrained Transformer (GPT)-4 was implemented within a Digital Imaging and Communications in Medicine server. Upon receiving a structure-set Digital Imaging and Communications in Medicine file, the server prompts GPT-4 to relabel the structure names according to the American Association of Physicists in Medicine Task Group-263 report. The results were evaluated for 3 disease sites: prostate, head and neck, and thorax. For each disease site, 150 patients were randomly selected for manually tuning the instructions prompt (in batches of 50), and 50 patients were randomly selected for evaluation. Structure names considered were those that were most likely to be relevant for studies using structure contours for many patients. RESULTS The per-patient accuracy was 97.2%, 98.3%, and 97.1% for prostate, head and neck, and thorax disease sites, respectively. On a per-structure basis, the clinical target volume was relabeled correctly in 100%, 95.3%, and 92.9% of cases, respectively. CONCLUSIONS Given the accuracy of GPT-4 in relabeling structure names as presented in this work, LLMs are poised to become an important method for standardizing structure names in radiation oncology, especially considering the rapid advancements in LLM capabilities that are likely to continue.
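The workflow above (a DICOM server that extracts structure names from an RTSTRUCT file and asks an LLM to map them to TG-263 nomenclature) can be sketched with pydicom for the extraction step and a plain prompt string for the relabeling request. This is a minimal illustration, not the authors' implementation: the file path is an assumed placeholder and the prompt wording is invented.

```python
import pydicom

# Assumed path to a structure-set (RTSTRUCT) file; replace with a real one.
ds = pydicom.dcmread("RS.example_structure_set.dcm")

# Collect the clinic's original structure names from the ROI sequence.
original_names = [roi.ROIName for roi in ds.StructureSetROISequence]

# Assemble a relabeling prompt for an LLM (wording is illustrative only).
prompt = (
    "Relabel each of the following radiotherapy structure names according to "
    "the AAPM TG-263 standard nomenclature. Return one 'original -> TG-263' "
    "mapping per line.\n" + "\n".join(original_names)
)
print(prompt)
```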
Collapse
Affiliation(s)
- Jason Holmes
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona.
| | - Lian Zhang
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| | - Yuzhen Ding
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| | - Hongying Feng
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| | - Zhengliang Liu
- School of Computing, University of Georgia, Athens, Georgia
| | - Tianming Liu
- School of Computing, University of Georgia, Athens, Georgia
| | - William W Wong
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| | - Sujay A Vora
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| | | | - Wei Liu
- Department of Radiation Oncology, Mayo Clinic, Phoenix, Arizona
| |
Collapse
|
26
|
Carl N, Schramm F, Haggenmüller S, Kather JN, Hetz MJ, Wies C, Michel MS, Wessels F, Brinker TJ. Large language model use in clinical oncology. NPJ Precis Oncol 2024; 8:240. [PMID: 39443582 PMCID: PMC11499929 DOI: 10.1038/s41698-024-00733-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Accepted: 10/12/2024] [Indexed: 10/25/2024] Open
Abstract
Large language models (LLMs) are the subject of intensive research across various healthcare domains. This systematic review and meta-analysis assesses current applications, methodologies, and the performance of LLMs in clinical oncology. A mixed-methods approach was used to extract, summarize, and compare methodological approaches and outcomes. This review includes 34 studies. LLMs are primarily evaluated on their ability to answer oncologic questions across various domains. The meta-analysis highlights a significant performance variance, influenced by diverse methodologies and evaluation criteria. Furthermore, differences in inherent model capabilities, prompting strategies, and oncological subdomains contribute to heterogeneity. The lack of standardized, LLM-specific reporting protocols leads to methodological disparities, which must be addressed to ensure comparability in LLM research and ultimately enable the reliable integration of LLM technologies into clinical practice.
Collapse
Affiliation(s)
- Nicolas Carl
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Franziska Schramm
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Sarah Haggenmüller
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Martin J Hetz
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Christoph Wies
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Medical Faculty, Ruprecht-Karls University Heidelberg, Heidelberg, Germany
| | - Maurice Stephan Michel
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Frederik Wessels
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Titus J Brinker
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany.
| |
Collapse
|
27
|
Irmici G, Cozzi A, Della Pepa G, De Berardinis C, D'Ascoli E, Cellina M, Cè M, Depretto C, Scaperrotta G. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. LA RADIOLOGIA MEDICA 2024; 129:1463-1467. [PMID: 39138732 DOI: 10.1007/s11547-024-01872-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Accepted: 08/01/2024] [Indexed: 08/15/2024]
Abstract
Applications of large language models (LLMs) in the healthcare field have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs (GPT-3.5, GPT-4, and Google Gemini, then called Bard) to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers significantly differed among LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60), followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). Across all LLMs, no significant differences were observed in the rates of correct replies to questions sourced from public databases and newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, which will need to be further refined through in-context training.
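The comparison of correct-answer rates among the three models (57/60, 54/60, and 48/60) is a 3x2 contingency-table problem. The sketch below lays out that table for scipy's chi2_contingency using the counts quoted in the abstract; the resulting p value may differ from the published one, since the authors may have used a test that accounts for the paired (same 60 questions) design.

```python
from scipy.stats import chi2_contingency

# Rows: GPT-4, GPT-3.5, Google Gemini; columns: correct, incorrect (out of 60).
table = [
    [57, 3],   # GPT-4
    [54, 6],   # GPT-3.5
    [48, 12],  # Google Gemini
]

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
```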
Collapse
Affiliation(s)
- Giovanni Irmici
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy.
| | - Andrea Cozzi
- Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale (EOC), Lugano, Switzerland
| | - Gianmarco Della Pepa
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Claudia De Berardinis
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Elisa D'Ascoli
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Michaela Cellina
- Radiology Department, ASST Fatebenefratelli Sacco, Milano, Italy
| | - Maurizio Cè
- Postgraduation School in Radiodiagnostics, Università degli Studi di Milano, Milano, Italy
| | - Catherine Depretto
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Gianfranco Scaperrotta
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| |
Collapse
|
28
|
Apornvirat S, Namboonlue C, Laohawetwanit T. Comparative analysis of ChatGPT and Bard in answering pathology examination questions requiring image interpretation. Am J Clin Pathol 2024; 162:252-260. [PMID: 38619043 DOI: 10.1093/ajcp/aqae036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2023] [Accepted: 03/08/2024] [Indexed: 04/16/2024] Open
Abstract
OBJECTIVES To evaluate the accuracy of ChatGPT and Bard in answering pathology examination questions requiring image interpretation. METHODS The study evaluated ChatGPT-4 and Bard's performance using 86 multiple-choice questions, with 17 (19.8%) focusing on general pathology and 69 (80.2%) on systemic pathology. Of these, 62 (72.1%) included microscopic images, and 57 (66.3%) were first-order questions focusing on diagnosing the disease. The authors presented these artificial intelligence (AI) tools with questions, both with and without clinical contexts, and assessed their answers against a reference standard set by pathologists. RESULTS ChatGPT-4 achieved a 100% (n = 86) accuracy rate in questions with clinical context, surpassing Bard's 87.2% (n = 75). Without context, the accuracy of both AI tools declined significantly, with ChatGPT-4 at 52.3% (n = 45) and Bard at 38.4% (n = 33). ChatGPT-4 consistently outperformed Bard across various categories, particularly in systemic pathology and first-order questions. A notable issue identified was Bard's tendency to "hallucinate" or provide plausible but incorrect answers, especially without clinical context. CONCLUSIONS This study demonstrated the potential of ChatGPT and Bard in pathology education, stressing the importance of clinical context for accurate AI interpretations of pathology images. It underlined the need for careful AI integration in medical education.
Collapse
Affiliation(s)
- Sompon Apornvirat
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
| | | | - Thiyaphat Laohawetwanit
- Division of Pathology, Chulabhorn International College of Medicine, Thammasat University, Pathum Thani, Thailand
- Division of Pathology, Thammasat University Hospital, Pathum Thani, Thailand
| |
Collapse
|
29
|
Ming S, Guo Q, Cheng W, Lei B. Influence of Model Evolution and System Roles on ChatGPT's Performance in Chinese Medical Licensing Exams: Comparative Study. JMIR MEDICAL EDUCATION 2024; 10:e52784. [PMID: 39140269 PMCID: PMC11336778 DOI: 10.2196/52784] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 05/20/2024] [Accepted: 06/20/2024] [Indexed: 08/15/2024]
Abstract
Background With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the GPT version (3.5 or 4.0), the prompt's designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model's accuracy and consistency. Results GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role enhanced the model's reliability and answer coherence, although not to a statistically significant degree. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.
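One factor examined above is whether a system role tailored to the medical subspecialty changes accuracy. With the current OpenAI Python client (which differs from the interface available at the time of the study), that simply means prepending a system message to the chat history, as in the hedged sketch below; the model name, question stem, and role wording are placeholders, and the call assumes a configured API key.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

question = "A 55-year-old presents with crushing chest pain radiating to the left arm..."  # placeholder stem

def ask(with_system_role: bool) -> str:
    messages = []
    if with_system_role:
        # Subspecialty-tailored system role (wording is illustrative only).
        messages.append({"role": "system",
                         "content": "You are a board-certified cardiologist answering a licensing-exam question."})
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content

print(ask(with_system_role=True))
```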
Collapse
Affiliation(s)
- Shuai Ming
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Qingge Guo
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Wenjun Cheng
- Department of Ophthalmology, People’s Hospital of Zhengzhou University, Zhengzhou, China
| | - Bo Lei
- Department of Ophthalmology, Henan Eye Hospital, Henan Provincial People’s Hospital, Zhengzhou, China
- Eye Institute, Henan Academy of Innovations in Medical Science, Zhengzhou, China
- Henan Clinical Research Center for Ocular Diseases, People’s Hospital of Zhengzhou University, Zhengzhou, China
| |
Collapse
|
30
|
Benson R, Elia M, Hyams B, Chang JH, Hong JC. A Narrative Review on the Application of Large Language Models to Support Cancer Care and Research. Yearb Med Inform 2024; 33:90-98. [PMID: 40199294 PMCID: PMC12020524 DOI: 10.1055/s-0044-1800726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/10/2025] Open
Abstract
OBJECTIVES The emergence of large language models has resulted in a significant shift in informatics research and carries promise in clinical cancer care. Here we provide a narrative review of the recent use of large language models (LLMs) to support cancer care, prevention, and research. METHODS We performed a search of the Scopus database for studies on the application of bidirectional encoder representations from transformers (BERT) and generative-pretrained transformer (GPT) LLMs in cancer care published between the start of 2021 and the end of 2023. We present salient and impactful papers related to each of these themes. RESULTS Studies identified focused on aspects of clinical decision support (CDS), cancer education, and support for research activities. The use of LLMs for CDS primarily focused on aspects of treatment and screening planning, treatment response, and the management of adverse events. Studies using LLMs for cancer education typically focused on question-answering, assessing cancer myths and misconceptions, and text summarization and simplification. Finally, studies using LLMs to support research activities focused on scientific writing and idea generation, cohort identification and extraction, clinical data processing, and NLP-centric tasks. CONCLUSIONS The application of LLMs in cancer care has shown promise across a variety of diverse use cases. Future research should utilize quantitative metrics, qualitative insights, and user insights in the development and evaluation of LLM-based cancer care tools. The development of open-source LLMs for use in cancer care research and activities should also be a priority.
Collapse
Affiliation(s)
- Ryzen Benson
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
| | - Marianna Elia
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
| | - Benjamin Hyams
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- School of Medicine, University of California, San Francisco, San Francisco, California
| | - Ji Hyun Chang
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- Department of Radiation Oncology, Seoul National University Hospital, Seoul National University College of Medicine, Seoul, Korea
| | - Julian C. Hong
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California
- UCSF UC Berkeley Joint Program in Computational Precision Health (CPH), San Francisco, CA
| |
Collapse
|
31
|
Xu J, Lu L, Peng X, Pang J, Ding J, Yang L, Song H, Li K, Sun X, Zhang S. Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation. JMIR Med Inform 2024; 12:e57674. [PMID: 38952020 PMCID: PMC11225096 DOI: 10.2196/57674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2024] [Revised: 04/03/2024] [Accepted: 05/04/2024] [Indexed: 07/03/2024] Open
Abstract
Background Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs' responses create substantial risks, potentially threatening patients' physical safety. Thus, to detect and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and build a systematic evaluation framework. Objective We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks. Methods First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, existing candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by LLM-based chatbots were recorded for blind evaluations by 5 licensed medical experts. The evaluation criteria that were obtained covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory. Results Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogues and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate category, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities compared with ChatGPT in the multiple-turn dialogue scenario. Conclusions MedGPTEval provides comprehensive criteria to evaluate LLM-based chatbots in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, such an assessment system can be easily adopted by researchers in this community to augment an open-source data set.
Collapse
Affiliation(s)
- Jie Xu
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| | - Lu Lu
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| | - Xinwei Peng
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| | - Jiali Pang
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| | - Jinru Ding
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| | - Lingrui Yang
- Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Huan Song
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Med-X Center for Informatics, Sichuan University, Chengdu, China
| | - Xin Sun
- Clinical Research and Innovation Unit, Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
| | - Shaoting Zhang
- Shanghai Artificial Intelligence Laboratory, OpenMedLab, Shanghai, China
| |
Collapse
|
32
|
Yadav GS, Pandit K, Connell PT, Erfani H, Nager CW. Comparative Analysis of Performance of Large Language Models in Urogynecology. UROGYNECOLOGY (PHILADELPHIA, PA.) 2024:02273501-990000000-00247. [PMID: 38954607 DOI: 10.1097/spv.0000000000001545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/04/2024]
Abstract
IMPORTANCE Despite growing popularity in medicine, data on large language models in urogynecology are lacking. OBJECTIVE The aim of this study was to compare the performance of ChatGPT-3.5, GPT-4, and Bard on the American Urogynecologic Society self-assessment examination. STUDY DESIGN The examination features 185 questions with a passing score of 80. We tested 3 models (ChatGPT-3.5, GPT-4, and Bard) on every question. Dedicated accounts enabled controlled comparisons. Questions with prompts were inputted into each model's interface, and responses were evaluated for correctness, logical reasoning behind answer choice, and sourcing. Data on subcategory, question type, correctness rate, question difficulty, and reference quality were noted. The Fisher exact or χ2 test was used for statistical analysis. RESULTS Out of 185 questions, GPT-4 answered 61.6% of questions correctly compared with 54.6% for GPT-3.5 and 42.7% for Bard. GPT-4 answered all questions, whereas GPT-3.5 and Bard declined to answer 4 and 25 questions, respectively. All models demonstrated logical reasoning in their correct responses. Performance of all large language models was inversely proportional to the difficulty level of the questions. Bard referenced sources 97.5% of the time, more often than GPT-4 (83.3%) and GPT-3.5 (39%). GPT-3.5 cited books and websites, whereas GPT-4 and Bard additionally cited journal articles and society guidelines. Median journal impact factor and number of citations were 3.6 with 20 citations for GPT-4 and 2.6 with 25 citations for Bard. CONCLUSIONS Although GPT-4 outperformed GPT-3.5 and Bard, none of the models achieved a passing score. Clinicians should use language models cautiously in patient care scenarios until more evidence emerges.
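For pairwise comparisons of correct-answer counts such as those reported above, the Fisher exact test on a 2x2 table is straightforward. The sketch below compares GPT-4 and Bard using counts approximated from the reported percentages (and ignoring declined questions), so they are illustrative rather than the study's exact data.

```python
from scipy.stats import fisher_exact

# Illustrative 2x2 table: rows = models, columns = (correct, incorrect).
# Counts are approximations from the reported percentages, not exact study data.
table = [
    [114, 71],   # GPT-4 (~61.6% of 185 correct)
    [79, 106],   # Bard  (~42.7% of 185 correct)
]

odds_ratio, p_value = fisher_exact(table)
print(f"Fisher exact test: OR = {odds_ratio:.2f}, p = {p_value:.4f}")
```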
Collapse
Affiliation(s)
- Ghanshyam S Yadav
- From the Division of Urogynecology and Reconstructive Pelvic Surgery, UC San Diego, San Diego, CA
| | - Kshitij Pandit
- Department of Urology, UC San Diego School of Medicine, La Jolla, CA
| | - Phillip T Connell
- Division of Obstetrics and Gynecology, Baylor College of Medicine, Houston, TX
| | - Hadi Erfani
- Division of Gynecologic Oncology, University of Southern California, Los Angeles, CA
| | - Charles W Nager
- From the Division of Urogynecology and Reconstructive Pelvic Surgery, UC San Diego, San Diego, CA
| |
Collapse
|
33
|
Dennstädt F, Hastings J, Putora PM, Vu E, Fischer G, Süveg K, Glatzer M, Riggenbach E, Hà HL, Cihoric N. In Reply to Daungsupawong and Wiwanitkit. Adv Radiat Oncol 2024; 9:101511. [PMID: 38983697 PMCID: PMC11232351 DOI: 10.1016/j.adro.2024.101511] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/11/2024] Open
Affiliation(s)
- Fabio Dennstädt
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
| | - Janna Hastings
- School of Medicine, University of St. Gallen, St. Gallen, Switzerland
- Institute for Implementation Science in Health Care, University of Zurich, Zurich, Switzerland
| | - Paul Martin Putora
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
| | - Erwin Vu
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Galina Fischer
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Krisztian Süveg
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Markus Glatzer
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Elena Riggenbach
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
| | - Hông-Linh Hà
- Department of Radiation Oncology, Kantonsspital St. Gallen, St. Gallen, Switzerland
| | - Nikola Cihoric
- Department of Radiation Oncology, Inselspital, Bern University Hospital and University of Bern, Bern, Switzerland
| |
Collapse
|
34
|
Li DJ, Kao YC, Tsai SJ, Bai YM, Yeh TC, Chu CS, Hsu CW, Cheng SW, Hsu TW, Liang CS, Su KP. Comparing the performance of ChatGPT GPT-4, Bard, and Llama-2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi-center psychiatrists. Psychiatry Clin Neurosci 2024; 78:347-352. [PMID: 38404249 DOI: 10.1111/pcn.13656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 12/08/2023] [Accepted: 02/05/2024] [Indexed: 02/27/2024]
Abstract
AIM Large language models (LLMs) have been suggested to play a role in medical education and medical practice. However, the potential of their application in the psychiatric domain has not been well-studied. METHOD In the first step, we compared the performance of ChatGPT GPT-4, Bard, and Llama-2 in the 2022 Taiwan Psychiatric Licensing Examination conducted in traditional Mandarin. In the second step, we compared the scores of these three LLMs with those of 24 experienced psychiatrists in 10 advanced clinical scenario questions designed for psychiatric differential diagnosis. RESULTS Only GPT-4 passed the 2022 Taiwan Psychiatric Licensing Examination (scoring 69, with ≥ 60 considered a passing grade), while Bard scored 36 and Llama-2 scored 25. GPT-4 outperformed Bard and Llama-2, especially in the areas of 'Pathophysiology & Epidemiology' (χ2 = 22.4, P < 0.001) and 'Psychopharmacology & Other therapies' (χ2 = 15.8, P < 0.001). In the differential diagnosis, the mean score of the 24 experienced psychiatrists (mean 6.1, standard deviation 1.9) was higher than that of GPT-4 (5), Bard (3), and Llama-2 (1). CONCLUSION Compared to Bard and Llama-2, GPT-4 demonstrated superior abilities in identifying psychiatric symptoms and making clinical judgments. Moreover, GPT-4's ability for differential diagnosis closely approached that of the experienced psychiatrists. GPT-4 showed promising potential as a valuable tool in psychiatric practice among the three LLMs.
Collapse
Affiliation(s)
- Dian-Jeng Li
- Department of Addiction Science, Kaohsiung Municipal Kai-Syuan Psychiatric Hospital, Kaohsiung, Taiwan
- Department of Nursing, Meiho University, Pingtung, Taiwan
| | - Yu-Chen Kao
- Department of Psychiatry, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Department of Psychiatry, Tri-Service General Hospital, Beitou branch, Taipei, Taiwan
| | - Shih-Jen Tsai
- Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan
- Department of Psychiatry, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Ya-Mei Bai
- Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan
- Department of Psychiatry, College of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Institute of Brain Science, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Ta-Chuan Yeh
- Department of Psychiatry, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
| | - Che-Sheng Chu
- Center for Geriatric and Gerontology, Kaohsiung Veterans General Hospital, Kaohsiung, Taiwan
- Non-invasive Neuromodulation Consortium for Mental Disorders, Society of Psychophysiology, Taipei, Taiwan
- Graduate Institute of Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Department of Psychiatry, Kaohsiung Veterans General Hospital, Kaohsiung, Taiwan
| | - Chih-Wei Hsu
- Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan
| | - Szu-Wei Cheng
- Department of General Medicine, Chi Mei Medical Center, Tainan, Taiwan
- Mind-Body Interface Laboratory (MBI-Lab) and Department of Psychiatry, China Medical University Hospital, Taichung, Taiwan
| | - Tien-Wei Hsu
- Department of Psychiatry, E-DA Dachang Hospital, I-Shou University, Kaohsiung, Taiwan
- Department of Psychiatry, E-DA Hospital, I-Shou University, Kaohsiung, Taiwan
| | - Chih-Sung Liang
- Department of Psychiatry, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Department of Psychiatry, Tri-Service General Hospital, Beitou branch, Taipei, Taiwan
| | - Kuan-Pin Su
- Mind-Body Interface Laboratory (MBI-Lab) and Department of Psychiatry, China Medical University Hospital, Taichung, Taiwan
- College of Medicine, China Medical University, Taichung, Taiwan
- An-Nan Hospital, China Medical University, Tainan, Taiwan
| |
Collapse
|
35
|
Ömür Arça D, Erdemir İ, Kara F, Shermatov N, Odacioğlu M, İbişoğlu E, Hanci FB, Sağiroğlu G, Hanci V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine (Baltimore) 2024; 103:e38352. [PMID: 39259094 PMCID: PMC11142831 DOI: 10.1097/md.0000000000038352] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 03/07/2024] [Revised: 04/26/2024] [Accepted: 05/03/2024] [Indexed: 09/12/2024] Open
Abstract
This study aimed to evaluate the readability, reliability, and quality of responses by 4 selected artificial intelligence (AI)-based large language model (LLM) chatbots to questions related to cardiopulmonary resuscitation (CPR). This was a cross-sectional study. Responses to the 100 most frequently asked questions about CPR by 4 selected chatbots (ChatGPT-3.5 [Open AI], Google Bard [Google AI], Google Gemini [Google AI], and Perplexity [Perplexity AI]) were analyzed for readability, reliability, and quality. The chatbots were first asked, in English: "What are the 100 most frequently asked questions about cardiopulmonary resuscitation?" Each of the 100 queries derived from the responses was then individually posed to the 4 chatbots. The 400 responses, or patient education materials (PEMs), from the chatbots were assessed for quality and reliability using the modified DISCERN Questionnaire, the Journal of the American Medical Association benchmark criteria, and the Global Quality Score. Readability was assessed with 2 different calculators, which independently computed scores using metrics such as the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Simple Measure of Gobbledygook, Gunning Fog Index, and Automated Readability Index. We analyzed 100 responses from each of the 4 chatbots. When the median readability values obtained from Calculators 1 and 2 were compared with the 6th-grade reading level, there was a highly significant difference between the groups (P < .001). Across all formulas, the readability of the responses was above the 6th-grade level. The order of readability, from easiest to most difficult, was Bard, Perplexity, Gemini, and ChatGPT-3.5. The readability of the text content provided by all 4 chatbots was therefore found to be above the 6th-grade level. We believe that enhancing the quality, reliability, and readability of PEMs will make them easier for readers to understand and will support more accurate performance of CPR. As a result, patients who receive bystander CPR may have an increased likelihood of survival.
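For orientation, the sketch below shows the arithmetic behind two of the readability formulas named in this abstract (Flesch Reading Ease and Flesch-Kincaid Grade Level), computed from pre-tabulated word, sentence, and syllable counts. The study itself relied on online calculators; the counts in the example are invented for illustration.

```python
# Standard readability formulas applied to pre-tabulated counts.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores indicate easier text (roughly 90-100 ~ 5th grade)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade level required to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a hypothetical 250-word chatbot response
print(flesch_reading_ease(words=250, sentences=12, syllables=410))   # ~46.9
print(flesch_kincaid_grade(words=250, sentences=12, syllables=410))  # ~11.9
```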
Collapse
Affiliation(s)
- Dilek Ömür Arça
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - İsmail Erdemir
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Fevzi Kara
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Nurgazy Shermatov
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Mürüvvet Odacioğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Emel İbişoğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Ferid Baran Hanci
- Departments of Faculty of Engineering, Ostim Technical University, Artificial Intelligence Engineering, Ankara, Turkey
| | - Gönül Sağiroğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| | - Volkan Hanci
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
| |
Collapse
|
36
|
Howard FM, Li A, Riffon MF, Garrett-Mayer E, Pearson AT. Characterizing the Increase in Artificial Intelligence Content Detection in Oncology Scientific Abstracts From 2021 to 2023. JCO Clin Cancer Inform 2024; 8:e2400077. [PMID: 38822755 PMCID: PMC11371107 DOI: 10.1200/cci.24.00077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Revised: 04/25/2024] [Accepted: 04/26/2024] [Indexed: 06/03/2024] Open
Abstract
PURPOSE Artificial intelligence (AI) models can generate scientific abstracts that are difficult to distinguish from the work of human authors. The use of AI in scientific writing and the performance of AI detection tools are poorly characterized. METHODS We extracted text from published scientific abstracts from the ASCO 2021-2023 Annual Meetings. Likelihood of AI content was evaluated by three detectors: GPTZero, Originality.ai, and Sapling. Optimal thresholds for AI content detection were selected using 100 abstracts from before 2020 as negative controls, and 100 produced by OpenAI's GPT-3 and GPT-4 models as positive controls. Logistic regression was used to evaluate the association of predicted AI content with submission year and abstract characteristics, and adjusted odds ratios (aORs) were computed. RESULTS Fifteen thousand five hundred and fifty-three abstracts met inclusion criteria. Across detectors, abstracts submitted in 2023 were significantly more likely to contain AI content than those in 2021 (aOR range: 1.79 with Originality to 2.37 with Sapling). Online-only publication and lack of a clinical trial number were consistently associated with AI content. With optimal thresholds, 99.5%, 96%, and 97% of GPT-3/4-generated abstracts were identified by GPTZero, Originality, and Sapling, respectively, and no sampled abstracts from before 2020 were classified as AI generated by the GPTZero and Originality detectors. Correlation between detectors was low to moderate, with Spearman correlation coefficients ranging from 0.14 for Originality and Sapling to 0.47 for Sapling and GPTZero. CONCLUSION There is an increasing signal of AI content in ASCO abstracts, coinciding with the growing popularity of generative AI models.
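As a rough, hedged illustration of the two analyses mentioned above, the sketch below computes a Spearman correlation between two detectors' scores and fits a logistic regression whose exponentiated coefficient is an odds ratio for submission year. It runs on synthetic data and, unlike the study, does not adjust for abstract characteristics.

```python
# Synthetic-data sketch: detector agreement (Spearman) and an odds ratio
# for "flagged as AI" by submission year, via logistic regression.
import numpy as np
from scipy.stats import spearmanr
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
detector_a = rng.uniform(0, 1, n)                                # e.g. one detector's score
detector_b = np.clip(detector_a + rng.normal(0, 0.3, n), 0, 1)   # e.g. another detector

rho, p_corr = spearmanr(detector_a, detector_b)
print(f"Spearman rho = {rho:.2f} (p = {p_corr:.3g})")

# Binary "flagged as AI" outcome vs. submission year (0 = 2021, 1 = 2023)
year_2023 = rng.integers(0, 2, n)
flagged = rng.binomial(1, 0.10 + 0.10 * year_2023)
X = sm.add_constant(year_2023.astype(float))
fit = sm.Logit(flagged, X).fit(disp=0)
print("OR for 2023 vs 2021:", np.exp(fit.params[1]))
```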
Collapse
Affiliation(s)
- Frederick M. Howard
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| | - Anran Li
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| | - Mark F. Riffon
- Center for Research and Analytics, American Society of Clinical Oncology, Alexandria, VA
| | | | - Alexander T. Pearson
- Section of Hematology/Oncology, Department of Medicine, The University of Chicago, Chicago, IL
| |
Collapse
|
37
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.04.26.24306390. [PMID: 38712148 PMCID: PMC11071576 DOI: 10.1101/2024.04.26.24306390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction, and (4) administration, and four categories of concerns: (1) reliability, (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) research papers using LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) research papers expressing concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to thoughtfully examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications bring about bias and privacy issues. Considering the broad accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in healthcare.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
Collapse
|
38
|
Lv X, Zhang X, Li Y, Ding X, Lai H, Shi J. Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content. J Med Internet Res 2024; 26:e55847. [PMID: 38663010 PMCID: PMC11082737 DOI: 10.2196/55847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Revised: 03/04/2024] [Accepted: 03/19/2024] [Indexed: 05/12/2024] Open
Abstract
BACKGROUND While large language models (LLMs) such as ChatGPT and Google Bard have shown significant promise in various fields, their broader impact on enhancing patient health care access and quality, particularly in specialized domains such as oral health, requires comprehensive evaluation. OBJECTIVE This study aims to assess the effectiveness of Google Bard, ChatGPT-3.5, and ChatGPT-4 in offering recommendations for common oral health issues, benchmarked against responses from human dental experts. METHODS This comparative analysis used 40 questions derived from patient surveys on prevalent oral diseases, which were posed in a simulated clinical environment. Responses obtained from both human experts and LLMs were subjected to a blinded evaluation process by experienced dentists and lay users, focusing on readability, appropriateness, harmlessness, comprehensiveness, intent capture, and helpfulness. Additionally, the stability of the artificial intelligence responses was assessed by submitting each question 3 times under consistent conditions. RESULTS Google Bard excelled in readability but lagged in appropriateness when compared to human experts (mean 8.51, SD 0.37 vs mean 9.60, SD 0.33; P=.03). ChatGPT-3.5 and ChatGPT-4, however, performed comparably with human experts in terms of appropriateness (mean 8.96, SD 0.35 and mean 9.34, SD 0.47, respectively), with ChatGPT-4 demonstrating the highest stability and reliability. Furthermore, all 3 LLMs received high harmlessness scores comparable to those of human experts, with lay users finding minimal differences in helpfulness and intent capture between the artificial intelligence models and human responses. CONCLUSIONS LLMs, particularly ChatGPT-4, show potential in oral health care, providing patient-centric information for enhancing patient education and clinical care. The observed performance variations underscore the need for ongoing refinement and ethical considerations in health care settings. Future research should focus on developing strategies for the safe integration of LLMs in health care settings.
Collapse
Affiliation(s)
- Xiaolei Lv
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| | - Xiaomeng Zhang
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| | - Yuan Li
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| | - Xinxin Ding
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| | - Hongchang Lai
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| | - Junyu Shi
- Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China
- College of Stomatology, Shanghai Jiao Tong University, Shanghai, China
- National Center for Stomatology, Shanghai, China
- National Clinical Research Center for Oral Diseases, Shanghai, China
- Shanghai Key Laboratory of Stomatology, Shanghai, China
- Shanghai Research Institute of Stomatology, Shanghai, China
| |
Collapse
|
39
|
Sim JA, Huang X, Horan MR, Baker JN, Huang IC. Using natural language processing to analyze unstructured patient-reported outcomes data derived from electronic health records for cancer populations: a systematic review. Expert Rev Pharmacoecon Outcomes Res 2024; 24:467-475. [PMID: 38383308 PMCID: PMC11001514 DOI: 10.1080/14737167.2024.2322664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2023] [Accepted: 02/20/2024] [Indexed: 02/23/2024]
Abstract
INTRODUCTION Patient-reported outcomes (PROs; symptoms, functional status, quality-of-life) expressed in the 'free-text' or 'unstructured' format within clinical notes from electronic health records (EHRs) offer valuable insights beyond biological and clinical data for medical decision-making. However, a comprehensive assessment of utilizing natural language processing (NLP) coupled with machine learning (ML) methods to analyze unstructured PROs and their clinical implementation for individuals affected by cancer remains lacking. AREAS COVERED This study aimed to systematically review published studies that used NLP techniques to extract and analyze PROs in clinical narratives from EHRs for cancer populations. We examined the types of NLP (with and without ML) techniques and platforms for data processing, analysis, and clinical applications. EXPERT OPINION Utilizing NLP methods offers a valuable approach for processing and analyzing unstructured PROs among cancer patients and survivors. These techniques encompass a broad range of applications, such as extracting or recognizing PROs, categorizing, characterizing, or grouping PROs, predicting or stratifying risk for unfavorable clinical results, and evaluating connections between PROs and adverse clinical outcomes. The employment of NLP techniques is advantageous in converting substantial volumes of unstructured PRO data within EHRs into practical clinical utilities for individuals with cancer.
Collapse
Affiliation(s)
- Jin-ah Sim
- Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
- Department of AI Convergence, Hallym University, Chuncheon, Republic of Korea
| | - Xiaolei Huang
- Department of Computer Science, University of Memphis, Memphis, Tennessee, United States
| | - Madeline R. Horan
- Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
| | - Justin N. Baker
- Department of Oncology, St. Jude Children’s Research Hospital, Memphis, TN, USA
| | - I-Chan Huang
- Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, Memphis, TN, USA
| |
Collapse
|
40
|
Putz F, Haderlein M, Lettmaier S, Semrau S, Fietkau R, Huang Y. Exploring the Capabilities and Limitations of Large Language Models for Radiation Oncology Decision Support. Int J Radiat Oncol Biol Phys 2024; 118:900-904. [PMID: 38401978 DOI: 10.1016/j.ijrobp.2023.11.062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2023] [Revised: 11/14/2023] [Accepted: 11/25/2023] [Indexed: 02/26/2024]
Affiliation(s)
- Florian Putz
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Marlen Haderlein
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Sebastian Lettmaier
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Sabine Semrau
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Rainer Fietkau
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Yixing Huang
- Department of Radiation Oncology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany; Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany.
| |
Collapse
|
41
|
Floyd W, Kleber T, Carpenter DJ, Pasli M, Qazi J, Huang C, Leng J, Ackerson BG, Pierpoint M, Salama JK, Boyer MJ. Current Strengths and Weaknesses of ChatGPT as a Resource for Radiation Oncology Patients and Providers. Int J Radiat Oncol Biol Phys 2024; 118:905-915. [PMID: 39058798 DOI: 10.1016/j.ijrobp.2023.10.020] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 10/06/2023] [Accepted: 10/14/2023] [Indexed: 07/28/2024]
Abstract
PURPOSE Chat Generative Pre-Trained Transformer (ChatGPT), an artificial intelligence program that uses natural language processing to generate conversational-style responses to questions or inputs, is increasingly being used by both patients and health care professionals. This study aims to evaluate the accuracy and comprehensiveness of ChatGPT in radiation oncology-related domains, including answering common patient questions, summarizing landmark clinical research studies, and providing literature reviews with specific references supporting current standard-of-care clinical practice in radiation oncology. METHODS AND MATERIALS We assessed the performance of ChatGPT version 3.5 (ChatGPT3.5) in 3 areas. We evaluated ChatGPT3.5's ability to answer 28 templated patient-centered questions applied across 9 cancer types. We then tested ChatGPT3.5's ability to summarize specific portions of 10 landmark studies in radiation oncology. Next, we used ChatGPT3.5 to identify scientific studies supporting current standard-of-care practice in clinical radiation oncology for 5 different cancer types. Each response was graded independently by 2 reviewers, with discordant grades resolved by a third reviewer. RESULTS ChatGPT3.5 frequently generated inaccurate or incomplete responses. Only 39.7% of responses to patient-centered questions were considered correct and comprehensive. When summarizing landmark studies in radiation oncology, 35.0% of ChatGPT3.5's responses were accurate and comprehensive, improving to 43.3% when provided with the full text of the study. ChatGPT3.5's ability to present a list of studies related to standard-of-care clinical practices was also unsatisfactory, with 50.6% of the provided studies fabricated. CONCLUSIONS ChatGPT should not be considered a reliable radiation oncology resource for patients or providers at this time, as it frequently generates inaccurate or incomplete responses. However, natural language processing-based artificial intelligence programs are rapidly evolving, and future versions of ChatGPT or similar programs may demonstrate improved performance in this domain.
Collapse
Affiliation(s)
- Warren Floyd
- Department of Radiation Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas
| | - Troy Kleber
- Department of Radiation Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas
| | - David J Carpenter
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Melisa Pasli
- Brody School of Medicine, East Carolina University, Greenville, North Carolina
| | - Jamiluddin Qazi
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Christina Huang
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Jim Leng
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Bradley G Ackerson
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Matthew Pierpoint
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina
| | - Joseph K Salama
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina
| | - Matthew J Boyer
- Department of Radiation Oncology, Duke University School of Medicine, Durham, North Carolina; Radiation Oncology Clinical Service, Durham VA Health Care System, Durham, North Carolina.
| |
Collapse
|
42
|
Chandra A, Chakraborty A. Exploring the role of large language models in radiation emergency response. JOURNAL OF RADIOLOGICAL PROTECTION : OFFICIAL JOURNAL OF THE SOCIETY FOR RADIOLOGICAL PROTECTION 2024; 44:011510. [PMID: 38324900 DOI: 10.1088/1361-6498/ad270c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 02/07/2024] [Indexed: 02/09/2024]
Abstract
In recent times, the field of artificial intelligence (AI) has been transformed by the introduction of large language models (LLMs). These models, popularized by OpenAI's GPT-3, have demonstrated the emergent capabilities of AI in comprehending and producing text resembling human language, which has helped them transform several industries. However, their role has yet to be explored in the nuclear industry, specifically in managing radiation emergencies. The present work explores LLMs' contextual awareness, natural language interaction, and capacity to comprehend diverse queries in a radiation emergency response setting. In this study, we identify different user types and their specific LLM use-cases in radiation emergencies. Their possible interactions with ChatGPT, a popular LLM, have also been simulated, and preliminary results are presented. Drawing on the insights gained from this exercise, and to address concerns of reliability and misinformation, this study advocates for expert-guided and domain-specific LLMs trained on radiation safety protocols and historical data. This study aims to guide radiation emergency management practitioners and decision-makers in effectively incorporating LLMs into their decision support framework.
Collapse
Affiliation(s)
- Anirudh Chandra
- Radiation Safety Systems Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| | - Abinash Chakraborty
- Health Physics Division, Bhabha Atomic Research Centre, Mumbai 400085, India
| |
Collapse
|
43
|
Kang J, Lafata K, Kim E, Yao C, Lin F, Rattay T, Nori H, Katsoulakis E, Lee CI. Artificial intelligence across oncology specialties: current applications and emerging tools. BMJ ONCOLOGY 2024; 3:e000134. [PMID: 39886165 PMCID: PMC11203066 DOI: 10.1136/bmjonc-2023-000134] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 07/10/2023] [Accepted: 01/03/2024] [Indexed: 02/01/2025]
Abstract
Oncology is becoming increasingly personalised through advancements in precision diagnostics and therapeutics, with more and more data available on both ends to create individualised plans. The depth and breadth of these data are outpacing our natural ability to interpret them. Artificial intelligence (AI) provides a solution to ingest and digest this data deluge to improve detection, prediction and skill development. In this review, we provide multidisciplinary perspectives on oncology applications touched by AI (imaging, pathology, patient triage, radiotherapy, genomics-driven therapy and surgery) and on integration with existing tools (natural language processing, digital twins and clinical informatics).
Collapse
Affiliation(s)
- John Kang
- Department of Radiation Oncology, University of Washington, Seattle, Washington, USA
| | - Kyle Lafata
- Department of Radiation Oncology, Duke University, Durham, North Carolina, USA
- Department of Radiology, Duke University, Durham, North Carolina, USA
- Department of Electrical and Computer Engineering, Duke University, Durham, North Carolina, USA
| | - Ellen Kim
- Department of Radiation Oncology, Brigham and Women's Hospital, Boston, Massachusetts, USA
- Department of Radiation Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
| | - Christopher Yao
- Department of Otolaryngology-Head & Neck Surgery, University of Toronto, Toronto, Ontario, Canada
| | - Frank Lin
- Kinghorn Centre for Clinical Genomics, Garvan Institute of Medical Research, Darlinghurst, New South Wales, Australia
- NHMRC Clinical Trials Centre, Camperdown, New South Wales, Australia
- Faculty of Medicine, St Vincent's Clinical School, University of New South Wales, Sydney, New South Wales, Australia
| | - Tim Rattay
- Department of Genetics and Genome Biology, University of Leicester Cancer Research Centre, Leicester, UK
| | - Harsha Nori
- Microsoft Research, Redmond, Washington, USA
| | - Evangelia Katsoulakis
- Department of Radiation Oncology, University of South Florida, Tampa, Florida, USA
- Veterans Affairs Informatics and Computing Infrastructure, Salt Lake City, Utah, USA
| | | |
Collapse
|
44
|
Liao W, Liu Z, Dai H, Xu S, Wu Z, Zhang Y, Huang X, Zhu D, Cai H, Li Q, Liu T, Li X. Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study. JMIR MEDICAL EDUCATION 2023; 9:e48904. [PMID: 38153785 PMCID: PMC10784984 DOI: 10.2196/48904] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/19/2023] [Revised: 08/03/2023] [Accepted: 09/10/2023] [Indexed: 12/29/2023]
Abstract
BACKGROUND Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. OBJECTIVE This study is among the first on responsible artificial intelligence-generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT and on designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. METHODS We first constructed a suite of data sets containing medical texts written by human experts and generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts of speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. RESULTS Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminology rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers (BERT)-based model effectively detected medical texts generated by ChatGPT, and the F1 score exceeded 95%. CONCLUSIONS Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts were different from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine.
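The authors publish their own data and code on GitHub; the sketch below is only a generic outline of the kind of BERT-based detector described here, fine-tuned to label a text as human-written (0) or ChatGPT-generated (1). The checkpoint name and the two toy examples are placeholders, not the paper's pipeline.

```python
# Generic fine-tuning sketch for a binary human-vs-ChatGPT text detector.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

texts = ["Patient presents with ...", "The patient may experience ..."]  # toy examples
labels = [0, 1]                                   # 0 = human-written, 1 = ChatGPT-generated

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tok(b["text"], truncation=True,
                          padding="max_length", max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="detector", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=ds,
)
trainer.train()  # a real run would add a held-out set and report F1
```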
Collapse
Affiliation(s)
- Wenxiong Liao
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Zhengliang Liu
- School of Computing, University of Georgia, Athens, GA, United States
| | - Haixing Dai
- School of Computing, University of Georgia, Athens, GA, United States
| | - Shaochen Xu
- School of Computing, University of Georgia, Athens, GA, United States
| | - Zihao Wu
- School of Computing, University of Georgia, Athens, GA, United States
| | - Yiyang Zhang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Xiaoke Huang
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Dajiang Zhu
- Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX, United States
| | - Hongmin Cai
- School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Quanzheng Li
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
| | - Tianming Liu
- School of Computing, University of Georgia, Athens, GA, United States
| | - Xiang Li
- Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
| |
Collapse
|
45
|
Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. FRONTIERS IN EDUCATION 2023; 8. [DOI: 10.3389/feduc.2023.1333415] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/01/2024]
Abstract
Background The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. MCQs are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance on medical microbiology MCQs compared with the students' performance. Methods The study employed an 80-MCQ dataset from a 2021 medical microbiology exam at the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including the facility index and discriminative efficiency, were derived from the performances of 153 midterm and 154 final exam DDS students. ChatGPT 3.5 was used to answer the questions, and responses were assessed for correctness and clarity by two independent raters. Results ChatGPT 3.5 correctly answered 64 out of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received statistically significantly higher average clarity and correctness scores compared with incorrect responses. Conclusion The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to correctly and clearly answer medical microbiology MCQs; nevertheless, its performance was below par compared with the students. Variability in ChatGPT performance across cognitive domains should be considered in future studies. The study insights could contribute to the ongoing evaluation of the role of AI-based models in educational assessment and help augment traditional methods in higher education.
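For context, the facility index cited above is simply the fraction of students who answered an item correctly. The sketch below computes it on toy response data, together with a point-biserial discrimination estimate; note that the "discriminative efficiency" statistic reported by exam platforms such as Moodle is defined differently, so this is only an approximation.

```python
# Toy item-analysis sketch: facility index and point-biserial discrimination.
import numpy as np

# rows = students, columns = items; 1 = correct, 0 = incorrect (made-up data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 0, 0],
])

facility = responses.mean(axis=0)          # per-item proportion correct
totals = responses.sum(axis=1)             # each student's total score
discrimination = [np.corrcoef(responses[:, j], totals)[0, 1]
                  for j in range(responses.shape[1])]

print("facility index per item:", facility)
print("point-biserial discrimination:", np.round(discrimination, 2))
```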
Collapse
|
46
|
Toma A, Senkaiahliyan S, Lawler PR, Rubin B, Wang B. Generative AI could revolutionize health care - but not if control is ceded to big tech. Nature 2023; 624:36-38. [PMID: 38036861 DOI: 10.1038/d41586-023-03803-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2023]
|
47
|
Yu QX, Wu RC, Feng DC, Li DX. Re: ChatGPT encounters multiple opportunities and challenges in neurosurgery. Int J Surg 2023; 109:4393-4394. [PMID: 37720947 PMCID: PMC10720816 DOI: 10.1097/js9.0000000000000749] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 08/25/2023] [Indexed: 09/19/2023]
Affiliation(s)
- Qing-xin Yu
- Department of Pathology, Ningbo Clinical Pathology Diagnosis Center, Ningbo City, Zhejiang Province, People's Republic of China
| | - Rui-cheng Wu
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, Sichuan Province, People’s Republic of China
| | - De-chao Feng
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, Sichuan Province, People’s Republic of China
| | - Deng-xiong Li
- Department of Urology, Institute of Urology, West China Hospital, Sichuan University, Chengdu, Sichuan Province, People’s Republic of China
| |
Collapse
|
48
|
Shi Y, Ren P, Wang J, Han B, ValizadehAslani T, Agbavor F, Zhang Y, Hu M, Zhao L, Liang H. Leveraging GPT-4 for food effect summarization to enhance product-specific guidance development via iterative prompting. J Biomed Inform 2023; 148:104533. [PMID: 37918623 DOI: 10.1016/j.jbi.2023.104533] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Revised: 10/12/2023] [Accepted: 10/30/2023] [Indexed: 11/04/2023]
Abstract
Food effect summarization from New Drug Application (NDA) is an essential component of product-specific guidance (PSG) development and assessment, which provides the basis of recommendations for fasting and fed bioequivalence studies to guide the pharmaceutical industry in developing generic drug products. However, manual summarization of food effect from extensive drug application review documents is time-consuming. Therefore, there is a need to develop automated methods to generate food effect summaries. Recent advances in natural language processing (NLP), particularly large language models (LLMs) such as ChatGPT and GPT-4, have demonstrated great potential in improving the effectiveness of automated text summarization, but their accuracy in summarizing food effect for PSG assessment remains unclear. In this study, we introduce a simple yet effective approach, iterative prompting, which allows one to interact with ChatGPT or GPT-4 more effectively and efficiently through multi-turn interaction. Specifically, we propose a three-turn iterative prompting approach to food effect summarization in which keyword-focused and length-controlled prompts are respectively provided in consecutive turns to refine the quality of the generated summary. We conduct a series of extensive evaluations, ranging from automated metrics to assessments by FDA professionals and even evaluation by GPT-4, on 100 NDA review documents selected over the past five years. We observe that the summary quality is progressively improved throughout the iterative prompting process. Moreover, we find that GPT-4 performs better than ChatGPT, as evaluated by FDA professionals (43% vs. 12%) and GPT-4 (64% vs. 35%). Importantly, all the FDA professionals unanimously rated that 85% of the summaries generated by GPT-4 are factually consistent with the golden reference summary, a finding further supported by a GPT-4 consistency rating of 72%. Taken together, these results strongly suggest a great potential for GPT-4 to draft food effect summaries that could be reviewed by FDA professionals, thereby improving the efficiency of the PSG assessment cycle and promoting generic drug product development.
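The three-turn scheme described above can be pictured with the short sketch below, which uses the OpenAI Python client (openai>=1.0). The model name, prompt wording, and keyword list are illustrative assumptions rather than the paper's exact prompts.

```python
# Illustrative three-turn iterative prompting loop: draft, keyword-focused
# refinement, then length-controlled refinement.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages):
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

document_text = "..."  # the NDA review text would go here
messages = [{"role": "user",
             "content": f"Summarize the food effect findings in this review:\n{document_text}"}]
draft = ask(messages)                                   # turn 1: initial summary

messages.append({"role": "user",
                 "content": "Revise the summary so it explicitly covers AUC, "
                            "Cmax, and fed versus fasted conditions."})
draft = ask(messages)                                   # turn 2: keyword-focused

messages.append({"role": "user",
                 "content": "Condense the summary to at most 120 words."})
final_summary = ask(messages)                           # turn 3: length-controlled
print(final_summary)
```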
Collapse
Affiliation(s)
- Yiwen Shi
- College of Computing and Informatics, Drexel University, Philadelphia, PA, United States
| | - Ping Ren
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Jing Wang
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Biao Han
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Taha ValizadehAslani
- Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, PA, United States
| | - Felix Agbavor
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States
| | - Yi Zhang
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Meng Hu
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Liang Zhao
- Office of Research and Standards, Office of Generic Drugs, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, MD, United States
| | - Hualou Liang
- School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States.
| |
Collapse
|
49
|
Huang Y, Gomaa A, Semrau S, Haderlein M, Lettmaier S, Weissmann T, Grigo J, Tkhayat HB, Frey B, Gaipl U, Distel L, Maier A, Fietkau R, Bert C, Putz F. Benchmarking ChatGPT-4 on a radiation oncology in-training exam and Red Journal Gray Zone cases: potentials and challenges for ai-assisted medical education and decision making in radiation oncology. Front Oncol 2023; 13:1265024. [PMID: 37790756 PMCID: PMC10543650 DOI: 10.3389/fonc.2023.1265024] [Citation(s) in RCA: 48] [Impact Index Per Article: 24.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/21/2023] [Accepted: 08/23/2023] [Indexed: 10/05/2023] Open
Abstract
Purpose The potential of large language models in medicine for education and decision-making has been demonstrated, as they have achieved decent scores on medical exams such as the United States Medical Licensing Exam (USMLE) and the MedQA exam. This work aims to evaluate the performance of ChatGPT-4 in the specialized field of radiation oncology. Methods The 38th American College of Radiology (ACR) radiation oncology in-training (TXIT) exam and the 2022 Red Journal Gray Zone cases are used to benchmark the performance of ChatGPT-4. The TXIT exam contains 300 questions covering various topics of radiation oncology. The 2022 Gray Zone collection contains 15 complex clinical cases. Results For the TXIT exam, ChatGPT-3.5 and ChatGPT-4 achieved scores of 62.05% and 78.77%, respectively, highlighting the advantage of the latest ChatGPT-4 model. Based on the TXIT exam, ChatGPT-4's strong and weak areas in radiation oncology are identified to some extent. Specifically, ChatGPT-4 demonstrates better knowledge of statistics, CNS & eye, pediatrics, biology, and physics than knowledge of bone & soft tissue and gynecology, as per the ACR knowledge domains. Regarding clinical care paths, ChatGPT-4 performs better in diagnosis, prognosis, and toxicity than in brachytherapy and dosimetry. It lacks proficiency in the in-depth details of clinical trials. For the Gray Zone cases, ChatGPT-4 is able to suggest a personalized treatment approach to each case with high correctness and comprehensiveness. Importantly, it provides novel treatment aspects for many cases, which were not suggested by any of the human experts. Conclusion Both evaluations demonstrate the potential of ChatGPT-4 in medical education for the general public and cancer patients, as well as the potential to aid clinical decision-making, while acknowledging its limitations in certain domains. Owing to the risk of hallucinations, it is essential to verify the content generated by models such as ChatGPT for accuracy.
Collapse
Affiliation(s)
- Yixing Huang
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Ahmed Gomaa
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Sabine Semrau
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Marlen Haderlein
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Sebastian Lettmaier
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Thomas Weissmann
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Johanna Grigo
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Hassen Ben Tkhayat
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Benjamin Frey
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Udo Gaipl
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Luitpold Distel
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Andreas Maier
- Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
| | - Rainer Fietkau
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Christoph Bert
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| | - Florian Putz
- Department of Radiation Oncology, University Hospital Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Comprehensive Cancer Center Erlangen-EMN (CCC ER-EMN), Erlangen, Germany
| |
Collapse
|
50
|
Beriwal S, Corrigan KL, McDermott PN, Ryckman J, Tsao MN, Zheng D, Joiner MC, Dominello MM, Burmeister J. Three Discipline Collaborative Radiation Therapy (3DCRT) special debate: Radiation oncology has become so technologically complex that basic fundamental physics should no longer be included in the modern curriculum for radiation oncology residents. J Appl Clin Med Phys 2023; 24:e14128. [PMID: 37606366 PMCID: PMC10476972 DOI: 10.1002/acm2.14128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Accepted: 08/02/2023] [Indexed: 08/23/2023] Open
Affiliation(s)
- Sushil Beriwal
- Department of Radiation Oncology, Allegheny Health Network, Wexford, Pennsylvania, USA
| | - Kelsey L. Corrigan
- Department of Radiation Oncology, MD Anderson Cancer Center, Houston, Texas, USA
| | | | - Jeffrey Ryckman
- Camden Clark Comprehensive Regional Cancer Center, West Virginia Cancer Institute, Parkersburg, West Virginia, USA
| | - May N. Tsao
- Department of Radiation Oncology, University of Toronto, Odette Cancer Centre, Toronto, ON, Canada
| | - Dandan Zheng
- Department of Radiation Oncology, University of Rochester, Rochester, New York, USA
| | - Michael C. Joiner
- Department of Oncology, Wayne State University School of Medicine, Detroit, Michigan, USA
| | - Michael M. Dominello
- Department of Oncology, Wayne State University School of Medicine, Detroit, Michigan, USA
| | - Jay Burmeister
- Department of Oncology, Wayne State University School of Medicine, Detroit, Michigan, USA
- Gershenson Radiation Oncology Center, Barbara Ann Karmanos Cancer Institute, Detroit, Michigan, USA
| |
Collapse
|