1. Pamuk E, Bilen YE, Külekçi Ç, Kuşcu O. ChatGPT-4 vs. multi-disciplinary tumor board decisions for the therapeutic management of primary laryngeal cancer. Acta Otolaryngol 2025:1-6. PMID: 40358250. DOI: 10.1080/00016489.2025.2502563.
Abstract
BACKGROUND Artificial intelligence-based clinical decision support systems are promising tools for addressing the increasing complexity of oncological data and treatment. However, the integration and validation of models such as ChatGPT within multidisciplinary decision-making processes for head and neck cancers remain limited. OBJECTIVE To evaluate the performance of ChatGPT-4 in the management of primary laryngeal cancer and compare it with multidisciplinary tumor board (MDT) decisions. METHODS The medical records of 25 patients with untreated laryngeal cancer were evaluated using ChatGPT-4 for therapeutic recommendations. The coherence of responses was graded from Grade 1 (totally coherent) to Grade 4 (totally incoherent) and compared with actual MDT decisions. The association between patient features and response grades was also assessed. RESULTS ChatGPT-4 provided totally coherent (Grade 1) responses consistent with MDT decisions in 72% of the patients. The rates of Grade 2 and Grade 3 coherent responses were 20% and 8%, respectively. There were no totally incoherent responses. There was no significant association between the grade of coherence and T stage, N stage, tumor localization, differentiation, or age (p = 0.106, p = 0.588, p = 0.271, p = 0.677, p = 0.506, respectively). CONCLUSION With further improvements, ChatGPT-4 can be a promising adjunct tool for clinicians in decision-making for primary laryngeal cancer.
Affiliation(s)
- Erim Pamuk, Department of Otorhinolaryngology, Hacettepe University
- Çağrı Külekçi, Department of Otorhinolaryngology, Hacettepe University
- Oğuz Kuşcu, Department of Otorhinolaryngology, Hacettepe University
2. Banyi N, Ma B, Amanian A, Bur A, Abdalkhani A. Applications of Natural Language Processing in Otolaryngology: A Scoping Review. Laryngoscope 2025. PMID: 40309961. DOI: 10.1002/lary.32198.
Abstract
OBJECTIVE To review the current literature on the applications of natural language processing (NLP) within the field of otolaryngology. DATA SOURCES MEDLINE, EMBASE, SCOPUS, Cochrane Library, Web of Science, and CINAHL. METHODS The Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews checklist was followed. Databases were searched from the date of inception up to Dec 26, 2023. Original articles on the application of language-based models to otolaryngology patient care and research, regardless of publication date, were included. The studies were classified under the 2011 Oxford CEBM levels of evidence. RESULTS One hundred sixty-six papers with a median publication year of 2024 (range 1982-2024) were included. Sixty-one percent (102/166) of studies used ChatGPT and were published in 2023 or 2024. Sixty studies used NLP for clinical education and decision support, 42 for patient education, 14 for electronic medical record improvement, 5 for triaging, 4 for trainee education, 4 for patient monitoring, 3 for telemedicine, and 1 for medical translation. For research, 37 studies used NLP for extraction, classification, or analysis of data, 17 for thematic analysis, 5 for evaluating scientific reporting, and 4 for manuscript preparation. CONCLUSION The role of NLP in otolaryngology is evolving, with ChatGPT passing OHNS board simulations, though its clinical application requires improvement. NLP shows potential in patient education and post-treatment monitoring. NLP is effective at extracting data from unstructured or large data sets. There is limited research on NLP in trainee education and administrative tasks. Guidelines for NLP use in research are critical.
Affiliation(s)
- Norbert Banyi, The University of British Columbia, Faculty of Medicine, Vancouver, Canada
- Brian Ma, Department of Cellular & Physiological Sciences, University of British Columbia, Vancouver, Canada
- Ameen Amanian, Division of Otolaryngology-Head and Neck Surgery, Department of Surgery, University of British Columbia, Vancouver, Canada
- Andrés Bur, Department of Otolaryngology-Head and Neck Surgery, University of Kansas Medical Centre, Kansas City, Kansas, USA
- Arman Abdalkhani, Division of Otolaryngology-Head and Neck Surgery, Department of Surgery, University of British Columbia, Vancouver, Canada
3. Becerik Ç, Yıldız S, Tepe Karaca Ç, Toros SZ. Evaluation of the Usability of ChatGPT-4 and Google Gemini in Patient Education About Rhinosinusitis. Clin Otolaryngol 2025; 50:456-461. PMID: 39776223. DOI: 10.1111/coa.14273.
Abstract
INTRODUCTION Artificial intelligence (AI)-based chatbots are increasingly used for patient education about common diseases in healthcare, as in many other fields. This study aims to evaluate and compare patient education materials on rhinosinusitis created by two frequently used chatbots, ChatGPT-4 and Google Gemini. METHOD One hundred nine questions taken from patient information websites were divided into 4 categories (general knowledge, diagnosis, treatment, and surgery and complications) and then posed to the chatbots. The answers were evaluated by two expert otolaryngologists, and for questions where the scores differed, a third, more experienced otolaryngologist finalised the evaluation. Questions were scored from 1 to 4: (1) comprehensive/correct, (2) incomplete/partially correct, (3) a mix of accurate and inaccurate data, potentially misleading, and (4) completely inaccurate/irrelevant. RESULTS For ChatGPT-4, all answers in the diagnosis category were rated comprehensive/correct. For Google Gemini, answers rated completely inaccurate/irrelevant were significantly more frequent in the treatment category, and answers rated incomplete/partially correct were significantly more frequent in the surgery and complications category. In the comparison between the two chatbots, ChatGPT-4 had a significantly higher rate of correct answers than Google Gemini in the treatment category. CONCLUSION The answers given by ChatGPT-4 and Google Gemini regarding rhinosinusitis were evaluated as sufficient and informative.
Affiliation(s)
- Çağrı Becerik, Department of Otorhinolaryngology, Kemalpaşa State Hospital, İzmir, Turkey
- Selçuk Yıldız, University of Health Sciences, Haydarpaşa Numune Research and Training Hospital, Department of Otorhinolaryngology, İstanbul, Turkey
- Çiğdem Tepe Karaca, University of Health Sciences, Haydarpaşa Numune Research and Training Hospital, Department of Otorhinolaryngology, İstanbul, Turkey
- Sema Zer Toros, University of Health Sciences, Haydarpaşa Numune Research and Training Hospital, Department of Otorhinolaryngology, İstanbul, Turkey
4. Boztas AE, Ensari E. Comparative analysis of three chatbot responses on pediatric primary nocturnal enuresis. J Pediatr Urol 2025:S1477-5131(25)00258-X. PMID: 40355311. DOI: 10.1016/j.jpurol.2025.04.031.
Abstract
BACKGROUND The purpose of the study was to evaluate the accuracy and reproducibility of the answers given by ChatGPT-4o®, Gemini® and Copilot® to frequently asked questions about pediatric primary nocturnal enuresis. METHODS Forty frequently asked questions about primary nocturnal enuresis were asked twice, one week apart, on ChatGPT-4o, Gemini and Copilot. A pediatric surgeon and a pediatric nephrologist independently scored the answers into 4 groups: comprehensive/correct (1), incomplete/partially correct (2), a mix of accurate and inaccurate/misleading (3), and completely inaccurate/irrelevant (4). The accuracy and reproducibility of each chatbot's answers were evaluated. RESULTS Among these commonly used chatbots, the completely correct response rate was highest for ChatGPT-4o, followed by Gemini and Copilot. With an accuracy of 92.5%, ChatGPT-4o gave the most accurate responses of the three chatbots. Gemini answered 50% of questions correctly. Copilot was the weakest chatbot in answering questions about nocturnal enuresis, with 45% completely accurate answers, and it also produced completely inaccurate/irrelevant responses at a rate of 2.5%. The reproducibility of ChatGPT-4o, Gemini and Copilot was 85%, 77.5% and 70%, respectively. CONCLUSION ChatGPT-4o is more successful in providing a high percentage of accurate responses regarding nocturnal enuresis. Both patients and their parents can use it, especially for simple, low-complexity medical questions. However, it should be used alongside expert healthcare professionals.
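As a rough illustration of the two outcome measures used in this abstract, accuracy can be read as the share of answers graded comprehensive/correct, and reproducibility as the share of questions receiving the same grade in both rounds of questioning. The sketch below is not the authors' code; the function name and example grades are hypothetical.

```python
# Minimal sketch (assumed definitions, not the study's code):
# accuracy = share of round-1 answers graded 1 (comprehensive/correct);
# reproducibility = share of questions with the same grade in both rounds.
def accuracy_and_reproducibility(round1, round2):
    """round1, round2: lists of grades (1-4) for the same ordered questions."""
    assert len(round1) == len(round2)
    n = len(round1)
    accuracy = sum(g == 1 for g in round1) / n
    reproducibility = sum(a == b for a, b in zip(round1, round2)) / n
    return accuracy, reproducibility

# Hypothetical grades for 8 questions asked one week apart.
r1 = [1, 1, 2, 1, 3, 1, 1, 2]
r2 = [1, 1, 2, 2, 3, 1, 1, 1]
acc, rep = accuracy_and_reproducibility(r1, r2)
print(f"accuracy={acc:.1%}, reproducibility={rep:.1%}")
```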
Affiliation(s)
- Asya Eylem Boztas, Health and Science University Dr. Behcet Uz Pediatric Diseases and Surgery Training and Research Hospital, Department of Pediatric Surgery, Kultur mh. Dr. Mustafa Enver Bey cd. No:32 D:10 Konak, Izmir, Turkey
- Esra Ensari, Antalya City Hospital, Department of Paediatric Nephrology, 07080, Antalya, Turkey
5. Patel TA, Michaelson G, Morton Z, Harris A, Smith B, Bourguillon R, Wu E, Eguia A, Maxwell JH. Use of ChatGPT for patient education involving HPV-associated oropharyngeal cancer. Am J Otolaryngol 2025; 46:104642. PMID: 40279734. DOI: 10.1016/j.amjoto.2025.104642.
Abstract
OBJECTIVE This study aims to investigate the ability of ChatGPT to generate reliably accurate responses to patient-based queries specifically regarding oropharyngeal squamous cell carcinoma (OPSCC) of the head and neck. STUDY DESIGN Retrospective review of published abstracts. SETTING Publicly available generative artificial intelligence. METHODS ChatGPT 3.5 (May 2024) was queried with a set of 30 questions pertaining to HPV-associated oropharyngeal cancer that the average patient may ask. This set of questions was queried a total of four times, each time preceded by a different prompt. The responses for each question set were reviewed and graded on a four-part Likert scale, and a Flesch-Kincaid reading level was calculated for each response. RESULTS Across all responses (n = 120), 6.6% were graded as mostly inaccurate, 7.5% as minorly inaccurate, 41.7% as accurate, and 44.2% as accurate and helpful. The average Flesch-Kincaid reading grade level was lowest for the responses without any prompt (11.77); the highest grade levels were found with the physician-friend prompt (12.97). Of the 30 references provided, 25 (83.3%) were found to be authentic published studies. Of the 25 authentic references, the answers accurately cited information found within the original source for 14 references (56%). CONCLUSION ChatGPT was able to produce relatively accurate responses to example patient questions, but there was a high rate of false references. In addition, the reading level of the responses was well above the Centers for Disease Control and Prevention (CDC) recommendations for the average patient.
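For readers unfamiliar with the readability metric used in this and several following abstracts, the Flesch-Kincaid grade level is a fixed formula over word, sentence, and syllable counts: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59. The sketch below uses a naive syllable counter and an invented sample sentence, so it only approximates what dedicated tools report.

```python
import re

def _syllables(word):
    # Naive vowel-group count; dedicated readability tools count syllables more carefully.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

# Hypothetical patient-education sentence, not taken from the study.
sample = ("Human papillomavirus is a common virus. Some strains are linked to "
          "cancers of the throat and tonsils.")
print(round(flesch_kincaid_grade(sample), 2))
```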
Affiliation(s)
- Terral A Patel, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Zoey Morton, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Alexandria Harris, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Brandon Smith, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Richard Bourguillon, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Eric Wu, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Arturo Eguia, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
- Jessica H Maxwell, Department of Otolaryngology - Head and Neck Surgery, University of Pittsburgh Medical Centre, Pittsburgh, PA, USA
6. Grilo A, Marques C, Corte-Real M, Carolino E, Caetano M. Assessing the Quality and Reliability of ChatGPT's Responses to Radiotherapy-Related Patient Queries: Comparative Study With GPT-3.5 and GPT-4. JMIR Cancer 2025; 11:e63677. PMID: 40239208; PMCID: PMC12017613. DOI: 10.2196/63677.
Abstract
Background Patients frequently resort to the internet to access information about cancer. However, these websites often lack content accuracy and readability. Recently, ChatGPT, an artificial intelligence-powered chatbot, has signified a potential paradigm shift in how patients with cancer can access vast amounts of medical information, including insights into radiotherapy. However, the quality of the information provided by ChatGPT remains unclear. This is particularly significant given the general public's limited knowledge of this treatment and concerns about its possible side effects. Furthermore, evaluating the quality of responses is crucial, as misinformation can foster a false sense of knowledge and security, lead to noncompliance, and result in delays in receiving appropriate treatment. Objective This study aims to evaluate the quality and reliability of ChatGPT's responses to common patient queries about radiotherapy, comparing the performance of ChatGPT's two versions: GPT-3.5 and GPT-4. Methods We selected 40 commonly asked radiotherapy questions and entered the queries in both versions of ChatGPT. Response quality and reliability were evaluated by 16 radiotherapy experts using the General Quality Score (GQS), a 5-point Likert scale, with the median GQS determined based on the experts' ratings. Consistency and similarity of responses were assessed using the cosine similarity score, which ranges from 0 (complete dissimilarity) to 1 (complete similarity). Readability was analyzed using the Flesch Reading Ease Score, ranging from 0 to 100, and the Flesch-Kincaid Grade Level, reflecting the average number of years of education required for comprehension. Statistical analyses were performed using the Mann-Whitney test and effect size, with results deemed significant at a 5% level (P=.05). To assess agreement between experts, Krippendorff α and Fleiss κ were used. Results GPT-4 demonstrated superior performance, with a higher GQS and a lower number of scores of 1 and 2, compared to GPT-3.5. The Mann-Whitney test revealed statistically significant differences in some questions, with GPT-4 generally receiving higher ratings. The median (IQR) cosine similarity score indicated substantial similarity (0.81, IQR 0.05) and consistency in the responses of both versions (GPT-3.5: 0.85, IQR 0.04; GPT-4: 0.83, IQR 0.04). Readability scores for both versions were considered college level, with GPT-4 scoring slightly better in the Flesch Reading Ease Score (34.61) and Flesch-Kincaid Grade Level (12.32) compared to GPT-3.5 (32.98 and 13.32, respectively). Responses by both versions were deemed challenging for the general public. Conclusions Both GPT-3.5 and GPT-4 demonstrated having the capability to address radiotherapy concepts, with GPT-4 showing superior performance. However, both models present readability challenges for the general population. Although ChatGPT demonstrates potential as a valuable resource for addressing common patient queries related to radiotherapy, it is imperative to acknowledge its limitations, including the risks of misinformation and readability issues. In addition, its implementation should be supported by strategies to enhance accessibility and readability.
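The cosine similarity score described above measures how close the vector representations of two texts are (0 = dissimilar, 1 = identical). The abstract does not state how responses were vectorized, so the sketch below uses TF-IDF from scikit-learn purely as an illustration; the two example answers are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def response_similarity(text_a, text_b):
    """Cosine similarity between TF-IDF vectors of two chatbot responses."""
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

# Hypothetical GPT-3.5 vs. GPT-4 answers to the same radiotherapy question.
a = "Radiotherapy uses high-energy radiation to destroy cancer cells in a targeted area."
b = "Radiation therapy delivers high-energy beams to kill cancer cells while sparing nearby tissue."
print(round(response_similarity(a, b), 2))
```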
Affiliation(s)
- Ana Grilo, Research Center for Psychological Science (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal
- Catarina Marques, Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Maria Corte-Real, Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
- Elisabete Carolino, Research Center for Psychological Science (CICPSI), Faculdade de Psicologia, Universidade de Lisboa, Av. D. João II, Lote 4.69.01, Parque das Nações, Lisboa, 1990-096, Portugal
- Marco Caetano, Escola Superior de Tecnologia da Saúde, Instituto Politécnico de Lisboa, Lisboa, Portugal
7. Al Barajraji M, Barrit S, Ben-Hamouda N, Harel E, Torcida N, Pizzarotti B, Massager N, Lechien JR. AI-Driven Information for Relatives of Patients with Malignant Middle Cerebral Artery Infarction: A Preliminary Validation Study Using GPT-4o. Brain Sci 2025; 15:391. PMID: 40309831; PMCID: PMC12026103. DOI: 10.3390/brainsci15040391.
Abstract
Purpose: This study examines GPT-4o's ability to communicate effectively with relatives of patients undergoing decompressive hemicraniectomy (DHC) after malignant middle cerebral artery infarction (MMCAI). Methods: GPT-4o was asked 25 common questions from patients' relatives about DHC for MMCAI, twice over a 7-day interval. Responses were rated for accuracy, clarity, relevance, completeness, sourcing, and usefulness by a board-certified intensivist, a neurologist, and two neurosurgeons using the Quality Analysis of Medical AI (QAMAI) tool. Interrater reliability and stability were measured using the ICC and Pearson's correlation. Results: The total QAMAI scores were 22.32 ± 3.08 for the intensivist, 24.68 ± 2.8 for the neurologist, and 23.36 ± 2.86 and 26.32 ± 2.91 for the two neurosurgeons, representing moderate-to-high accuracy. The evaluators showed moderate interrater agreement (ICC 0.631, 95% CI: 0.321-0.821). The highest subscores were for accuracy, clarity, and relevance, while the poorest were for completeness, usefulness, and sourcing. GPT-4o did not systematically provide references for its responses. The stability analysis showed moderate-to-high stability. The readability assessment revealed an FRE score of 7.23, an FKG score of 15.87 and a GF index of 18.15. Conclusions: GPT-4o provides moderate-to-high quality information related to DHC for MMCAI, with strengths in accuracy, clarity, and relevance. However, limitations in completeness, sourcing, and readability may impact its effectiveness in educating patients or their relatives.
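The interrater reliability statistic reported above is an intraclass correlation coefficient (ICC). The abstract does not specify which ICC form was used; the sketch below shows one common way to compute the full set of ICC estimates with the pingouin package, using hypothetical QAMAI totals rather than the study's data.

```python
import pandas as pd
import pingouin as pg  # assumed available; any ICC implementation would do

# Hypothetical QAMAI total scores from four raters on five questions (not study data).
scores = {
    "intensivist":   [22, 24, 20, 23, 25],
    "neurologist":   [25, 26, 23, 24, 27],
    "neurosurgeon1": [23, 24, 22, 23, 25],
    "neurosurgeon2": [27, 26, 25, 26, 28],
}

# Reshape to long format: one row per (question, rater) pair.
long = pd.DataFrame(
    [(q, rater, s) for rater, vals in scores.items() for q, s in enumerate(vals)],
    columns=["question", "rater", "score"],
)

# Returns ICC1/ICC2/ICC3 (and their averaged-rater variants) with confidence intervals.
icc = pg.intraclass_corr(data=long, targets="question", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```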
Affiliation(s)
- Mejdeddine Al Barajraji, Department of Neurosurgery, University Hospital of Lausanne and University of Lausanne, 1005 Lausanne, Switzerland
- Sami Barrit, Department of Neurosurgery, CHU Tivoli, 7110 La Louvière, Belgium
- Nawfel Ben-Hamouda, Department of Adult Intensive Care, University Hospital of Lausanne (CHUV), University of Lausanne, 1005 Lausanne, Switzerland
- Ethan Harel, Department of Neurosurgery, University Hospital of Lausanne and University of Lausanne, 1005 Lausanne, Switzerland
- Nathan Torcida, Department of Neurology, Hôpital Universitaire de Bruxelles (HUB), 1070 Brussels, Belgium
- Beatrice Pizzarotti, Department of Neurology, University Hospital of Lausanne (CHUV), University of Lausanne, 1011 Lausanne, Switzerland
- Nicolas Massager, Department of Neurosurgery, CHU Tivoli, 7110 La Louvière, Belgium
- Jerome R. Lechien, Department of Surgery, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), 7000 Mons, Belgium; Department of Otolaryngology, Elsan Polyclinic of Poitiers, 86000 Poitiers, France; Department of Otolaryngology-Head Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), 78035 Paris, France
8. Chen D, Avison K, Alnassar S, Huang RS, Raman S. Medical accuracy of artificial intelligence chatbots in oncology: a scoping review. Oncologist 2025; 30:oyaf038. PMID: 40285677; PMCID: PMC12032582. DOI: 10.1093/oncolo/oyaf038.
Abstract
BACKGROUND Recent advances in large language models (LLM) have enabled human-like qualities of natural language competency. Applied to oncology, LLMs have been proposed to serve as an information resource and interpret vast amounts of data as a clinical decision-support tool to improve clinical outcomes. OBJECTIVE This review aims to describe the current status of medical accuracy of oncology-related LLM applications and research trends for further areas of investigation. METHODS A scoping literature search was conducted on Ovid Medline for peer-reviewed studies published since 2000. We included primary research studies that evaluated the medical accuracy of a large language model applied in oncology settings. Study characteristics and primary outcomes of included studies were extracted to describe the landscape of oncology-related LLMs. RESULTS Sixty studies were included based on the inclusion and exclusion criteria. The majority of studies evaluated LLMs in oncology as a health information resource in question-answer style examinations (48%), followed by diagnosis (20%) and management (17%). The number of studies that evaluated the utility of fine-tuning and prompt-engineering LLMs increased over time from 2022 to 2024. Studies reported the advantages of LLMs as an accurate information resource, reduction of clinician workload, and improved accessibility and readability of clinical information, while noting disadvantages such as poor reliability, hallucinations, and need for clinician oversight. DISCUSSION There exists significant interest in the application of LLMs in clinical oncology, with a particular focus as a medical information resource and clinical decision support tool. However, further research is needed to validate these tools in external hold-out datasets for generalizability and to improve medical accuracy across diverse clinical scenarios, underscoring the need for clinician supervision of these tools.
Affiliation(s)
- David Chen, Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada; Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Kate Avison, Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada; Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Saif Alnassar, Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada; Department of Systems Design Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Ryan S Huang, Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada; Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada
- Srinivas Raman, Princess Margaret Hospital Cancer Centre, Radiation Medicine Program, Toronto, ON M5G 2C4, Canada; Temerty Faculty of Medicine, University of Toronto, Toronto, ON M5S 3K3, Canada; Department of Radiation Oncology, University of Toronto, Toronto, ON M5T 1P5, Canada; Department of Radiation Oncology, BC Cancer, Vancouver, BC V5Z 1G1, Canada; Division of Radiation Oncology, University of British Columbia, Vancouver, BC V5Z 1M9, Canada
9. Tran QL, Huynh PP, Le B, Jiang N. Utilization of Artificial Intelligence in the Creation of Patient Information on Laryngology Topics. Laryngoscope 2025; 135:1295-1300. PMID: 39503508. DOI: 10.1002/lary.31891.
Abstract
OBJECTIVE To evaluate and compare the readability and quality of patient information generated by Chat-Generative Pre-Trained Transformer-3.5 (ChatGPT) and the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) using validated instruments including Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease, DISCERN, and Patient Education Materials Assessment Tool for Printable Materials (PEMAT-P). METHODS ENTHealth.org and ChatGPT-3.5 were queried for patient information on laryngology topics. ChatGPT-3.5 was queried twice for a given topic to evaluate reliability. This generated three de-identified text documents for each topic: one from AAO-HNS and two from ChatGPT (ChatGPT Output 1, ChatGPT Output 2). Grade level and reading ease were compared between the three sources using a one-way analysis of variance and Tukey's post hoc test. Independent t-tests were used to compare DISCERN and PEMAT understandability and actionability scores between AAO-HNS and ChatGPT Output 1. RESULTS Material generated by ChatGPT Output 1 and ChatGPT Output 2 was at least two reading grade levels higher than material from AAO-HNS (p < 0.001). Regarding reading ease, ChatGPT Output 1 and ChatGPT Output 2 documents had significantly lower mean scores compared to AAO-HNS (p < 0.001). Moreover, ChatGPT Output 1 material on vocal cord paralysis had a lower PEMAT-P understandability compared to that of AAO-HNS material (p > 0.05). CONCLUSION Patient information on the ENTHealth.org website for select laryngology topics was, on average, of a lower grade level and higher reading ease compared to that produced by ChatGPT, but interestingly with largely no difference in the quality of information provided. LEVEL OF EVIDENCE NA Laryngoscope, 135:1295-1300, 2025.
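The comparison described above (one-way ANOVA followed by Tukey's post hoc test across three text sources) can be reproduced with standard Python statistics libraries. The grade-level values below are invented placeholders, not the study's measurements.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical Flesch-Kincaid grade levels per source (not the study's data).
aao_hns  = np.array([8.1, 7.5, 9.0, 8.4, 7.9])
gpt_out1 = np.array([10.8, 11.2, 10.5, 11.9, 10.7])
gpt_out2 = np.array([11.0, 10.9, 11.4, 11.6, 10.6])

# One-way ANOVA across the three sources.
print(f_oneway(aao_hns, gpt_out1, gpt_out2))

# Tukey's HSD post hoc test on the pooled observations.
grades = np.concatenate([aao_hns, gpt_out1, gpt_out2])
groups = ["AAO-HNS"] * 5 + ["ChatGPT Output 1"] * 5 + ["ChatGPT Output 2"] * 5
print(pairwise_tukeyhsd(grades, groups))
```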
Affiliation(s)
- Quynh-Lam Tran, Department of Head and Neck Surgery, Kaiser Permanente, Oakland Medical Center, Oakland, California, U.S.A.; Geisel School of Medicine at Dartmouth, Hanover, New Hampshire, U.S.A.
- Pauline P Huynh, Department of Head and Neck Surgery, Kaiser Permanente, Oakland Medical Center, Oakland, California, U.S.A.
- Bryan Le, Department of Head and Neck Surgery, Kaiser Permanente, Oakland Medical Center, Oakland, California, U.S.A.
- Nancy Jiang, Department of Head and Neck Surgery, Kaiser Permanente, Oakland Medical Center, Oakland, California, U.S.A.
10. Niriella MA, Premaratna P, Senanayake M, Kodisinghe S, Dassanayake U, Dassanayake A, Ediriweera DS, de Silva HJ. The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study. Expert Rev Gastroenterol Hepatol 2025; 19:437-442. PMID: 39985424. DOI: 10.1080/17474124.2025.2471874.
Abstract
BACKGROUND We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information. RESEARCH DESIGN AND METHODS We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response. RESULTS The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We found no statistical difference between rank totals in accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] between the three raters (R1, R2, R3). CONCLUSION Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.
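The Kruskal-Wallis test used above compares score distributions across independent groups without assuming normality. A minimal sketch follows; the rating vectors are hypothetical and do not reflect the study's data.

```python
from scipy.stats import kruskal

# Hypothetical 5-point quality ratings for six FAQs per source (not the study's data).
expert_1 = [5, 4, 5, 5, 4, 5]
expert_2 = [5, 5, 4, 5, 4, 4]
chatgpt  = [4, 5, 5, 4, 5, 4]
gemini   = [5, 4, 4, 5, 4, 5]

h, p = kruskal(expert_1, expert_2, chatgpt, gemini)
print(f"H = {h:.3f}, p = {p:.3f}")  # p > 0.05 would indicate no significant difference
```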
11. Balci AS, Çakmak S. Evaluating the Accuracy and Readability of ChatGPT-4o's Responses to Patient-Based Questions about Keratoconus. Ophthalmic Epidemiol 2025:1-6. PMID: 40154955. DOI: 10.1080/09286586.2025.2484760.
Abstract
PURPOSE This study aimed to evaluate the accuracy and readability of responses generated by ChatGPT-4o, an advanced large language model, to frequently asked patient-centered questions about keratoconus. METHODS A cross-sectional, observational study was conducted using ChatGPT-4o to answer 30 potential questions that could be asked by patients with keratoconus. The accuracy of the responses was evaluated by two board-certified ophthalmologists and scored on a scale of 1 to 5. Readability was assessed using the Simple Measure of Gobbledygook (SMOG), Flesch-Kincaid Grade Level (FKGL), and Flesch Reading Ease (FRE) scores. Descriptive, treatment-related, and follow-up-related questions were analyzed, and statistical comparisons between these categories were performed. RESULTS The mean accuracy score for the responses was 4.48 ± 0.57 on a 5-point Likert scale. The interrater reliability, with an intraclass correlation coefficient of 0.769, indicated a strong level of agreement. Readability scores revealed a SMOG score of 15.49 ± 1.74, an FKGL score of 14.95 ± 1.95, and an FRE score of 27.41 ± 9.71, indicating that a high level of education is required to comprehend the responses. There was no significant difference in accuracy among the different question categories (p = 0.161), but readability varied significantly, with treatment-related questions being the easiest to understand. CONCLUSION ChatGPT-4o provides highly accurate responses to patient-centered questions about keratoconus, though the complexity of its language may limit accessibility for the general population. Further development is needed to enhance the readability of AI-generated medical content.
Affiliation(s)
- Ali Safa Balci, Department of Ophthalmology, Sehit Prof. Dr. Ilhan Varank Sancaktepe Training and Research Hospital, University of Health Sciences, Istanbul, Türkiye
- Semih Çakmak, Department of Ophthalmology, Istanbul Faculty of Medicine, Istanbul University, Istanbul, Türkiye
12. Hunter N, Allen D, Xiao D, Cox M, Jain K. Patient education resources for oral mucositis: a Google search and ChatGPT analysis. Eur Arch Otorhinolaryngol 2025; 282:1609-1618. PMID: 39198303. DOI: 10.1007/s00405-024-08913-5.
Abstract
PURPOSE Oral mucositis affects 90% of patients receiving chemotherapy or radiation for head and neck malignancies. Many patients use the internet to learn about their condition and treatments; however, the quality of online resources is not guaranteed. Our objective was to determine the most common Google searches related to "oral mucositis" and assess the quality and readability of available resources compared to ChatGPT-generated responses. METHODS Data related to Google searches for "oral mucositis" were analyzed. People Also Ask (PAA) questions (generated by Google) related to searches for "oral mucositis" were documented. Google resources were rated on quality, understandability, ease of reading, and reading grade level using the Journal of the American Medical Association benchmark criteria, Patient Education Materials Assessment Tool, Flesch Reading Ease Score, and Flesch-Kincaid Grade Level, respectively. ChatGPT-generated responses to the most popular PAA questions were rated using identical metrics. RESULTS Google search popularity for "oral mucositis" has significantly increased since 2004. 78% of the Google resources answered the associated PAA question, and 6% met the criteria for universal readability. 100% of the ChatGPT-generated responses answered the prompt, and 20% met the criteria for universal readability when asked to write for the appropriate audience. CONCLUSION Most resources provided by Google do not meet the criteria for universal readability. When prompted specifically, ChatGPT-generated responses were consistently more readable than Google resources. After verification of accuracy by healthcare professionals, ChatGPT could be a reasonable alternative to generate universally readable patient education resources.
Affiliation(s)
- Nathaniel Hunter, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
- David Allen, Department of Otorhinolaryngology-Head and Neck Surgery, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Daniel Xiao, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Madisyn Cox, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Kunal Jain, Department of Otorhinolaryngology-Head and Neck Surgery, The University of Texas Health Science Center at Houston, Houston, TX, USA
13. Mete U, Özmen ÖA. Assessing the accuracy and reproducibility of ChatGPT for responding to patient inquiries about otosclerosis. Eur Arch Otorhinolaryngol 2025; 282:1567-1575. PMID: 39461921. DOI: 10.1007/s00405-024-09039-4.
Abstract
BACKGROUND Patients increasingly use chatbots powered by artificial intelligence to seek information. However, there is a lack of reliable studies on the accuracy and reproducibility of the information provided by these models. Therefore, we conducted a study investigating ChatGPT's responses to questions about otosclerosis. METHODS Ninety-six otosclerosis-related questions were collected from internet searches and websites of professional institutions and societies. Questions were divided into four sub-categories. These questions were directed at the latest version of ChatGPT Plus, and the responses were assessed by two otorhinolaryngology surgeons. Accuracy was graded as correct, incomplete, mixed, and irrelevant. Reproducibility was evaluated by comparing the consistency of the two answers to each specific question. RESULTS The overall accuracy and reproducibility rates of GPT-4o for correct answers were found to be 64.60% and 89.60%, respectively. By category, the accuracy and reproducibility rates for correct answers were 64.70% and 91.20% for basic knowledge; 64.0% and 92.0% for diagnosis & management; 52.95% and 82.35% for medical & surgical treatment; and 75.0% and 90.0% for operative risks & the postoperative period. There were no significant differences between the categories in terms of accuracy and reproducibility (p = 0.073 and p = 0.752, respectively). CONCLUSION GPT-4o achieved satisfactory accuracy results, except in the diagnosis & management and medical & surgical treatment categories. Reproducibility was generally high across all categories. With the audio and visual communication capabilities of GPT-4o, under the supervision of a medical professional, this model can be utilized to provide medical information and support for otosclerosis patients.
Affiliation(s)
- Utku Mete, Department of Otolaryngology-Head and Neck Surgery, Bursa Uludag University, Faculty of Medicine, Görükle Center Campus, Nilüfer, Bursa, 16059, Turkey
- Ömer Afşın Özmen, Department of Otolaryngology-Head and Neck Surgery, Bursa Uludag University, Faculty of Medicine, Görükle Center Campus, Nilüfer, Bursa, 16059, Turkey
14. Gary AA, Lai JM, Locatelli EVT, Falcone MM, Cavuoto KM. Accuracy and Readability of ChatGPT Responses to Patient-Centric Strabismus Questions. J Pediatr Ophthalmol Strabismus 2025:1-8. PMID: 39969263. DOI: 10.3928/01913913-20250110-02.
Abstract
PURPOSE To assess the medical accuracy and readability of responses provided by ChatGPT (OpenAI), the most widely used artificial intelligence-powered chat-bot, regarding questions about strabismus. METHODS Thirty-four questions were input into ChatGPT 3.5 (free version) and 4.0 (paid version) at three time intervals (day 0, 1 week, and 1 month) in two distinct geographic locations (California and Florida) in March 2024. Two pediatric ophthalmologists rated responses as "acceptable," "accurate but missing key information or minor inaccuracies," or "inaccurate and potentially harmful." The online tool, Readable, measured the Flesch-Kincaid Grade Level and Flesch Reading Ease Score to assess readability. RESULTS Overall, 64% of responses by ChatGPT were "acceptable;" but the proportion of "acceptable" responses differed by version (47% for ChatGPT 3.5 vs 53% for 4.0, P < .05) and state (77% of California vs 51% of Florida, P < .001). Responses in Florida were more likely to be "inaccurate and potentially harmful" compared to those in California (6.9% vs. 1.5%, P < .001). Over 1 month, the overall percentage of "acceptable" responses increased (60% at day 0, 64% at 1 week, and 67% at 1 month, P > .05), whereas "inaccurate and potentially harmful" responses decreased (5% at day 0, 5% at 1 week, and 3% at 1 month, P > .05). On average, responses scored a Flesch-Kincaid Grade Level score of 15, equating to a higher than high school grade reading level. CONCLUSIONS Although most of ChatGPT's responses to strabismus questions were clinically acceptable, there were variations in responses across time and geographic regions. The average reading level exceeded a high school level and demonstrated low readability. Although ChatGPT demonstrates potential as a supplementary resource for parents and patients with strabismus, improving the accuracy and readability of free versions of ChatGPT may increase its utility. [J Pediatr Ophthalmol Strabismus. 20XX;X(X):XXX-XXX.].
15. Busch F, Hoffmann L, Rueger C, van Dijk EH, Kader R, Ortiz-Prado E, Makowski MR, Saba L, Hadamitzky M, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Current applications and challenges in large language models for patient care: a systematic review. Communications Medicine 2025; 5:26. PMID: 39838160; PMCID: PMC11751060. DOI: 10.1038/s43856-024-00717-2.
Abstract
BACKGROUND The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care. METHODS We systematically searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4349 initial records, 89 studies across 29 medical specialties were included. Quality assessment was performed using the Mixed Methods Appraisal Tool 2018. A data-driven convergent synthesis approach was applied for thematic syntheses of LLM applications and limitations using free line-by-line coding in Dedoose. RESULTS We show that most studies investigate Generative Pre-trained Transformers (GPT)-3.5 (53.2%, n = 66 of 124 different LLMs examined) and GPT-4 (26.6%, n = 33/124) in answering medical questions, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations include 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations include 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. CONCLUSIONS This review systematically maps LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
Affiliation(s)
- Felix Busch, School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- Lena Hoffmann, Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Christopher Rueger, Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Elon Hc van Dijk, Department of Ophthalmology, Leiden University Medical Center, Leiden, The Netherlands; Department of Ophthalmology, Sir Charles Gairdner Hospital, Perth, Australia
- Rawen Kader, Division of Surgery and Interventional Sciences, University College London, London, United Kingdom
- Esteban Ortiz-Prado, One Health Research Group, Faculty of Health Science, Universidad de Las Américas, Quito, Ecuador
- Marcus R Makowski, School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- Luca Saba, Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy
- Martin Hadamitzky, School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
- Jakob Nikolas Kather, Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany; Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
- Daniel Truhn, Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
- Renato Cuocolo, Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy
- Lisa C Adams, School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- Keno K Bressem, School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany; School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
16. Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a large language model's ability to answer clinicians' requests for evidence summaries. J Med Libr Assoc 2025; 113:65-77. PMID: 39975503; PMCID: PMC11835037. DOI: 10.5195/jmla.2025.1985.
Abstract
Objective This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions in comparison with medical librarians' gold-standard evidence syntheses. Methods Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question into aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements used in the established gold-standard summary of the librarian. A subset of questions was randomly selected for verification of references provided by aiChat. Results Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p=0.73). For a subset of 30% (n=66) of questions, 162 references were provided in the aiChat summaries, and 60 (37%) were confirmed as nonfabricated. Conclusions Overall, the performance of a generative AI tool was promising. However, many included references could not be independently verified, and attempts were not made to assess whether any additional concepts introduced by aiChat were factually accurate. Thus, we envision this being the first of a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
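The COSTAR framework referenced above is a prompt-structuring convention commonly expanded as Context, Objective, Style, Tone, Audience, and Response. The template below is only a hypothetical illustration of that structure, not the standardized prompt used in the study.

```python
# Hypothetical COSTAR-style prompt skeleton; the wording is illustrative,
# not the study's actual prompt.
COSTAR_PROMPT = """\
# CONTEXT
You are assisting a medical librarian who answers clinicians' evidence requests.

# OBJECTIVE
Summarize the best available evidence for the clinical question below and cite sources.

# STYLE
Concise evidence synthesis with bulleted key findings.

# TONE
Neutral and professional.

# AUDIENCE
Practicing clinicians at an academic medical center.

# RESPONSE
A short summary followed by a numbered reference list.

CLINICAL QUESTION: {question}
"""

# Example use with an invented clinical question.
print(COSTAR_PROMPT.format(question="Does topical tranexamic acid reduce epistaxis recurrence?"))
```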
Affiliation(s)
- Mallory N Blasingame, Information Scientist & Assistant Director for Evidence Provision, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Taneya Y Koonce, Deputy Director, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Annette M Williams, Senior Information Scientist and Associate Director for Metadata Management, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Dario A Giuse, Associate Professor, Department of Biomedical Informatics, Vanderbilt University School of Medicine and Vanderbilt University Medical Center, Nashville, TN
- Jing Su, Senior Information Scientist, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Poppy A Krump, Information Scientist, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
- Nunzia Bettinsoli Giuse, Professor of Biomedical Informatics and Professor of Medicine; Vice President for Knowledge Management; and Director, Center for Knowledge Management, Vanderbilt University Medical Center, Nashville, TN
17. Ghozali MT. Assessing ChatGPT's accuracy and reliability in asthma general knowledge: implications for artificial intelligence use in public health education. J Asthma 2025:1-9. PMID: 39773167. DOI: 10.1080/02770903.2025.2450482.
Abstract
BACKGROUND Integrating Artificial Intelligence (AI) into public health education represents a pivotal advancement in medical knowledge dissemination, particularly for chronic diseases such as asthma. This study assesses the accuracy and comprehensiveness of ChatGPT, a conversational AI model, in providing asthma-related information. METHODS Employing a rigorous mixed-methods approach, healthcare professionals evaluated ChatGPT's responses to the Asthma General Knowledge Questionnaire for Adults (AGKQA), a standardized instrument covering various asthma-related topics. Responses were graded for accuracy and completeness and analyzed using statistical tests to assess reproducibility and consistency. RESULTS ChatGPT showed notable proficiency in conveying asthma knowledge, with flawless success in the etiology and pathophysiology categories and substantial accuracy in medication information (70%). However, limitations were noted in medication-related responses, where mixed accuracy (30%) highlights the need for further refinement of ChatGPT's capabilities to ensure reliability in critical areas of asthma education. Reproducibility analysis demonstrated a consistent 100% rate across all categories, affirming ChatGPT's reliability in delivering uniform information. Statistical analyses further underscored ChatGPT's stability and reliability. CONCLUSION These findings underscore ChatGPT's promise as a valuable educational tool for asthma while emphasizing the necessity of ongoing improvements to address observed limitations, particularly regarding medication-related information.
Affiliation(s)
- Muhammad Thesa Ghozali, Department of Pharmaceutical Management, School of Pharmacy, Faculty of Medicine and Health Sciences, Universitas Muhammadiyah Yogyakarta
18. Gorris MA, Randle RW, Obermiller CS, Thomas J, Toro-Tobon D, Dream SY, Fackelmayer OJ, Pandian TK, Mayson SE. Assessing ChatGPT's Capability in Addressing Thyroid Cancer Patient Queries: A Comprehensive Mixed-Methods Evaluation. J Endocr Soc 2025; 9:bvaf003. PMID: 39881674; PMCID: PMC11775116. DOI: 10.1210/jendso/bvaf003.
Abstract
Context Literature suggests patients with thyroid cancer have unmet informational needs in many aspects of care. Patients often turn to online resources for their health-related information, and generative artificial intelligence programs such as ChatGPT are an emerging and attractive resource for patients. Objective To assess the quality of ChatGPT's responses to thyroid cancer-related questions. Methods Four endocrinologists and 4 endocrine surgeons, all with expertise in thyroid cancer, evaluated the responses to 20 thyroid cancer-related questions. Responses were scored on a 7-point Likert scale in areas of accuracy, completeness, and overall satisfaction. Comments from the evaluators were aggregated and a qualitative analysis was performed. Results Overall, evaluators "agreed" or "strongly agreed" that ChatGPT's answers were accurate, complete, and satisfactory for only 57%, 56%, and 52% of responses, respectively. One hundred ninety-eight free-text comments were included in the qualitative analysis. The majority of comments were critical in nature. Several themes emerged, which included overemphasis of diet and iodine intake and its role in thyroid cancer, and incomplete or inaccurate information on risks of both thyroid surgery and radioactive iodine therapy. Conclusion Our study suggests that ChatGPT is not accurate or reliable enough at this time for unsupervised use as a patient information tool for thyroid cancer.
Affiliation(s)
- Matthew A Gorris, Division of Endocrinology and Metabolism, Wake Forest University School of Medicine, Winston Salem, NC 27101, USA
- Reese W Randle, Department of Surgery, Section of Surgical Oncology, Wake Forest University School of Medicine, Winston Salem, NC 27101, USA
- Corey S Obermiller, Informatics and Analytics, Department of Internal Medicine, Atrium Health Wake Forest Baptist, Winston Salem, NC 27157, USA
- Johnson Thomas, Department of Endocrinology, Mercy Health, Springfield, MO 65807, USA
- David Toro-Tobon, Division of Endocrinology, Diabetes, Metabolism, and Nutrition, Mayo Clinic, Rochester, MN 55905, USA
- Sophie Y Dream, Division of Surgery, Medical College of Wisconsin, Milwaukee, WI 53226, USA
- Oliver J Fackelmayer, Division of General, Endocrine and Metabolic Surgery, University of Kentucky, Lexington, KY 40508, USA
- T K Pandian, Department of Surgery, Section of Surgical Oncology, Wake Forest University School of Medicine, Winston Salem, NC 27101, USA
- Sarah E Mayson, Division of Endocrinology, Metabolism and Diabetes, University of Colorado School of Medicine, Aurora, CO 80045, USA
19. Zitu MM, Le TD, Duong T, Haddadan S, Garcia M, Amorrortu R, Zhao Y, Rollison DE, Thieu T. Large language models in cancer: potentials, risks, and safeguards. BJR Artificial Intelligence 2025; 2:ubae019. PMID: 39777117; PMCID: PMC11703354. DOI: 10.1093/bjrai/ubae019.
Abstract
This review examines the use of large language models (LLMs) in cancer, analysing articles sourced from PubMed, Embase, and Ovid Medline, published between 2017 and 2024. Our search strategy included terms related to LLMs, cancer research, risks, safeguards, and ethical issues, focusing on studies that utilized text-based data. 59 articles were included in the review, categorized into 3 segments: quantitative studies on LLMs, chatbot-focused studies, and qualitative discussions on LLMs on cancer. Quantitative studies highlight LLMs' advanced capabilities in natural language processing (NLP), while chatbot-focused articles demonstrate their potential in clinical support and data management. Qualitative research underscores the broader implications of LLMs, including the risks and ethical considerations. Our findings suggest that LLMs, notably ChatGPT, have potential in data analysis, patient interaction, and personalized treatment in cancer care. However, the review identifies critical risks, including data biases and ethical challenges. We emphasize the need for regulatory oversight, targeted model development, and continuous evaluation. In conclusion, integrating LLMs in cancer research offers promising prospects but necessitates a balanced approach focusing on accuracy, ethical integrity, and data privacy. This review underscores the need for further study, encouraging responsible exploration and application of artificial intelligence in oncology.
Collapse
Affiliation(s)
- Md Muntasir Zitu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Tuan Dung Le
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Duong
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Shohreh Haddadan
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Melany Garcia
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Rossybelle Amorrortu
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Yayi Zhao
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Dana E Rollison
- Department of Cancer Epidemiology, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| | - Thanh Thieu
- Department of Machine Learning, Moffitt Cancer Center, Tampa, FL, Moffitt Cancer Center and Research Institute, Tampa, FL 33612, United States
| |
Collapse
|
20
|
Chen JS, Reddy AJ, Al-Sharif E, Shoji MK, Kalaw FGP, Eslani M, Lang PZ, Arya M, Koretz ZA, Bolo KA, Arnett JJ, Roginiel AC, Do JL, Robbins SL, Camp AS, Scott NL, Rudell JC, Weinreb RN, Baxter SL, Granet DB. Analysis of ChatGPT Responses to Ophthalmic Cases: Can ChatGPT Think like an Ophthalmologist? OPHTHALMOLOGY SCIENCE 2025; 5:100600. [PMID: 39346575 PMCID: PMC11437840 DOI: 10.1016/j.xops.2024.100600] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 01/20/2024] [Revised: 08/09/2024] [Accepted: 08/13/2024] [Indexed: 10/01/2024]
Abstract
Objective Large language models such as ChatGPT have demonstrated significant potential in question-answering within ophthalmology, but there is a paucity of literature evaluating their ability to generate clinical assessments and discussions. The objectives of this study were to (1) assess the accuracy of assessments and plans generated by ChatGPT and (2) evaluate ophthalmologists' abilities to distinguish between responses generated by clinicians and by ChatGPT. Design Cross-sectional mixed-methods study. Subjects Sixteen ophthalmologists from a single academic center, of whom 10 were board-eligible and 6 were board-certified, were recruited to participate in this study. Methods Prompt engineering was used to ensure ChatGPT output discussions in the style of the ophthalmologist author of the Medical College of Wisconsin Ophthalmic Case Studies. Cases where ChatGPT accurately identified the primary diagnoses were included and then paired. Masked human-generated and ChatGPT-generated discussions were sent to participating ophthalmologists to identify the author of the discussions. Response confidence was assessed using a 5-point Likert scale score, and subjective feedback was manually reviewed. Main Outcome Measures Accuracy of ophthalmologist identification of discussion author, as well as subjective perceptions of human-generated versus ChatGPT-generated discussions. Results Overall, ChatGPT correctly identified the primary diagnosis in 15 of 17 (88.2%) cases. Two cases were excluded from the paired comparison due to hallucinations or fabrications of non-user-provided data. Ophthalmologists correctly identified the author in 77.9% ± 26.6% of the 13 included cases, with a mean Likert scale confidence rating of 3.6 ± 1.0. No significant differences in performance or confidence were found between board-certified and board-eligible ophthalmologists. Subjectively, ophthalmologists found that discussions written by ChatGPT tended to contain more generic responses and irrelevant information, hallucinate more frequently, and exhibit distinct syntactic patterns (all P < 0.01). Conclusions Large language models have the potential to synthesize clinical data and generate ophthalmic discussions. While these findings have exciting implications for artificial intelligence-assisted health care delivery, more rigorous real-world evaluation of these models is necessary before clinical deployment. Financial Disclosures The author(s) have no proprietary or commercial interest in any materials discussed in this article.
Collapse
Affiliation(s)
- Jimmy S. Chen
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Akshay J. Reddy
- School of Medicine, California University of Science and Medicine, Colton, California
| | - Eman Al-Sharif
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- Surgery Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| | - Marissa K. Shoji
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Fritz Gerald P. Kalaw
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Medi Eslani
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Paul Z. Lang
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Malvika Arya
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Zachary A. Koretz
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Kyle A. Bolo
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Justin J. Arnett
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Aliya C. Roginiel
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Jiun L. Do
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Shira L. Robbins
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Andrew S. Camp
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Nathan L. Scott
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Jolene C. Rudell
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| | - Robert N. Weinreb
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - Sally L. Baxter
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California
| | - David B. Granet
- Viterbi Family Department of Ophthalmology, Shiley Eye Institute, University of California, San Diego, La Jolla, California
| |
Collapse
|
21
|
Anees M, Shaikh FA, Shaikh H, Siddiqui NA, Rehman ZU. Assessing the quality of ChatGPT's responses to questions related to radiofrequency ablation for varicose veins. J Vasc Surg Venous Lymphat Disord 2025; 13:101985. [PMID: 39332626 PMCID: PMC11764857 DOI: 10.1016/j.jvsv.2024.101985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2024] [Revised: 09/16/2024] [Accepted: 09/17/2024] [Indexed: 09/29/2024]
Abstract
OBJECTIVE This study aimed to evaluate the accuracy and reproducibility of information provided by ChatGPT in response to frequently asked questions about radiofrequency ablation (RFA) for varicose veins. METHODS This cross-sectional study was conducted at The Aga Khan University Hospital, Karachi, Pakistan. A set of 18 frequently asked questions regarding RFA for varicose veins was compiled from credible online sources and presented to ChatGPT twice, separately, using the new chat option. Twelve experienced vascular surgeons (with >2 years of experience and ≥20 RFA procedures performed annually) independently evaluated the accuracy of the responses using a 4-point Likert scale and assessed their reproducibility. RESULTS Most evaluators were male (n = 10/12 [83.3%]) with an average of 12.3 ± 6.2 years of experience as a vascular surgeon. Six evaluators (50%) were from the UK, followed by three from Saudi Arabia (25.0%), two from Pakistan (16.7%), and one from the United States (8.3%). Of the 216 accuracy grades, most rated the responses as comprehensive (n = 87/216 [40.3%]) or accurate but insufficient (n = 70/216 [32.4%]), whereas only 17.1% (n = 37/216) were graded as a mixture of accurate and inaccurate information and 10.8% (n = 22/216) as entirely inaccurate. Overall, 89.8% of the responses (n = 194/216) were deemed reproducible. Of the total responses, 70.4% (n = 152/216) were classified as good quality and reproducible. The remaining responses were of poor quality, with 19.4% reproducible (n = 42/216) and 10.2% not reproducible (n = 22/216). Inter-rater agreement among the vascular surgeons for the overall responses was minimal and not statistically significant (Fleiss' kappa, -0.028; P = .131). CONCLUSIONS ChatGPT provided generally accurate and reproducible information on RFA for varicose veins; however, variability in response quality and limited inter-rater reliability highlight the need for further improvements. Although it has the potential to enhance patient education and support healthcare decision-making, improvements in its training, validation, transparency, and mechanisms to address inaccurate or incomplete information are essential.
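The Fleiss' kappa reported above is computed from a responses-by-raters grade matrix. A minimal sketch of that calculation, assuming the statsmodels package and placeholder grades rather than the study data:

    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Placeholder data: 18 responses (rows) graded by 12 surgeons (columns) on a 4-point scale.
    rng = np.random.default_rng(0)
    ratings = rng.integers(1, 5, size=(18, 12))

    # Convert raw grades to a response-by-category count table, then compute Fleiss' kappa.
    counts, _ = aggregate_raters(ratings)
    print(fleiss_kappa(counts, method="fleiss"))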
Collapse
Affiliation(s)
- Muhammad Anees
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Fareed Ahmed Shaikh
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan.
| | | | - Nadeem Ahmed Siddiqui
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| | - Zia Ur Rehman
- Section of Vascular Surgery, Department of Surgery, Aga Khan University Hospital, Karachi, Pakistan
| |
Collapse
|
22
|
Villarreal-Espinosa JB, Berreta RS, Allende F, Garcia JR, Ayala S, Familiari F, Chahla J. Accuracy assessment of ChatGPT responses to frequently asked questions regarding anterior cruciate ligament surgery. Knee 2024; 51:84-92. [PMID: 39241674 DOI: 10.1016/j.knee.2024.08.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 09/09/2024]
Abstract
BACKGROUND The emergence of artificial intelligence (AI) has allowed users to access large sources of information in a chat-like manner. We therefore sought to evaluate the accuracy of ChatGPT-4's responses to the 10 most frequently asked patient questions (FAQs) regarding anterior cruciate ligament (ACL) surgery. METHODS A list of the top 10 FAQs pertaining to ACL surgery was created after conducting a search through all Sports Medicine Fellowship Institutions listed on the Arthroscopy Association of North America (AANA) and American Orthopaedic Society for Sports Medicine (AOSSM) websites. A Likert scale was used to grade response accuracy by two sports medicine fellowship-trained surgeons. Cohen's kappa was used to assess inter-rater agreement. Reproducibility of the responses over time was also assessed. RESULTS Five of the 10 responses received a 'completely accurate' grade from both fellowship-trained surgeons, and three additional responses were graded 'completely accurate' by at least one. Moreover, the inter-rater reliability assessment revealed moderate agreement between the fellowship-trained attending physicians (weighted kappa = 0.57, 95% confidence interval 0.15-0.99). Additionally, 80% of the responses were reproducible over time. CONCLUSION ChatGPT can be considered an accurate additional tool to answer general patient questions regarding ACL surgery. Nonetheless, patient-surgeon interaction should not be deferred and must continue to be the driving force for information retrieval. Thus, the general recommendation is to address any questions in the presence of a qualified specialist.
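The weighted kappa quoted above reflects chance-corrected agreement between the two graders, with partial credit for near-misses on the Likert scale. A minimal sketch, assuming scikit-learn and hypothetical grades rather than the study data:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical 4-point Likert accuracy grades from the two fellowship-trained surgeons.
    rater_a = [4, 3, 4, 2, 4, 3, 4, 4, 3, 4]
    rater_b = [4, 4, 4, 2, 3, 3, 4, 4, 2, 4]

    # Linear weighting penalizes larger disagreements more heavily than adjacent ones.
    print(cohen_kappa_score(rater_a, rater_b, weights="linear"))

The confidence interval reported in the abstract would typically be obtained by bootstrapping such paired grades.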
Collapse
Affiliation(s)
| | | | - Felicitas Allende
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
| | - José Rafael Garcia
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
| | - Salvador Ayala
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA
| | | | - Jorge Chahla
- Department of Orthopedics, Rush University Medical Center, Chicago, IL, USA.
| |
Collapse
|
23
|
Swisher AR, Wu AW, Liu GC, Lee MK, Carle TR, Tang DM. Enhancing Health Literacy: Evaluating the Readability of Patient Handouts Revised by ChatGPT's Large Language Model. Otolaryngol Head Neck Surg 2024; 171:1751-1757. [PMID: 39105460 DOI: 10.1002/ohn.927] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Revised: 07/03/2024] [Accepted: 07/20/2024] [Indexed: 08/07/2024]
Abstract
OBJECTIVE To use an artificial intelligence (AI)-powered large language model (LLM) to improve readability of patient handouts. STUDY DESIGN Review of online material modified by AI. SETTING Academic center. METHODS Five handout materials obtained from the American Rhinologic Society (ARS) and the American Academy of Facial Plastic and Reconstructive Surgery websites were assessed using validated readability metrics. The handouts were inputted into OpenAI's ChatGPT-4 after prompting: "Rewrite the following at a 6th-grade reading level." The understandability and actionability of both native and LLM-revised versions were evaluated using the Patient Education Materials Assessment Tool (PEMAT). Results were compared using Wilcoxon rank-sum tests. RESULTS The mean readability scores of the standard (ARS, American Academy of Facial Plastic and Reconstructive Surgery) materials corresponded to "difficult," with reading categories ranging between high school and university grade levels. Conversely, the LLM-revised handouts had an average seventh-grade reading level. LLM-revised handouts had better readability in nearly all metrics tested: Flesch-Kincaid Reading Ease (70.8 vs 43.9; P < .05), Gunning Fog Score (10.2 vs 14.42; P < .05), Simple Measure of Gobbledygook (9.9 vs 13.1; P < .05), Coleman-Liau (8.8 vs 12.6; P < .05), and Automated Readability Index (8.2 vs 10.7; P = .06). PEMAT scores were significantly higher in the LLM-revised handouts for understandability (91 vs 74%; P < .05) with similar actionability (42 vs 34%; P = .15) when compared to the standard materials. CONCLUSION Patient-facing handouts can be augmented by ChatGPT with simple prompting to tailor information with improved readability. This study demonstrates the utility of LLMs to aid in rewriting patient handouts and may serve as a tool to help optimize education materials. LEVEL OF EVIDENCE Level VI.
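The readability metrics reported above are standard formulas that can be reproduced programmatically. A minimal sketch, assuming the open-source textstat package and a hypothetical handout file:

    import textstat

    # Hypothetical file containing the plain text of one patient handout.
    text = open("handout.txt").read()

    # The five readability metrics compared in the study.
    print(textstat.flesch_reading_ease(text))
    print(textstat.gunning_fog(text))
    print(textstat.smog_index(text))
    print(textstat.coleman_liau_index(text))
    print(textstat.automated_readability_index(text))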
Collapse
Affiliation(s)
- Austin R Swisher
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Phoenix, Arizona, USA
| | - Arthur W Wu
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
| | - Gene C Liu
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
| | - Matthew K Lee
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
| | - Taylor R Carle
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
| | - Dennis M Tang
- Division of Otolaryngology-Head and Neck Surgery, Cedars-Sinai, Los Angeles, California, USA
| |
Collapse
|
24
|
McDarby M, Mroz EL, Hahne J, Malling CD, Carpenter BD, Parker PA. "Hospice Care Could Be a Compassionate Choice": ChatGPT Responses to Questions About Decision Making in Advanced Cancer. J Palliat Med 2024; 27:1618-1624. [PMID: 39263979 DOI: 10.1089/jpm.2024.0256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/13/2024] Open
Abstract
Background: Patients with cancer use the internet to inform medical decision making. Objective: To examine the content of ChatGPT responses to a hypothetical patient question about decision making in advanced cancer. Design: We developed a medical advice-seeking vignette in English about a patient with metastatic melanoma. When inputting this vignette, we varied five characteristics (patient age, race, ethnicity, insurance status, and preexisting recommendation of hospice/the opinion of an adult daughter regarding the recommendation). ChatGPT responses (N = 96) were coded for mentions of: hospice care, palliative care, financial implications of treatment, second opinions, clinical trials, discussing the decision with loved ones, and discussing the decision with care providers. We conducted additional analyses to understand how ChatGPT described hospice and referenced the adult daughter. Data were analyzed using descriptive statistics and chi-square analysis. Results: Responses more frequently mentioned clinical trials for vignettes describing 45-year-old patients compared with 65- and 85-year-old patients. When vignettes mentioned a preexisting recommendation for hospice, responses more frequently mentioned seeking a second opinion and hospice care. ChatGPT's descriptions of hospice focused primarily on its ability to provide comfort and support. When vignettes referenced the daughter's opinion on the hospice recommendation, approximately one third of responses also referenced this, stating the importance of talking to her about treatment preferences and values. Conclusion: ChatGPT responses to questions about advanced cancer decision making can be heterogeneous based on demographic and clinical characteristics. Findings underscore the possible impact of this heterogeneity on treatment decision making in patients with cancer.
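Differences in how often a coded element was mentioned across vignette versions lend themselves to a chi-square test of independence. A minimal sketch with hypothetical counts (not the study data), assuming SciPy:

    from scipy.stats import chi2_contingency

    # Hypothetical counts of responses mentioning clinical trials, by patient age in the vignette
    # (rows: 45, 65, 85 years; columns: mentioned, not mentioned).
    table = [[25, 7],
             [14, 18],
             [10, 22]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(chi2, p)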
Collapse
Affiliation(s)
- Meghan McDarby
- Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, USA
| | - Emily L Mroz
- Section of Geriatrics, Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Nell Hodgson Woodruff School of Nursing, Emory University, Atlanta, Georgia, USA
| | - Jessica Hahne
- Department of Psychological and Brain Sciences, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Charlotte D Malling
- Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, New York, USA
| | - Brian D Carpenter
- Department of Psychological and Brain Sciences, Washington University in St. Louis, St. Louis, Missouri, USA
| | - Patricia A Parker
- Department of Psychiatry and Behavioral Sciences, Memorial Sloan Kettering Cancer Center, New York, New York, USA
| |
Collapse
|
25
|
Ho CN, Tian T, Ayers AT, Aaron RE, Phillips V, Wolf RM, Mathioudakis N, Dai T, Klonoff DC. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Med Inform Decis Mak 2024; 24:357. [PMID: 39593074 PMCID: PMC11590327 DOI: 10.1186/s12911-024-02757-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2024] [Accepted: 11/08/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND Large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted a shift of attention toward their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". CONCLUSIONS The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.
Collapse
Affiliation(s)
- Cindy N Ho
- Diabetes Technology Society, Burlingame, CA, USA
| | - Tiffany Tian
- Diabetes Technology Society, Burlingame, CA, USA
| | | | | | - Vidith Phillips
- School of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Risa M Wolf
- Division of Pediatric Endocrinology, The Johns Hopkins Hospital, Baltimore, MD, USA
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
| | | | - Tinglong Dai
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
- Carey Business School, Johns Hopkins University, Baltimore, MD, USA
- School of Nursing, Johns Hopkins University, Baltimore, MD, USA
| | - David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, 100 South San Mateo Drive, Room 1165, San Mateo, CA, 94401, USA.
| |
Collapse
|
26
|
Ostrowska M, Kacała P, Onolememen D, Vaughan-Lane K, Sisily Joseph A, Ostrowski A, Pietruszewska W, Banaszewski J, Wróbel MJ. To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries. Eur Arch Otorhinolaryngol 2024; 281:6069-6081. [PMID: 38652298 PMCID: PMC11512842 DOI: 10.1007/s00405-024-08643-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2024] [Accepted: 03/26/2024] [Indexed: 04/25/2024]
Abstract
PURPOSE As online health information-seeking surges, concerns mount over the quality and safety of accessible content, potentially leading to patient harm through misinformation. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could help prevent such harm; on the other, questions arise regarding the quality and safety of the medical information provided. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer. METHODS A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The reviewers comprised three groups (ENT specialists, junior physicians, and non-medical reviewers), all of whom graded the responses. Each physician evaluated each question twice for each model, while non-medical reviewers evaluated each question once. All reviewers were blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations. RESULTS Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length. CONCLUSIONS LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.
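The moderate association between quality ratings and response length noted above corresponds to a standard Pearson correlation. A minimal sketch with placeholder values (not the study data), assuming SciPy:

    from scipy.stats import pearsonr

    # Placeholder values: word counts of LLM responses and mean Global Quality Scores (1-5).
    lengths = [120, 340, 210, 505, 180, 415]
    gqs = [3.0, 4.2, 3.5, 4.8, 3.1, 4.4]

    r, p = pearsonr(lengths, gqs)
    print(r, p)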
Collapse
Affiliation(s)
- Magdalena Ostrowska
- Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Paulina Kacała
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Deborah Onolememen
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Katie Vaughan-Lane
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland.
| | - Anitta Sisily Joseph
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Adam Ostrowski
- Department of Urology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Wioletta Pietruszewska
- Department of Otolaryngology, Laryngological Oncology, Audiology and Phoniatrics, Medical University of Lodz, ul Żeromskiego 113, 90-549, Lodz, Poland
| | - Jacek Banaszewski
- Department of Otolaryngology, Head and Neck Oncology, Poznan University of Medical Science, ul Przybyszewskiego 49, 60-355, Poznań, Poland
| | - Maciej J Wróbel
- Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| |
Collapse
|
27
|
Khaldi A, Machayekhi S, Salvagno M, Maniaci A, Vaira LA, La Via L, Taccone FS, Lechien JR. Accuracy of ChatGPT responses on tracheotomy for patient education. Eur Arch Otorhinolaryngol 2024; 281:6167-6172. [PMID: 39356355 DOI: 10.1007/s00405-024-08859-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2024] [Accepted: 07/18/2024] [Indexed: 10/03/2024]
Abstract
OBJECTIVE To investigate the accuracy of information provided by ChatGPT-4o to patients about tracheotomy. METHODS Twenty common patient questions about tracheotomy were presented to ChatGPT-4o twice, at a 7-day interval. The accuracy, clarity, relevance, completeness, referencing, and usefulness of the responses were assessed by a board-certified otolaryngologist and a board-certified intensive care unit practitioner with the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool. The inter-rater reliability and the stability of the ChatGPT-4o responses were evaluated with the intraclass correlation coefficient (ICC) and Pearson correlation analysis. RESULTS The total QAMAI scores were 22.85 ± 4.75 for the intensive care practitioner and 21.45 ± 3.95 for the otolaryngologist, which corresponds to moderate-to-high accuracy. Inter-rater reliability between the otolaryngologist and the ICU practitioner was high (ICC 0.807; 95% CI 0.655-0.911). The highest QAMAI scores were found for clarity and completeness of explanations, and the lowest for accuracy of the information and referencing. The information related to post-laryngectomy tracheostomy remains incomplete or erroneous. ChatGPT-4o did not provide references for its responses. The stability analysis showed high consistency across regenerated responses. CONCLUSION The accuracy of ChatGPT-4o is moderate-to-high in providing information related to tracheotomy. However, patients using ChatGPT-4o need to be cautious about the information related to tracheotomy care, steps, and the differences between temporary and permanent tracheotomies.
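The inter-rater reliability reported above is an intraclass correlation over paired QAMAI totals. A minimal sketch, assuming the pingouin package and simulated scores rather than the study data:

    import numpy as np
    import pandas as pd
    import pingouin as pg

    # Simulated long-format table: one QAMAI total score per question per rater.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "question": list(range(1, 21)) * 2,
        "rater": ["otolaryngologist"] * 20 + ["icu_practitioner"] * 20,
        "score": rng.integers(15, 30, size=40),
    })

    icc = pg.intraclass_corr(data=df, targets="question", raters="rater", ratings="score")
    print(icc[["Type", "ICC", "CI95%"]])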
Collapse
Affiliation(s)
- Amina Khaldi
- Intensive Care Unit, EpiCURA Hospital, Hornu, Belgium
| | | | | | - Antonino Maniaci
- Faculty of Medicine and Surgery, University of Enna Kore, Enna, 94100, Italy
| | - Luigi A Vaira
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Viale San Pietro 43/B, Sassari, 07100, Italy
| | - Luigi La Via
- Department of Anesthesia and Intensive Care, University Hospital Policlinico "G.Rodolico-San Marco", Catania, Italy
| | | | - Jerome R Lechien
- Department of Surgery, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium.
- Department of Otolaryngology, Elsan Polyclinic of Poitiers, Poitiers, France.
- Department of Otolaryngology-Head Neck Surgery, School of Medicine, UFR Simone Veil, Foch Hospital, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), Paris, France.
| |
Collapse
|
28
|
Aydin S, Karabacak M, Vlachos V, Margetis K. Large language models in patient education: a scoping review of applications in medicine. Front Med (Lausanne) 2024; 11:1477898. [PMID: 39534227 PMCID: PMC11554522 DOI: 10.3389/fmed.2024.1477898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2024] [Accepted: 10/03/2024] [Indexed: 11/16/2024] Open
Abstract
Introduction Large Language Models (LLMs) are sophisticated algorithms that analyze and generate vast amounts of textual data, mimicking human communication. Notable LLMs include GPT-4o by OpenAI, Claude 3.5 Sonnet by Anthropic, and Gemini by Google. This scoping review aims to synthesize the current applications and potential uses of LLMs in patient education and engagement. Materials and methods Following the PRISMA-ScR checklist and methodologies by Arksey, O'Malley, and Levac, we conducted a scoping review. We searched PubMed in June 2024, using keywords and MeSH terms related to LLMs and patient education. Two authors conducted the initial screening, and discrepancies were resolved by consensus. We employed thematic analysis to address our primary research question. Results The review identified 201 studies, predominantly from the United States (58.2%). Six themes emerged: generating patient education materials, interpreting medical information, providing lifestyle recommendations, supporting customized medication use, offering perioperative care instructions, and optimizing doctor-patient interaction. LLMs were found to provide accurate responses to patient queries, enhance existing educational materials, and translate medical information into patient-friendly language. However, challenges such as readability, accuracy, and potential biases were noted. Discussion LLMs demonstrate significant potential in patient education and engagement by creating accessible educational materials, interpreting complex medical information, and enhancing communication between patients and healthcare providers. Nonetheless, issues related to the accuracy and readability of LLM-generated content, as well as ethical concerns, require further research and development. Future studies should focus on improving LLMs and ensuring content reliability while addressing ethical considerations.
Collapse
Affiliation(s)
- Serhat Aydin
- School of Medicine, Koç University, Istanbul, Türkiye
| | - Mert Karabacak
- Department of Neurosurgery, Mount Sinai Health System, New York, NY, United States
| | - Victoria Vlachos
- College of Human Ecology, Cornell University, Ithaca, NY, United States
| | | |
Collapse
|
29
|
McMahon AK, Terry RS, Ito WE, Molina WR, Whiles BB. Battle of the bots: a comparative analysis of ChatGPT and bing AI for kidney stone-related questions. World J Urol 2024; 42:600. [PMID: 39470812 DOI: 10.1007/s00345-024-05326-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2024] [Accepted: 10/06/2024] [Indexed: 11/01/2024] Open
Abstract
OBJECTIVES To evaluate and compare the performance of ChatGPT™ (OpenAI®) and Bing AI™ (Microsoft®) for responding to kidney stone treatment-related questions in accordance with the American Urological Association (AUA) guidelines and assess factors such as appropriateness, emphasis on consulting healthcare providers, references, and adherence to guidelines by each chatbot. METHODS We developed 20 kidney stone evaluation and treatment-related questions based on the AUA Surgical Management of Stones guideline. Questions were asked to ChatGPT and Bing AI chatbots. We compared their responses utilizing the brief DISCERN tool as well as response appropriateness. RESULTS ChatGPT significantly outperformed Bing AI for questions 1-3, which evaluate the clarity, achievement, and relevance of responses (12.77 ± 1.71 vs. 10.17 ± 3.27; p < 0.01). In contrast, Bing AI always incorporated references, whereas ChatGPT never did. Consequently, the results for questions 4-6, which evaluated the quality of sources, consistently favored Bing AI over ChatGPT (10.8 vs. 4.28; p < 0.01). Notably, neither chatbot offered guidance against guidelines for pre-operative testing. However, recommendations against guidelines were notable for specific scenarios: 30.5% for the treatment of adults with ureteral stones, 52.5% for adults with renal stones, and 20.5% for all patient treatment. CONCLUSIONS ChatGPT significantly outperformed Bing AI in terms of providing responses with a clear aim, achieving that aim, and offering relevant and appropriate content based on AUA surgical stone management guidelines. However, Bing AI provides references, allowing information quality assessment. Additional studies are needed to further evaluate these chatbots and their potential use by clinicians and patients for urologic healthcare-related questions.
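Per-question scores from two chatbots, such as the brief DISCERN totals above, can be compared with a nonparametric rank test; the authors' exact statistical approach may differ. A minimal sketch with placeholder scores, assuming SciPy:

    from scipy.stats import mannwhitneyu

    # Placeholder per-question summed scores for DISCERN items 1-3 (clarity, achievement, relevance).
    chatgpt = [13, 12, 15, 11, 14, 13, 12, 15, 14, 13]
    bing_ai = [10, 9, 12, 8, 11, 10, 13, 9, 10, 11]

    stat, p = mannwhitneyu(chatgpt, bing_ai, alternative="two-sided")
    print(stat, p)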
Collapse
Affiliation(s)
- Amber K McMahon
- Department of Urology, University of Kansas Medical Center, Kansas City, KS, USA
| | - Russell S Terry
- Department of Urology, University of Florida College of Medicine, Gainesville, FL, USA
| | - Willian E Ito
- Department of Urology, University of Kansas Medical Center, Kansas City, KS, USA
| | - Wilson R Molina
- Department of Urology, University of Kansas Medical Center, Kansas City, KS, USA
| | - Bristol B Whiles
- Department of Urology, University of Kansas Medical Center, Kansas City, KS, USA.
| |
Collapse
|
30
|
Mete U. Evaluating the Performance of ChatGPT, Gemini, and Bing Compared with Resident Surgeons in the Otorhinolaryngology In-service Training Examination. Turk Arch Otorhinolaryngol 2024; 62:48-57. [PMID: 39463066 PMCID: PMC11572338 DOI: 10.4274/tao.2024.3.5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 05/07/2024] [Indexed: 10/29/2024] Open
Abstract
Objective Large language models (LLMs) are used in various fields for their ability to produce human-like text. They are particularly useful in medical education, aiding clinical management skills and exam preparation for residents. This study aimed to evaluate and compare the performance of ChatGPT (GPT-4), Gemini, and Bing with each other and with otorhinolaryngology residents in answering in-service training exam questions, and to provide insights into the usefulness of these models in medical education and healthcare. Methods Eight otorhinolaryngology in-service training exams were used for comparison. A total of 316 questions were prepared from the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery. These questions were presented to the three artificial intelligence models. The exam results were evaluated to determine the accuracy of both the models and the residents. Results GPT-4 achieved the highest accuracy among the LLMs at 54.75% (GPT-4 vs. Gemini p=0.002, GPT-4 vs. Bing p<0.001), followed by Gemini at 40.50% and Bing at 37.00% (Gemini vs. Bing p=0.327). However, senior residents outperformed all LLMs and other residents with an accuracy rate of 75.5% (p<0.001). The LLMs could only compete with junior residents. GPT-4 and Gemini performed similarly to juniors, whose accuracy level was 46.90% (p=0.058 and p=0.120, respectively). However, juniors still outperformed Bing (p=0.019). Conclusion The LLMs currently have limitations in achieving the same medical accuracy as senior and mid-level residents. However, their stronger performance in specific subspecialties indicates potential usefulness in certain medical fields.
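For illustration, pairwise accuracy rates like those above can be compared with a two-proportion z-test; the study's exact method may differ, and the counts below are only approximate reconstructions from the reported percentages of 316 questions. A minimal sketch assuming statsmodels:

    from statsmodels.stats.proportion import proportions_ztest

    # Approximate numbers of correctly answered questions out of 316 (GPT-4 ~54.75%, Gemini ~40.50%).
    correct = [173, 128]
    totals = [316, 316]

    stat, p = proportions_ztest(count=correct, nobs=totals)
    print(stat, p)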
Collapse
Affiliation(s)
- Utku Mete
- Bursa Uludağ University Faculty of Medicine Department of Otorhinolaryngology, Bursa, Türkiye
| |
Collapse
|
31
|
Carl N, Schramm F, Haggenmüller S, Kather JN, Hetz MJ, Wies C, Michel MS, Wessels F, Brinker TJ. Large language model use in clinical oncology. NPJ Precis Oncol 2024; 8:240. [PMID: 39443582 PMCID: PMC11499929 DOI: 10.1038/s41698-024-00733-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Accepted: 10/12/2024] [Indexed: 10/25/2024] Open
Abstract
Large language models (LLMs) are undergoing intensive research for various healthcare domains. This systematic review and meta-analysis assesses current applications, methodologies, and the performance of LLMs in clinical oncology. A mixed-methods approach was used to extract, summarize, and compare methodological approaches and outcomes. This review includes 34 studies. LLMs are primarily evaluated on their ability to answer oncologic questions across various domains. The meta-analysis highlights a significant performance variance, influenced by diverse methodologies and evaluation criteria. Furthermore, differences in inherent model capabilities, prompting strategies, and oncological subdomains contribute to heterogeneity. The lack of standardized, LLM-specific reporting protocols leads to methodological disparities, which must be addressed to ensure comparability in LLM research and ultimately enable the reliable integration of LLM technologies into clinical practice.
Collapse
Affiliation(s)
- Nicolas Carl
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Franziska Schramm
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Sarah Haggenmüller
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
| | - Martin J Hetz
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Christoph Wies
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Medical Faculty, Ruprecht-Karls University Heidelberg, Heidelberg, Germany
| | - Maurice Stephan Michel
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Frederik Wessels
- Department of Urology and Urological Surgery, University Medical Center Mannheim, Ruprecht-Karls University Heidelberg, Mannheim, Germany
| | - Titus J Brinker
- Department of Digital Prevention, Diagnostics and Therapy Guidance, German Cancer Research Center (DKFZ), Heidelberg, Germany.
| |
Collapse
|
32
|
Lechien JR. Generative AI and Otolaryngology-Head & Neck Surgery. Otolaryngol Clin North Am 2024; 57:753-765. [PMID: 38839556 DOI: 10.1016/j.otc.2024.04.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2024]
Abstract
The increasing development of artificial intelligence (AI) generative models in otolaryngology-head and neck surgery will progressively change our practice. Practitioners and patients have access to AI resources, improving information, knowledge, and practice of patient care. This article summarizes the currently investigated applications of AI generative models, particularly Chatbot Generative Pre-trained Transformer, in otolaryngology-head and neck surgery.
Collapse
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France; Division of Laryngology and Broncho-esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium; Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Paris Saclay University, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris, France; Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium.
| |
Collapse
|
33
|
Grossman S, Zerilli T, Nathan JP. Appropriateness of ChatGPT as a resource for medication-related questions. Br J Clin Pharmacol 2024; 90:2691-2695. [PMID: 39096130 DOI: 10.1111/bcp.16212] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2024] [Revised: 07/04/2024] [Accepted: 07/22/2024] [Indexed: 08/04/2024] Open
Abstract
Given ChatGPT's increasing popularity, healthcare professionals and patients may use it to obtain medication-related information. This study was conducted to assess ChatGPT's ability to provide satisfactory responses (i.e., directly answering the question and being accurate, complete, and relevant) to medication-related questions posed to an academic drug information service. ChatGPT responses were compared to responses generated by the investigators through the use of traditional resources, and references were evaluated. Thirty-nine questions were entered into ChatGPT; the three most common categories were therapeutics (8; 21%), compounding/formulation (6; 15%), and dosage (5; 13%). Ten (26%) questions were answered satisfactorily by ChatGPT. Of the 29 (74%) questions that were not answered satisfactorily, deficiencies included lack of a direct response (11; 38%), lack of accuracy (11; 38%), and/or lack of completeness (12; 41%). References were included with eight (29%) responses; each of these contained fabricated references. Presently, healthcare professionals and consumers should be cautioned against using ChatGPT for medication-related information.
Collapse
Affiliation(s)
- Sara Grossman
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
| | - Tina Zerilli
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
| | - Joseph P Nathan
- LIU Pharmacy, Arnold & Marie Schwartz College of Pharmacy and Health Sciences, Brooklyn, New York, USA
| |
Collapse
|
34
|
Irmici G, Cozzi A, Della Pepa G, De Berardinis C, D'Ascoli E, Cellina M, Cè M, Depretto C, Scaperrotta G. How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini. LA RADIOLOGIA MEDICA 2024; 129:1463-1467. [PMID: 39138732 DOI: 10.1007/s11547-024-01872-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Accepted: 08/01/2024] [Indexed: 08/15/2024]
Abstract
Applications of large language models (LLMs) in the healthcare field have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs (GPT-3.5, GPT-4, and Google Gemini, then called Bard) to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers significantly differed among LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60), followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). Across all LLMs, no significant differences were observed in the rates of correct replies to questions sourced from public databases and newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, although the models will need to be further refined through in-context training.
Collapse
Affiliation(s)
- Giovanni Irmici
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy.
| | - Andrea Cozzi
- Imaging Institute of Southern Switzerland (IIMSI), Ente Ospedaliero Cantonale (EOC), Lugano, Switzerland
| | - Gianmarco Della Pepa
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Claudia De Berardinis
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Elisa D'Ascoli
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Michaela Cellina
- Radiology Department, ASST Fatebenefratelli Sacco, Milano, Italy
| | - Maurizio Cè
- Postgraduation School in Radiodiagnostics, Università degli Studi di Milano, Milano, Italy
| | - Catherine Depretto
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| | - Gianfranco Scaperrotta
- Breast Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori, Via Giacomo Venezian 1, 20133, Milano, Italy
| |
Collapse
|
35
|
Gargari OK, Fatehi F, Mohammadi I, Firouzabadi SR, Shafiee A, Habibi G. Diagnostic accuracy of large language models in psychiatry. Asian J Psychiatr 2024; 100:104168. [PMID: 39111087 DOI: 10.1016/j.ajp.2024.104168] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/27/2024] [Revised: 07/20/2024] [Accepted: 07/22/2024] [Indexed: 09/13/2024]
Abstract
INTRODUCTION Medical decision-making is crucial for effective treatment, especially in psychiatry, where diagnosis often relies on subjective patient reports and is complicated by a lack of high-specificity symptoms. Artificial intelligence (AI), particularly Large Language Models (LLMs) like GPT, has emerged as a promising tool to enhance diagnostic accuracy in psychiatry. This comparative study explores the diagnostic capabilities of several AI models, including Aya, GPT-3.5, GPT-4, GPT-3.5 clinical assistant (CA), Nemotron, and Nemotron CA, using clinical cases from the DSM-5. METHODS We curated 20 clinical cases from the DSM-5 Clinical Cases book, covering a wide range of psychiatric diagnoses. Four advanced AI models (GPT-3.5 Turbo, GPT-4, Aya, Nemotron) were tested using prompts to elicit detailed diagnoses and reasoning. The models' performances were evaluated based on accuracy and quality of reasoning, with additional analysis using the Retrieval Augmented Generation (RAG) methodology for models accessing the DSM-5 text. RESULTS The AI models showed varied diagnostic accuracy, with GPT-3.5 and GPT-4 performing notably better than Aya and Nemotron in terms of both accuracy and reasoning quality. While the models struggled with specific disorders, such as cyclothymic and disruptive mood dysregulation disorders, they excelled at others, particularly psychotic and bipolar disorders. Statistical analysis highlighted significant differences in accuracy and reasoning, emphasizing the superiority of the GPT models. DISCUSSION The application of AI in psychiatry offers potential improvements in diagnostic accuracy. The superior performance of the GPT models can be attributed to their advanced natural language processing capabilities and extensive training on diverse text data, enabling more effective interpretation of psychiatric language. However, models like Aya and Nemotron showed limitations in reasoning, indicating a need for further refinement in their training and application. CONCLUSION AI holds significant promise for enhancing psychiatric diagnostics, with certain models demonstrating high potential in interpreting complex clinical descriptions accurately. Future research should focus on expanding the dataset and integrating multimodal data to further enhance the diagnostic capabilities of AI in psychiatry.
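For readers unfamiliar with the Retrieval Augmented Generation (RAG) methodology mentioned above, the core idea is to retrieve the reference passages most relevant to a case and prepend them to the model's prompt. A generic sketch of the retrieval step follows; it is not the authors' implementation, and the passages, case text, and embedding model are illustrative assumptions:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    # Illustrative stand-ins for DSM-5 criteria passages (placeholders, not the actual text).
    passages = [
        "Diagnostic criteria for bipolar I disorder: at least one manic episode ...",
        "Diagnostic criteria for cyclothymic disorder: numerous hypomanic and depressive symptoms ...",
    ]
    case = "A 28-year-old reports two years of alternating mildly elevated and depressed mood ..."

    # Embed passages and the case, then retrieve the most similar passage by cosine similarity.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    p_emb = model.encode(passages)
    c_emb = model.encode([case])[0]
    sims = p_emb @ c_emb / (np.linalg.norm(p_emb, axis=1) * np.linalg.norm(c_emb))
    retrieved = passages[int(np.argmax(sims))]

    # The retrieved text is prepended to the diagnostic prompt that is sent to the LLM.
    prompt = f"Reference material:\n{retrieved}\n\nCase:\n{case}\n\nGive the most likely DSM-5 diagnosis and your reasoning."
    print(prompt)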
Collapse
Affiliation(s)
- Omid Kohandel Gargari
- Farzan Artificial Intelligence Team, Farzan Clinical Research Institute, Tehran, Islamic Republic of Iran
| | - Farhad Fatehi
- Centre for Health Services Research, Faculty of Medicine, The University of Queensland, Brisbane, Australia; School of Psychological Sciences, Monash University, Melbourne, Australia
| | - Ida Mohammadi
- Farzan Artificial Intelligence Team, Farzan Clinical Research Institute, Tehran, Islamic Republic of Iran
| | - Shahryar Rajai Firouzabadi
- Farzan Artificial Intelligence Team, Farzan Clinical Research Institute, Tehran, Islamic Republic of Iran
| | - Arman Shafiee
- Farzan Artificial Intelligence Team, Farzan Clinical Research Institute, Tehran, Islamic Republic of Iran
| | - Gholamreza Habibi
- Farzan Artificial Intelligence Team, Farzan Clinical Research Institute, Tehran, Islamic Republic of Iran.
| |
Collapse
|
36
|
Tam TYC, Sivarajkumar S, Kapoor S, Stolyar AV, Polanska K, McCarthy KR, Osterhoudt H, Wu X, Visweswaran S, Fu S, Mathur P, Cacciamani GE, Sun C, Peng Y, Wang Y. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit Med 2024; 7:258. [PMID: 39333376 PMCID: PMC11437138 DOI: 10.1038/s41746-024-01258-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2024] [Accepted: 09/11/2024] [Indexed: 09/29/2024] Open
Abstract
With generative artificial intelligence (GenAI), particularly large language models (LLMs), continuing to make inroads in healthcare, assessing LLMs with human evaluations is essential to ensuring safety and effectiveness. This study reviews existing literature on human evaluation methodologies for LLMs in healthcare across various medical specialties and addresses factors such as evaluation dimensions, sample types and sizes, selection and recruitment of evaluators, frameworks and metrics, evaluation process, and type of statistical analysis. Our literature review of 142 studies shows gaps in the reliability, generalizability, and applicability of current human evaluation practices. To overcome such significant obstacles to healthcare LLM development and deployment, we propose QUEST, a comprehensive and practical framework for human evaluation of LLMs covering three phases of workflow: Planning, Implementation and Adjudication, and Scoring and Review. QUEST is designed around five proposed evaluation principles: Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence.
Collapse
Affiliation(s)
- Thomas Yu Chow Tam
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | | | - Sumit Kapoor
- Department of Critical Care Medicine, University of Pittsburgh Medical Center, Pittsburgh, PA, USA
| | - Alisa V Stolyar
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | - Katelyn Polanska
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | - Karleigh R McCarthy
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | - Hunter Osterhoudt
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | - Xizhi Wu
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA
| | - Shyam Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA
| | - Sunyang Fu
- Department of Clinical and Health Informatics, Center for Translational AI Excellence and Applications in Medicine, University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Piyush Mathur
- Department of Anesthesiology, Cleveland Clinic, Cleveland, OH, USA
- BrainX AI ReSearch, BrainX LLC, Cleveland, OH, USA
| | - Giovanni E Cacciamani
- Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA
| | - Cong Sun
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Yanshan Wang
- Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, USA.
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA.
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA.
- Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA, USA.
- Hillman Cancer Center, University of Pittsburgh Medical Center, Pittsburgh, PA, USA.
| |
Collapse
|
37
|
Alami K, Willemse E, Quiriny M, Lipski S, Laurent C, Donquier V, Digonnet A. Evaluation of ChatGPT-4's Performance in Therapeutic Decision-Making During Multidisciplinary Oncology Meetings for Head and Neck Squamous Cell Carcinoma. Cureus 2024; 16:e68808. [PMID: 39376890 PMCID: PMC11456411 DOI: 10.7759/cureus.68808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/04/2024] [Indexed: 10/09/2024] Open
Abstract
Objectives First reports suggest that artificial intelligence (AI) tools such as ChatGPT-4 (OpenAI, San Francisco, CA, USA) might represent reliable aids for therapeutic decisions in some medical conditions. This study aims to assess the decisional capacity of ChatGPT-4 in patients with head and neck carcinomas, using the multidisciplinary oncology meeting (MOM) and the National Comprehensive Cancer Network (NCCN) decisions as references. Methods This retrospective study included 263 patients with squamous cell carcinoma of the oral cavity, oropharynx, hypopharynx, and larynx who were followed at our institution between January 1, 2016, and December 31, 2021. ChatGPT-4's recommendations for first- and second-line treatment were compared with the MOM decisions and the NCCN guidelines. Agreement was quantified using the kappa statistic, which measures concordance between two evaluators beyond chance. Results ChatGPT-4 showed moderate agreement with MOM decisions for first-line treatment recommendations (Kappa = 0.48) and substantial agreement for second-line recommendations (Kappa = 0.78). Substantial agreement with the NCCN guidelines was observed for both first- and second-line treatments (Kappa = 0.72 and 0.66, respectively). Agreement decreased when the decision involved gastrostomy, patients over 70 years of age, or patients with comorbidities. Conclusions The study illustrates that while ChatGPT-4 can meaningfully support clinical decision-making in oncology by aligning closely with expert recommendations and established guidelines, ongoing enhancement and training are crucial. The findings advocate for the continued evolution of AI tools to better handle the nuanced aspects of patient health profiles, thus broadening their applicability and reliability in clinical practice.
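Since the abstract's central statistic is the kappa agreement coefficient, here is a minimal, hypothetical Python sketch of that calculation using scikit-learn; the treatment labels are invented placeholders, not data from the study.

```python
# Hypothetical sketch: Cohen's kappa between tumor-board (MOM) decisions and
# ChatGPT-4 recommendations. The labels below are invented placeholders.
from sklearn.metrics import cohen_kappa_score

mom_decisions  = ["surgery", "chemoradiation", "radiotherapy", "surgery", "chemoradiation"]
gpt4_decisions = ["surgery", "chemoradiation", "surgery",      "surgery", "radiotherapy"]

kappa = cohen_kappa_score(mom_decisions, gpt4_decisions)
# Landis-Koch interpretation: 0.41-0.60 moderate, 0.61-0.80 substantial agreement.
print(f"Cohen's kappa: {kappa:.2f}")
```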
Collapse
Affiliation(s)
- Kenza Alami
- Otolaryngology, Jules Bordet Institute, Bruxelles, BEL
| | | | - Marie Quiriny
- Surgical Oncology, Jules Bordet Institute, Bruxelles, BEL
| | - Samuel Lipski
- Surgical Oncology, Jules Bordet Institute, Bruxelles, BEL
| | - Celine Laurent
- Otolaryngology - Head and Neck Surgery, Hôpital Ambroise-Paré, Mons, BEL
- Otolaryngology - Head and Neck Surgery, Hôpital Universitaire de Bruxelles (HUB) Erasme Hospital, Bruxelles, BEL
| | | | | |
Collapse
|
38
|
Lechien JR, Rameau A. Applications of ChatGPT in Otolaryngology-Head Neck Surgery: A State of the Art Review. Otolaryngol Head Neck Surg 2024; 171:667-677. [PMID: 38716790 DOI: 10.1002/ohn.807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 04/01/2024] [Accepted: 04/19/2024] [Indexed: 08/28/2024]
Abstract
OBJECTIVE To review the current literature on the application, accuracy, and performance of Chatbot Generative Pre-Trained Transformer (ChatGPT) in Otolaryngology-Head and Neck Surgery. DATA SOURCES PubMed, Cochrane Library, and Scopus. REVIEW METHODS A comprehensive review of the literature on the applications of ChatGPT in otolaryngology was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses statement. CONCLUSIONS ChatGPT provides imperfect patient information or general knowledge related to diseases encountered in Otolaryngology-Head and Neck Surgery. In clinical practice, despite suboptimal performance, studies reported that the model is more accurate in providing diagnoses than in suggesting the most appropriate additional examinations and treatments for clinical vignettes or real clinical cases. ChatGPT has been used as an adjunct tool to improve scientific reports (referencing, spelling correction), to develop study protocols, and to take student or resident examinations, with varying levels of accuracy reported. The stability of ChatGPT responses across repeated questions appeared high, but many studies reported hallucination events, particularly in the provision of scientific references. IMPLICATIONS FOR PRACTICE To date, most applications of ChatGPT are limited to generating disease or treatment information and to supporting the management of clinical cases. The lack of comparison of ChatGPT's performance with that of other large language models is the main limitation of the current research. Its ability to analyze clinical images has not yet been investigated in otolaryngology, although images of the upper airway tract and ear are an important element in the diagnosis of most common ear, nose, and throat conditions. This review may help otolaryngologists conceive new applications for further research.
Collapse
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France
- Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris Saclay University, Paris, France
- Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium
| | - Anais Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, New York City, New York, USA
| |
Collapse
|
39
|
Hassona Y, Alqaisi D, Al-Haddad A, Georgakopoulou EA, Malamos D, Alrashdan MS, Sawair F. How good is ChatGPT at answering patients' questions related to early detection of oral (mouth) cancer? Oral Surg Oral Med Oral Pathol Oral Radiol 2024; 138:269-278. [PMID: 38714483 DOI: 10.1016/j.oooo.2024.04.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Revised: 03/22/2024] [Accepted: 04/14/2024] [Indexed: 05/10/2024]
Abstract
OBJECTIVES To examine the quality, reliability, readability, and usefulness of ChatGPT in promoting the early detection of oral cancer. STUDY DESIGN A total of 108 patient-oriented questions about early detection of oral cancer were compiled from expert panels, professional societies, and web-based tools. Questions were categorized into 4 topic domains, and each question was posed to ChatGPT 3.5 independently. ChatGPT's answers were evaluated for quality, readability, actionability, and usefulness. Two experienced reviewers independently assessed each response. RESULTS Questions related to clinical appearance constituted 36.1% (n = 39) of the total. ChatGPT provided "very useful" responses to the majority of questions (75%; n = 81). The mean Global Quality Score was 4.24 ± 1.3 out of 5. The mean reliability score was 23.17 ± 9.87 out of 25. The mean understandability score was 76.6% ± 25.9%, while the mean actionability score was 47.3% ± 18.9%. The mean FKS reading ease score was 38.4 ± 29.9, and the mean SMOG index readability score was 11.65 ± 8.4. No misleading information was identified among ChatGPT's responses. CONCLUSION ChatGPT is an attractive and potentially useful resource for informing patients about the early detection of oral cancer. Nevertheless, concerns remain about the readability and actionability of the information it offers.
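Because the abstract leans on readability metrics (Flesch reading ease and the SMOG grade), the short sketch below shows how such scores can be computed with the third-party textstat package; the paper does not state which tool it used, and the sample response text is invented.

```python
# Sketch: readability metrics of the kind reported above, computed with the
# third-party `textstat` package (an assumption; the study's tooling is not stated).
import textstat

response = (
    "Oral cancer can appear as a sore or lump in the mouth that does not heal. "
    "Other warning signs include a red or white patch and persistent numbness. "
    "See a dentist promptly if a mouth ulcer lasts longer than two weeks."
)

print("Flesch Reading Ease:", textstat.flesch_reading_ease(response))
print("SMOG Index (grade level):", textstat.smog_index(response))
```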
Collapse
Affiliation(s)
- Yazan Hassona
- Faculty of Dentistry, Centre for Oral Diseases Studies (CODS), Al-Ahliyya Amman University, Jordan; School of Dentistry, The University of Jordan, Jordan.
| | - Dua'a Alqaisi
- School of Dentistry, The University of Jordan, Jordan
| | | | - Eleni A Georgakopoulou
- Molecular Carcinogenesis Group, Department of Histology and Embryology, Medical School, National and Kapodistrian University of Athens, Greece
| | - Dimitris Malamos
- Oral Medicine Clinic of the National Organization for the Provision of Health, Athens, Greece
| | - Mohammad S Alrashdan
- Department of Oral and Craniofacial Health Sciences, College of Dental Medicine, University of Sharjah, Sharjah, United Arab Emirates
| | - Faleh Sawair
- School of Dentistry, The University of Jordan, Jordan
| |
Collapse
|
40
|
Cornelison BR, Erstad BL, Edwards C. Accuracy of a chatbot in answering questions that patients should ask before taking a new medication. J Am Pharm Assoc (2003) 2024; 64:102110. [PMID: 38670493 DOI: 10.1016/j.japh.2024.102110] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/16/2024] [Accepted: 04/19/2024] [Indexed: 04/28/2024]
Abstract
BACKGROUND The potential uses of artificial intelligence have extended into the fields of health care delivery and education. However, challenges are associated with introducing innovative technologies into health care, particularly with respect to information quality. OBJECTIVE This study aimed to evaluate the accuracy of answers provided by a chatbot in response to questions that patients should ask before taking a new medication. METHODS Twelve questions obtained from the Agency for Healthcare Research and Quality were posed to a chatbot for each of the top 20 drugs. Two reviewers independently evaluated and rated each response on a 6-point scale for accuracy and a 3-point scale for completeness, with a score of 2 considered adequate. Accuracy was determined using clinical expertise and a drug information database. After the independent reviews, answers were compared, and discrepancies were assigned a consensus score. RESULTS Of 240 responses, 222 (92.5%) were assessed as completely accurate. Of the remaining responses, 10 (4.2%) were mostly accurate, 5 (2.1%) were more accurate than inaccurate, 2 (0.8%) were equal parts accurate and inaccurate, and 1 (0.4%) was more inaccurate than accurate. Of the 240 responses, 194 (80.8%) were comprehensively complete, and 235 (97.9%) scored 2 or higher for completeness; five responses (2.1%) were considered incomplete. CONCLUSION Using a chatbot to answer questions commonly asked by patients is mostly accurate but may include inaccurate information or lack information valuable to patients.
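The dual-reviewer workflow described in the methods, independent scoring followed by consensus on discrepancies, can be sketched in a few lines of Python; the example scores, the assumption that 6 denotes "completely accurate," and the consensus rule are all illustrative, not taken from the study.

```python
# Illustrative sketch of dual-reviewer scoring with consensus resolution.
# The example values and the consensus handling are invented for demonstration;
# the 6-point accuracy scale follows the abstract, but its anchors are assumed.
def final_score(reviewer1: int, reviewer2: int, consensus: int | None = None) -> int:
    """Return the agreed score, falling back to a consensus score when reviewers differ."""
    if reviewer1 == reviewer2:
        return reviewer1
    if consensus is None:
        raise ValueError("Discrepant ratings require a consensus score")
    return consensus

accuracy_scores = [final_score(6, 6), final_score(5, 6, consensus=5), final_score(6, 6)]
share_completely_accurate = sum(s == 6 for s in accuracy_scores) / len(accuracy_scores)
print(f"Completely accurate: {share_completely_accurate:.1%}")
```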
Collapse
|
41
|
Hassona Y, Alqaisi DA. "My kid has autism": An interesting conversation with ChatGPT. Spec Care Dentist 2024; 44:1296-1299. [PMID: 38415857 DOI: 10.1111/scd.12983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2024] [Revised: 02/11/2024] [Accepted: 02/16/2024] [Indexed: 02/29/2024]
Affiliation(s)
- Yazan Hassona
- Faculty of Dentistry, Centre for Oral Diseases Studies, Al-Ahliyya Amman University, Amman, Jordan
- School of Dentistry, The University of Jordan, Amman, Jordan
| | - Dua A Alqaisi
- School of Dentistry, The University of Jordan, Amman, Jordan
| |
Collapse
|
42
|
Ahmed SK. The future of oral cancer care: Integrating ChatGPT into clinical practice. Oral Oncol Rep 2024; 10:100317. [DOI: 10.1016/j.oor.2024.100317] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/12/2025]
|
43
|
Blasingame MN, Koonce TY, Williams AM, Giuse DA, Su J, Krump PA, Giuse NB. Evaluating a Large Language Model's Ability to Answer Clinicians' Requests for Evidence Summaries. medRxiv 2024:2024.05.01.24306691. [PMID: 38746273 PMCID: PMC11092721 DOI: 10.1101/2024.05.01.24306691] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/16/2024]
Abstract
Objective This study investigated the performance of a generative artificial intelligence (AI) tool using GPT-4 in answering clinical questions, in comparison with medical librarians' gold-standard evidence syntheses. Methods Questions were extracted from an in-house database of clinical evidence requests previously answered by medical librarians. Questions with multiple parts were subdivided into individual topics. A standardized prompt was developed using the COSTAR framework. Librarians submitted each question to aiChat, an internally managed chat tool using GPT-4, and recorded the responses. The summaries generated by aiChat were evaluated on whether they contained the critical elements of the librarian's established gold-standard summary. A subset of questions was randomly selected for verification of the references provided by aiChat. Results Of the 216 evaluated questions, aiChat's response was assessed as "correct" for 180 (83.3%) questions, "partially correct" for 35 (16.2%) questions, and "incorrect" for 1 (0.5%) question. No significant differences were observed in question ratings by question category (p = 0.39). For a subset of 30% (n = 66) of questions, 162 references were provided in the aiChat summaries, of which 60 (37%) could be confirmed as genuine (not fabricated). Conclusions Overall, the performance of the generative AI tool was promising. However, many included references could not be independently verified, and no attempt was made to assess whether additional concepts introduced by aiChat were factually accurate. We therefore envision this as the first in a series of investigations designed to further our understanding of how current and future versions of generative AI can be used and integrated into medical librarians' workflow.
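The methods mention that the standardized prompt followed the COSTAR framework, commonly expanded as Context, Objective, Style, Tone, Audience, and Response format. The template below is an invented illustration of such a prompt, not the actual prompt submitted to aiChat.

```python
# Invented illustration of a COSTAR-style prompt; not the study's actual prompt.
costar_prompt = """\
# CONTEXT
You are assisting a medical librarian who answers clinicians' evidence requests.

# OBJECTIVE
Summarize the current evidence on the clinical question below, citing primary studies.

# STYLE
A concise evidence synthesis, similar to a librarian's structured summary.

# TONE
Neutral and professional.

# AUDIENCE
Practicing clinicians at an academic medical center.

# RESPONSE
A short narrative summary followed by a numbered reference list.

Clinical question: {question}
"""

print(costar_prompt.format(question="Does melatonin reduce delirium in adult ICU patients?"))
```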
Collapse
|
44
|
Lee TJ, Rao AK, Campbell DJ, Radfar N, Dayal M, Khrais A. Evaluating ChatGPT-3.5 and ChatGPT-4.0 Responses on Hyperlipidemia for Patient Education. Cureus 2024; 16:e61067. [PMID: 38803402 PMCID: PMC11128363 DOI: 10.7759/cureus.61067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/25/2024] [Indexed: 05/29/2024] Open
Abstract
Introduction Hyperlipidemia is prevalent worldwide and affects a significant number of US adults. It contributes substantially to ischemic heart disease and to millions of deaths annually. With the increasing use of the internet for health information, tools like ChatGPT (OpenAI, San Francisco, CA, USA) have gained traction. ChatGPT version 4.0, launched in March 2023, offers enhanced features over its predecessor but requires a monthly fee. This study compares the accuracy, comprehensibility, and response length of the free and paid versions of ChatGPT for patient education on hyperlipidemia. Materials and methods ChatGPT versions 3.5 and 4.0 were each asked 25 questions from the Cleveland Clinic's frequently asked questions (FAQs) on hyperlipidemia under three prompting conditions: no prompting (Form 1), patient-friendly prompting (Form 2), and physician-level prompting (Form 3). Responses were categorized as incorrect, partially correct, or correct. Additionally, the grade level and word count of each response were recorded for analysis. Results Scoring frequencies for ChatGPT version 3.5 were: five (6.67%) incorrect, 18 (24.00%) partially correct, and 52 (69.33%) correct. Scoring frequencies for ChatGPT version 4.0 were: one (1.33%) incorrect, 18 (24.00%) partially correct, and 56 (74.67%) correct. Correct answers did not differ significantly between ChatGPT version 3.5 and ChatGPT version 4.0 (p = 0.586). ChatGPT version 3.5 had a significantly higher grade reading level than version 4.0 (p = 0.0002) and a significantly higher word count (p = 0.0073). Discussion There was no significant difference in accuracy between the free and paid versions when answering hyperlipidemia FAQs. Both versions provided accurate but sometimes only partially complete responses. Version 4.0 offered more concise and readable information, aligning with the readability of most online medical resources despite exceeding the National Institutes of Health's (NIH's) recommended eighth-grade reading level. The paid version demonstrated superior adaptability in tailoring responses based on the prompt. Conclusion Both versions of ChatGPT provide reliable medical information, with the paid version offering more adaptable and readable responses. Healthcare providers can recommend ChatGPT as a source of patient education, regardless of the version used. Future research should explore diverse question formulations and ChatGPT's handling of incorrect information.
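Since the abstract compares categorical scoring frequencies between the two versions, the sketch below shows one way such a comparison could be run as a chi-square test of independence on the reported counts; the abstract does not state which statistical tests produced its p-values, so this is illustrative only.

```python
# Sketch: chi-square test of independence on the scoring frequencies reported above.
# The counts come from the abstract; the choice of test is an assumption, since the
# study does not state which test produced its p-values.
from scipy.stats import chi2_contingency

#          incorrect  partial  correct
counts = [
    [5, 18, 52],   # ChatGPT 3.5 (75 responses)
    [1, 18, 56],   # ChatGPT 4.0 (75 responses)
]
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```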
Collapse
Affiliation(s)
- Thomas J Lee
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Abhinav K Rao
- Department of Medicine, Trident Medical Center, Charleston, USA
| | - Daniel J Campbell
- Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospital, Philadelphia, USA
| | - Navid Radfar
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Manik Dayal
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| | - Ayham Khrais
- Department of Medicine, Rutgers University New Jersey Medical School, Newark, USA
| |
Collapse
|
45
|
Haase I, Xiong T, Rissmann A, Knitza J, Greenfield J, Krusche M. ChatSLE: consulting ChatGPT-4 for 100 frequently asked lupus questions. Lancet Rheumatol 2024; 6:e196-e199. [PMID: 38508817 DOI: 10.1016/s2665-9913(24)00056-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/29/2024] [Revised: 02/18/2024] [Accepted: 02/19/2024] [Indexed: 03/22/2024]
Affiliation(s)
- Isabell Haase
- University Medical Center Hamburg-Eppendorf, Division of Rheumatology and Systemic Inflammatory Diseases, III Department of Medicine, 20246 Hamburg, Germany.
| | - Tingting Xiong
- University Medical Center Hamburg-Eppendorf, Division of Rheumatology and Systemic Inflammatory Diseases, III Department of Medicine, 20246 Hamburg, Germany
| | - Antonia Rissmann
- University Medical Center Hamburg-Eppendorf, Division of Rheumatology and Systemic Inflammatory Diseases, III Department of Medicine, 20246 Hamburg, Germany
| | - Johannes Knitza
- Institute for Digital Medicine, University Hospital Marburg, Philipps-University Marburg, Marburg, Germany
| | - Julia Greenfield
- Institute for Digital Medicine, University Hospital Marburg, Philipps-University Marburg, Marburg, Germany
| | - Martin Krusche
- University Medical Center Hamburg-Eppendorf, Division of Rheumatology and Systemic Inflammatory Diseases, III Department of Medicine, 20246 Hamburg, Germany
| |
Collapse
|
46
|
Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Cuny C, Snijders JP, Ernst BP, Blaikie A, Kelsey T, Kuhn S, Eckrich J. Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology. Acta Otolaryngol 2024; 144:237-242. [PMID: 38781053 DOI: 10.1080/00016489.2024.2352843] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Accepted: 05/03/2024] [Indexed: 05/25/2024]
Abstract
BACKGROUND Large language models (LLMs) might offer a solution to the shortage of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear. AIMS/OBJECTIVES Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL). MATERIAL AND METHODS Case-based questions were extracted from the literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert scale for medical adequacy, comprehensibility, coherence, and conciseness. Answers were also compared with validated solutions and evaluated for potential hazards. A modified Turing test was performed, and character counts were compared. RESULTS LLM answers were rated inferior to consultants' answers in all categories. Yet the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among the LLMs, Claude 2 was rated best for medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246) of cases, ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for the consultants. CONCLUSIONS AND SIGNIFICANCE Despite the consultants' superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance at larger scale.
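For the kind of blinded 6-point Likert comparison described here, a non-parametric test is a natural choice; the sketch below applies a Mann-Whitney U test to invented adequacy ratings and is not a reproduction of the study's analysis.

```python
# Sketch: comparing 6-point Likert ratings of consultant vs. LLM answers with a
# Mann-Whitney U test. The ratings are invented, and the abstract does not state
# which statistical test the study applied.
from scipy.stats import mannwhitneyu

consultant_adequacy = [6, 5, 6, 6, 5, 6, 5, 6]
llm_adequacy        = [5, 5, 4, 6, 5, 4, 5, 5]

stat, p = mannwhitneyu(consultant_adequacy, llm_adequacy, alternative="two-sided")
print(f"U = {stat}, p = {p:.3f}")
```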
Collapse
Affiliation(s)
- Christoph R Buhr
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- School of Medicine, University of St Andrews, St Andrews, UK
| | - Harry Smith
- School of Computer Science, University of St Andrews, St Andrews, UK
| | - Tilman Huppertz
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
| | - Katharina Bahr-Hamm
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
| | - Christoph Matthias
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
| | - Clemens Cuny
- Outpatient Clinic, Clemens Cuny, Dieburg, Germany
| | | | | | - Andrew Blaikie
- School of Medicine, University of St Andrews, St Andrews, UK
| | - Tom Kelsey
- School of Computer Science, University of St Andrews, St Andrews, UK
| | - Sebastian Kuhn
- Institute for Digital Medicine, Philipps-University Marburg, University Hospital of Giessen and Marburg, Marburg, Germany
| | - Jonas Eckrich
- Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
| |
Collapse
|