1. Peters M, Le Clercq M, Yanni A, Vanden Eynden X, Martin L, Vanden Haute N, Tancredi S, De Passe C, Boutremans E, Lechien J, Dequanter D. ChatGPT and trainee performances in the management of maxillofacial patients. Journal of Stomatology, Oral and Maxillofacial Surgery 2025; 126:102090. [PMID: 39332706] [DOI: 10.1016/j.jormas.2024.102090]
Abstract
INTRODUCTION ChatGPT is an artificial intelligence-based large language model able to generate human-like responses to text input, and its performance has already been studied in several fields. The aim of this study was to evaluate the performance of ChatGPT in the management of maxillofacial clinical cases. MATERIALS AND METHODS A total of 38 clinical cases consulting at the Stomatology-Maxillofacial Surgery Department were prospectively recruited and presented to ChatGPT, which was interrogated for diagnosis, differential diagnosis, management and treatment. The performance of trainees and ChatGPT was compared by three blinded board-certified maxillofacial surgeons using the AIPI score. RESULTS The average total AIPI score was 18.71 for the practitioners and 16.39 for ChatGPT, which was significantly lower (p < 0.001). According to the experts, ChatGPT was significantly less effective for diagnosis and treatment (p < 0.001). For two of the three experts, ChatGPT was also significantly less effective in considering patient data (p = 0.001) and suggesting additional examinations (p < 0.0001). The primary diagnosis proposed by ChatGPT was judged not plausible and/or incomplete in 2.63% to 18% of the cases; the suggested additional examinations were judged inadequate in 2.63% to 21.05% of the cases; and the proposed therapeutic approach was pertinent but incomplete in 18.42% to 47.37% of the cases, while the therapeutic findings were considered pertinent, necessary and inadequate in 18.42% of cases. CONCLUSIONS ChatGPT appears less efficient in establishing the diagnosis, selecting the most adequate additional examinations and proposing pertinent and necessary therapeutic approaches.
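The abstract reports paired AIPI score comparisons between trainees and ChatGPT but does not name the statistical test used. A minimal sketch of how such a paired comparison might be run in Python is shown below; the simulated scores and the choice of the Wilcoxon signed-rank test are illustrative assumptions, not the study's actual data or method.

```python
# Hypothetical sketch: paired comparison of trainee vs. ChatGPT AIPI scores.
# The score values are simulated and the Wilcoxon signed-rank test is an
# assumption (the abstract reports p-values but not the test used).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_cases = 38                                   # number of clinical cases in the study
trainee_aipi = rng.normal(18.7, 1.5, n_cases)  # simulated trainee AIPI totals
chatgpt_aipi = rng.normal(16.4, 1.8, n_cases)  # simulated ChatGPT AIPI totals

stat, p_value = stats.wilcoxon(trainee_aipi, chatgpt_aipi)
print(f"Trainee mean: {trainee_aipi.mean():.2f}, ChatGPT mean: {chatgpt_aipi.mean():.2f}")
print(f"Wilcoxon signed-rank statistic: {stat:.1f}, p = {p_value:.4f}")
```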
Affiliation(s)
- Mélissa Peters
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium.
- Maxime Le Clercq
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Antoine Yanni
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Xavier Vanden Eynden
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Lalmand Martin
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Noémie Vanden Haute
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Szonja Tancredi
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Céline De Passe
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Edward Boutremans
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium
- Jerome Lechien
- Faculty of Medicine, Department of Human Anatomy and Experimental Oncology UMONS, Mons, Belgium; Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), Paris, France; Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France; Young Confederation of the European Oto-Rhino-Laryngological Head and Neck Surgery Societies (Y-CEORLHNS), Dublin, Ireland; Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Didier Dequanter
- Department of Stomatology, Oral & Maxillofacial Surgery, CHU Saint Pierre, Brussels, Belgium; Faculty of Medicine, Department of Human Anatomy and Experimental Oncology UMONS, Mons, Belgium

2. Hoch CC, Funk PF, Guntinas-Lichius O, Volk GF, Lüers JC, Hussain T, Wirth M, Schmidl B, Wollenberg B, Alfertshofer M. Harnessing advanced large language models in otolaryngology board examinations: an investigation using Python and application programming interfaces. Eur Arch Otorhinolaryngol 2025; 282:3317-3328. [PMID: 40281318] [PMCID: PMC12122622] [DOI: 10.1007/s00405-025-09404-x]
Abstract
PURPOSE This study aimed to explore the capabilities of advanced large language models (LLMs), including OpenAI's GPT-4 variants, Google's Gemini series, and Anthropic's Claude series, in addressing highly specialized otolaryngology board examination questions. Additionally, the study included a longitudinal assessment of GPT-3.5 Turbo, which was evaluated using the same set of questions one year ago to identify changes in its performance over time. METHODS We utilized a question bank comprising 2,576 multiple-choice and single-choice questions from a German online education platform tailored for otolaryngology board certification preparation. The questions were submitted to 11 different LLMs, including GPT-3.5 Turbo, GPT-4 variants, Gemini models, and Claude models, through Application Programming Interfaces (APIs) using Python scripts, facilitating efficient data collection and processing. RESULTS GPT-4o demonstrated the highest accuracy among all models, particularly excelling in categories such as allergology and head and neck tumors. While the Claude models showed competitive performance, they generally lagged behind the GPT-4 variants. A comparison of GPT-3.5 Turbo's performance revealed a significant decline in accuracy over the past year. Newer LLMs displayed varied performance levels, with single-choice questions consistently yielding higher accuracy than multiple-choice questions across all models. CONCLUSION While newer LLMs show strong potential in addressing specialized medical content, the observed decline in GPT-3.5 Turbo's performance over time underscores the necessity for continuous evaluation. This study highlights the critical need for ongoing optimization and efficient API usage to improve LLMs' potential for applications in medical education and certification.
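The methods describe submitting the question bank to the models through APIs with Python scripts. A minimal sketch of that kind of workflow using the OpenAI Python client is shown below; the model name, prompt wording, and example question are illustrative assumptions rather than the study's actual materials.

```python
# Hypothetical sketch of API-based question submission, loosely following the
# workflow described in the abstract. Model name, prompt, and question are
# illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mcq(question: str, options: dict[str, str], model: str = "gpt-4o") -> str:
    """Send one single-choice question and return the letter the model picks."""
    option_text = "\n".join(f"{key}) {text}" for key, text in options.items())
    prompt = (
        "Answer the following otolaryngology board-style question. "
        "Reply with the letter of the single best option only.\n\n"
        f"{question}\n{option_text}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[:1].upper()

# Example usage with a made-up question:
answer = ask_mcq(
    "Which nerve is most at risk during superficial parotidectomy?",
    {"A": "Hypoglossal nerve", "B": "Facial nerve", "C": "Vagus nerve", "D": "Lingual nerve"},
)
print(answer)
```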
Affiliation(s)
- Cosima C Hoch
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany.
- Paul F Funk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
- Orlando Guntinas-Lichius
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
- Gerd Fabian Volk
- Department of Otorhinolaryngology, Jena University Hospital, Friedrich-Schiller-University Jena, 07747, Jena, Germany
- Jan-Christoffer Lüers
- Department of Otorhinolaryngology, Head and Neck Surgery, Medical Faculty, University of Cologne, 50937, Cologne, Germany
- Timon Hussain
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
- Markus Wirth
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
- Benedikt Schmidl
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
- Barbara Wollenberg
- Department of Otolaryngology, Head and Neck Surgery, TUM School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
- Michael Alfertshofer
- Department of Oral and Maxillofacial Surgery, Institute of Health, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, 10117 Berlin, Germany

3. Omar M, Agbareia R, Glicksberg BS, Nadkarni GN, Klang E. Benchmarking the Confidence of Large Language Models in Answering Clinical Questions: Cross-Sectional Evaluation Study. JMIR Med Inform 2025; 13:e66917. [PMID: 40378406] [PMCID: PMC12101789] [DOI: 10.2196/66917]
Abstract
Background The capabilities of large language models (LLMs) to self-assess their own confidence in answering questions within the biomedical realm remain underexplored. Objective This study evaluates the confidence levels of 12 LLMs across 5 medical specialties to assess LLMs' ability to accurately judge their own responses. Methods We used 1965 multiple-choice questions that assessed clinical knowledge in the following areas: internal medicine, obstetrics and gynecology, psychiatry, pediatrics, and general surgery. Models were prompted to provide answers and to also provide their confidence for the correct answers (score range: 0%-100%). We calculated the correlation between each model's mean confidence score for correct answers and the overall accuracy of each model across all questions. The confidence scores for correct and incorrect answers were also analyzed to determine the mean difference in confidence, using 2-sample, 2-tailed t tests. Results The correlation between the mean confidence scores for correct answers and model accuracy was inverse and statistically significant (r=-0.40; P=.001), indicating that worse-performing models exhibited paradoxically higher confidence. For instance, a top-performing model, GPT-4o, had a mean accuracy of 74% (SD 9.4%), with a mean confidence of 63% (SD 8.3%), whereas a low-performing model, Qwen2-7B, showed a mean accuracy of 46% (SD 10.5%) but a mean confidence of 76% (SD 11.7%). The mean difference in confidence between correct and incorrect responses was low for all models, ranging from 0.6% to 5.4%, with GPT-4o having the highest mean difference (5.4%, SD 2.3%; P=.003). Conclusions Better-performing LLMs show more aligned overall confidence levels. However, even the most accurate models still show minimal variation in confidence between right and wrong answers. This may limit their safe use in clinical settings. Addressing overconfidence could involve refining calibration methods, performing domain-specific fine-tuning, and involving human oversight when decisions carry high risks. Further research is needed to improve these strategies before broader clinical adoption of LLMs.
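A minimal sketch of the type of calibration analysis described in the methods (confidence gap between correct and incorrect answers, and correlation of mean confidence with accuracy across models), using simulated rather than study data:

```python
# Hypothetical sketch of a confidence-calibration analysis with simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_questions = 1965

# Simulated per-question results for one model: 1 = correct, 0 = incorrect,
# plus a self-reported confidence score (0-100) for each answer.
correct = rng.integers(0, 2, n_questions)
confidence = np.clip(rng.normal(70, 12, n_questions), 0, 100)

# Mean confidence difference between correct and incorrect answers (2-sample t test).
t_stat, p_val = stats.ttest_ind(confidence[correct == 1], confidence[correct == 0])
print(f"Confidence gap t = {t_stat:.2f}, p = {p_val:.3f}")

# Across models: correlation between mean confidence on correct answers and accuracy.
model_accuracy = np.array([0.74, 0.62, 0.55, 0.46])     # illustrative accuracies
mean_conf_correct = np.array([63.0, 68.0, 71.0, 76.0])  # illustrative confidences
r, p = stats.pearsonr(mean_conf_correct, model_accuracy)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```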
Affiliation(s)
- Mahmud Omar
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States
- Reem Agbareia
- Ophthalmology Department, Hadassah Medical Center, Jerusalem, Israel
- Benjamin S Glicksberg
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States
- Girish N Nadkarni
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States
- Eyal Klang
- Division of Data-Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, Gustave L. Levy Place, New York, NY 10029, United States

4. Yitzhaki S, Peled N, Kaplan E, Kadmon G, Nahum E, Gendler Y, Weissbach A. Comparing ChatGPT-4 and a Paediatric Intensive Care Specialist in Responding to Medical Education Questions: A Multicenter Evaluation. J Paediatr Child Health 2025. [PMID: 40331496] [DOI: 10.1111/jpc.70080]
Abstract
OBJECTIVE To compare the performance of the Generative Pre-trained Transformer model 4 (ChatGPT-4) with that of a paediatric intensive care unit (PICU) specialist in responding to open-ended medical education questions. METHODS A comparative analysis was conducted using 100 educational questions sourced from a PICU trainee WhatsApp forum, covering factual knowledge and clinical reasoning. Ten PICU specialists from multiple tertiary paediatric centres independently evaluated 20 sets of paired responses from ChatGPT-4 and a PICU specialist (the original respondent to the forum questions), assessing overall superiority, completeness, accuracy, and integration potential. RESULTS After excluding one question requiring a visual aid, 198 paired evaluations were made (96 factual knowledge and 102 clinical reasoning). ChatGPT-4's responses were significantly longer than those of the PICU specialist (median words: 189 vs. 41; p < 0.0001). ChatGPT-4 was preferred in 60% of factual knowledge comparisons (p < 0.001), while the PICU specialist's responses were preferred in 67% of clinical reasoning comparisons (p < 0.0001). ChatGPT-4 demonstrated superior completeness in factual knowledge (p = 0.02) but lower accuracy in clinical reasoning (p < 0.0001). Integration of both answers was favoured in 37% of cases (95% CI, 31%-44%). CONCLUSIONS ChatGPT-4 shows promise as a tool for factual medical education in the PICU, excelling in completeness. However, it requires oversight in clinical reasoning tasks, where the PICU specialist's responses remain superior. Expert review is essential before using ChatGPT-4 independently in PICU education and in other similarly underexplored medical fields.
Affiliation(s)
- Shai Yitzhaki
- Department of Pediatrics A, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Nadav Peled
- Adelson School of Medicine, Ariel University, Ariel, Israel
- Eytan Kaplan
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
- Gili Kadmon
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
- Elhanan Nahum
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel
- Yulia Gendler
- Department of Nursing at the School of Health Sciences, Ariel University, Ariel, Israel
- Avichai Weissbach
- Faculty of Medical and Health Sciences, Tel Aviv University, Tel Aviv, Israel
- Pediatric Intensive Care Unit, Schneider Children's Medical Center of Israel, Petach Tikva, Israel

5. Chen TA, Lin KC, Lin MH, Chang HT, Chen YC, Chen TJ. While GPT-3.5 is unable to pass the Physician Licensing Exam in Taiwan, GPT-4 successfully meets the criteria. J Chin Med Assoc 2025; 88:352-360. [PMID: 40083047] [DOI: 10.1097/jcma.0000000000001225]
Abstract
BACKGROUND This study investigates the performance of ChatGPT-3.5 and ChatGPT-4 in answering medical questions from Taiwan's Physician Licensing Exam, ranging from basic medical knowledge to specialized clinical topics. It aims to understand these artificial intelligence (AI) models' capabilities in a non-English context, specifically traditional Chinese. METHODS The study incorporated questions from the Taiwan Physician Licensing Exam in 2022, excluding image-based queries. Each question was manually input into ChatGPT, and responses were compared with official answers from Taiwan's Ministry of Examination. Differences across specialties and question types were assessed using the Kruskal-Wallis and Fisher's exact tests. RESULTS ChatGPT-3.5 achieved an average accuracy of 67.7% in basic medical sciences and 53.2% in clinical medicine. ChatGPT-4 significantly outperformed ChatGPT-3.5, with average accuracies of 91.9% in basic medical sciences and 90.7% in clinical medicine. ChatGPT-3.5 scored above 60.0% in seven of 10 basic medical science subjects and three of 14 clinical subjects, while ChatGPT-4 scored above 60.0% in every subject. The type of question did not significantly affect accuracy rates. CONCLUSION ChatGPT-3.5 showed proficiency in basic medical sciences but was less reliable in clinical medicine, whereas ChatGPT-4 demonstrated strong capabilities in both areas. However, their proficiency varied across specialties. The type of question had minimal impact on performance. This study highlights the potential of AI models in medical education and non-English-language examinations, as well as the need for cautious and informed implementation in educational settings due to variability across specialties.
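A minimal sketch of the statistical comparisons named in the methods (Fisher's exact test and the Kruskal-Wallis test), applied to illustrative counts and accuracy values rather than the study's data:

```python
# Hypothetical sketch of the statistical comparisons named in the methods,
# using illustrative (not actual) counts.
from scipy import stats

# Fisher's exact test: GPT-3.5 vs GPT-4 correct/incorrect counts on one subject.
#                 correct  incorrect
table = [[42, 20],        # GPT-3.5
         [57, 5]]         # GPT-4
odds_ratio, p_fisher = stats.fisher_exact(table)
print(f"Fisher's exact: OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")

# Kruskal-Wallis test: per-subject accuracy rates grouped by question type.
recall_type = [0.70, 0.65, 0.72, 0.61]     # accuracy on recall-style questions
vignette_type = [0.55, 0.60, 0.58, 0.52]   # accuracy on clinical-vignette questions
h_stat, p_kw = stats.kruskal(recall_type, vignette_type)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```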
Affiliation(s)
- Tsung-An Chen
- Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Kuan-Chen Lin
- Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Ming-Hwai Lin
- Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Hsiao-Ting Chang
- Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Yu-Chun Chen
- Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- School of Medicine, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Big Data Center, Taipei Veterans General Hospital, Taipei, Taiwan, ROC
- Institute of Hospital and Health Care Administration, National Yang Ming Chiao Tung University, Taipei, Taiwan, ROC
- Department of Family Medicine, Taipei Veterans General Hospital Yuli Branch, Hualien, Taiwan, ROC
- Tzeng-Ji Chen
- Department of Family Medicine, Taipei Veterans General Hospital Hsinchu Branch, Hsinchu, Taiwan, ROC
- Department of Post-Baccalaureate Medicine, National Chung Hsing University, Taichung, Taiwan, ROC

6. Liu J, Gu J, Tong M, Yue Y, Qiu Y, Zeng L, Yu Y, Yang F, Zhao S. Evaluating the agreement between ChatGPT-4 and validated questionnaires in screening for anxiety and depression in college students: a cross-sectional study. BMC Psychiatry 2025; 25:359. [PMID: 40211256] [PMCID: PMC11983836] [DOI: 10.1186/s12888-025-06798-0]
Abstract
BACKGROUND The Chat Generative Pre-trained Transformer (ChatGPT), an artificial intelligence-based web application, has demonstrated substantial potential across various knowledge domains, particularly in medicine. This cross-sectional study assessed the validity and possible usefulness of ChatGPT-4 in screening for anxiety and depression by comparing ChatGPT-adapted questionnaires with validated ones. METHODS This study tasked ChatGPT-4 with generating a structured interview questionnaire based on the validated Patient Health Questionnaire-9 (PHQ-9) and Generalized Anxiety Disorder Scale-7 (GAD-7). These new measures were referred to as GPT-PHQ-9 and GPT-GAD-7. The study used Spearman correlation analysis, intraclass correlation coefficients (ICC), Youden's index, receiver operating characteristic (ROC) curves and Bland-Altman plots to evaluate the consistency between scores from the ChatGPT-4-adapted questionnaires and those from the validated questionnaires. RESULTS A total of 200 college students participated. Cronbach's α indicated acceptable reliability for both the GPT-PHQ-9 (α = 0.75) and GPT-GAD-7 (α = 0.76). ICC values were 0.80 for the PHQ-9 and 0.70 for the GAD-7. Spearman's correlation showed moderate associations with the PHQ-9 (ρ = 0.63) and GAD-7 (ρ = 0.68). ROC curve analysis revealed optimal cutoffs of 9.5 for depressive symptoms and 6.5 for anxiety symptoms, both with high sensitivity and specificity. CONCLUSIONS The questionnaires adapted by ChatGPT-4 demonstrated good consistency with the validated questionnaires. Future studies should investigate the usefulness of ChatGPT-designed questionnaires in different populations.
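A minimal sketch of deriving an optimal screening cutoff from an ROC curve via Youden's index, as in the analysis described above; the scores and reference labels below are simulated, not the study's data:

```python
# Hypothetical sketch: optimal screening cutoff via Youden's index on simulated data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(2)
n = 200
depressed = rng.integers(0, 2, n)                     # reference label from the validated PHQ-9
gpt_phq9 = np.where(depressed == 1,
                    rng.normal(13, 4, n),             # higher GPT-PHQ-9 scores if depressed
                    rng.normal(6, 3, n)).clip(0, 27)  # lower scores otherwise

fpr, tpr, thresholds = roc_curve(depressed, gpt_phq9)
youden = tpr - fpr                                    # Youden's index at each threshold
best = youden.argmax()
print(f"AUC = {roc_auc_score(depressed, gpt_phq9):.2f}")
print(f"Optimal cutoff = {thresholds[best]:.1f} "
      f"(sensitivity {tpr[best]:.2f}, specificity {1 - fpr[best]:.2f})")
```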
Affiliation(s)
- Jiali Liu
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Juan Gu
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Mengjie Tong
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Yake Yue
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Yufei Qiu
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Lijuan Zeng
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Yiqing Yu
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China
- Fen Yang
- School of Nursing, Hubei University of Chinese Medicine, No. 16 West Huangjiahu Road, Hongshan District, Wuhan, 430065, China.
- Hubei Shizhen Laboratory, Wuhan, China.
- Nursing Department, Hubei Provincial Hospital of Traditional Chinese Medicine, No. 856 Luoyu Road, Hongshan District, Wuhan, Hubei, China.
- Shuyan Zhao
- Nursing Department, Hubei Provincial Hospital of Traditional Chinese Medicine, No. 856 Luoyu Road, Hongshan District, Wuhan, Hubei, China.

7. Arbel Y, Gimmon Y, Shmueli L. Evaluating the Potential of Large Language Models for Vestibular Rehabilitation Education: A Comparison of ChatGPT, Google Gemini, and Clinicians. Phys Ther 2025; 105:pzaf010. [PMID: 39932784] [PMCID: PMC11994992] [DOI: 10.1093/ptj/pzaf010]
Abstract
OBJECTIVE This study aimed to compare the performance of 2 large language models, ChatGPT (Generative Pre-trained Transformer) and Google Gemini, against experienced physical therapists and students in responding to multiple-choice questions related to vestibular rehabilitation. The study further aimed to have board-certified otoneurologists assess the accuracy of ChatGPT's responses. METHODS This study was conducted among 30 physical therapists experienced with vestibular rehabilitation and 30 physical therapist students. They were asked to complete a vestibular knowledge test (VKT) consisting of 20 multiple-choice questions that were divided into 3 categories: (1) Clinical Knowledge, (2) Basic Clinical Practice, and (3) Clinical Reasoning. ChatGPT and Google Gemini were tasked with answering the same 20 VKT questions. Three board-certified otoneurologists independently evaluated the accuracy of each response using a 4-level scale, ranging from comprehensive to completely incorrect. RESULTS ChatGPT outperformed Google Gemini with a 70% score on the VKT test, while Gemini scored 60%. Both excelled in Clinical Knowledge, scoring 100%, but struggled in Clinical Reasoning, with ChatGPT scoring 50% and Gemini scoring 25%. According to the 3 otoneurologic experts, ChatGPT's accuracy was considered "comprehensive" in 45% of the 20 questions, while 25% were found to be completely incorrect. ChatGPT provided "comprehensive" responses in 50% of Clinical Knowledge and Basic Clinical Practice questions, but only 25% in Clinical Reasoning. CONCLUSION Caution is advised when using ChatGPT and Google Gemini due to their limited accuracy in clinical reasoning. While they provide accurate responses concerning Clinical Knowledge, their reliance on web information may lead to inconsistencies. ChatGPT performed better than Gemini. Health care professionals should carefully formulate questions and be aware of the potential influence of the online prevalence of information on ChatGPT's and Google Gemini's responses. Combining clinical expertise and clinical guidelines with ChatGPT and Google Gemini can maximize benefits while mitigating limitations. The results are based on the current ChatGPT-3.5 and Google Gemini models. Future iterations of these models are expected to offer improved accuracy as the underlying modeling and algorithms are further refined. IMPACT This study highlights the potential utility of large language models like ChatGPT in supplementing clinical knowledge for physical therapists, while underscoring the need for caution in domains requiring complex clinical reasoning. The findings emphasize the importance of integrating technological tools carefully with human expertise to enhance patient care and rehabilitation outcomes.
Affiliation(s)
- Yael Arbel
- Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel
- Yoav Gimmon
- Department of Physical Therapy, Faculty of Social Welfare & Health Studies, University of Haifa, Haifa, Israel
- Department of Otolaryngology–Head and Neck Surgery, Sheba Medical Center, Tel-Hashomer, Israel
- Liora Shmueli
- Department of Management, Bar-Ilan University, Ramat-Gan 52900, Israel

8. Bereuter JP, Geissler ME, Klimova A, Steiner RP, Pfeiffer K, Kolbinger FR, Wiest IC, Muti HS, Kather JN. Benchmarking Vision Capabilities of Large Language Models in Surgical Examination Questions. Journal of Surgical Education 2025; 82:103442. [PMID: 39923296] [DOI: 10.1016/j.jsurg.2025.103442]
Abstract
OBJECTIVE Recent studies have investigated the potential of large language models (LLMs) for clinical decision making and answering exam questions based on text input. Recent developments have extended these models with vision capabilities; such image-processing LLMs are called vision-language models (VLMs). However, there has been limited investigation into the applicability of VLMs and their capability to answer exam questions with image content. Therefore, the aim of this study was to examine the performance of publicly accessible LLMs in 2 different surgical question sets consisting of text and image questions. DESIGN Original text and image exam questions from 2 different surgical question subsets from the German Medical Licensing Examination (GMLE) and United States Medical Licensing Examination (USMLE) were collected and answered by publicly available LLMs (GPT-4, Claude-3 Sonnet, Gemini-1.5). LLM outputs were benchmarked for their accuracy in answering text and image questions. Additionally, the LLMs' performance was compared to students' performance based on their average historical performance (AHP) in these exams. Moreover, variations in LLM performance were analyzed in relation to question difficulty and image type. RESULTS Overall, all LLMs achieved scores equivalent to passing grades (≥60%) on surgical text questions across both datasets. On image-based questions, only GPT-4 exceeded the score required to pass, significantly outperforming Claude-3 and Gemini-1.5 (GPT: 78% vs. Claude-3: 58% vs. Gemini-1.5: 57.3%; p < 0.001). Additionally, GPT-4 outperformed students on both text (GPT: 83.7% vs. AHP students: 67.8%; p < 0.001) and image questions (GPT: 78% vs. AHP students: 67.4%; p < 0.001). CONCLUSION GPT-4 demonstrated substantial capabilities in answering surgical text and image exam questions. Therefore, it holds considerable potential for use in surgical decision making and in the education of students and trainee surgeons.
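A minimal sketch of how an image-based exam question might be submitted to a vision-capable chat model, in the spirit of the benchmarking described above; the model name, encoding approach, file name, and question are illustrative assumptions:

```python
# Hypothetical sketch: sending an image-based exam question to a vision-capable
# chat model. Model name, prompt, and image path are illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def ask_image_question(image_path: str, question: str, model: str = "gpt-4o") -> str:
    # Encode the question image as base64 so it can be embedded in the request.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content

# Example call (illustrative file name and question):
# print(ask_image_question("exam_figure.png",
#                          "Which finding is shown? Answer with the option letter only."))
```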
Affiliation(s)
- Jean-Paul Bereuter
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany.
- Mark Enrik Geissler
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Anna Klimova
- Institute for Medical Informatics and Biometry, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Robert-Patrick Steiner
- Institute of Pharmacology and Toxicology, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Kevin Pfeiffer
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany
- Fiona R Kolbinger
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Weldon School of Biomedical Engineering, Purdue University, West Lafayette, Indiana
- Isabella C Wiest
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Department of Medicine II, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
- Hannah Sophie Muti
- Department of Visceral, Thoracic and Vascular Surgery, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany
- Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases, University Hospital Heidelberg, Heidelberg, Germany; Department of Medicine I, Faculty of Medicine and University Hospital Carl Gustav Carus, TUD Dresden University of Technology, Dresden, Germany

9. AlGain S, Marra AR, Kobayashi T, Marra PS, Celeghini PD, Hsieh MK, Shatari MA, Althagafi S, Alayed M, Ranavaya JI, Boodhoo NA, Meade NO, Fu D, Sampson MM, Rodriguez-Nava G, Zimmet AN, Ha D, Alsuhaibani M, Huddleston BS, Salinas JL. Can we rely on artificial intelligence to guide antimicrobial therapy? A systematic literature review. Antimicrobial Stewardship & Healthcare Epidemiology (ASHE) 2025; 5:e90. [PMID: 40226293] [PMCID: PMC11986881] [DOI: 10.1017/ash.2025.47]
Abstract
Background Artificial intelligence (AI) has the potential to enhance clinical decision-making, including in infectious diseases. By improving antimicrobial resistance prediction and optimizing antibiotic prescriptions, these technologies may support treatment strategies and address critical gaps in healthcare. This study evaluates the effectiveness of AI in guiding appropriate antibiotic prescriptions for infectious diseases through a systematic literature review. Methods We conducted a systematic review of studies evaluating AI (machine learning or large language models) used for guidance on prescribing appropriate antibiotics in infectious disease cases. Searches were performed in PubMed, CINAHL, Embase, Scopus, Web of Science, and Google Scholar for articles published up to October 25, 2024. Inclusion criteria focused on studies assessing the performance of AI in clinical practice, with outcomes related to antimicrobial management and decision-making. Results Seventeen studies used machine learning as part of clinical decision support systems (CDSS). They improved prediction of antimicrobial resistance and optimized antimicrobial use. Six studies focused on large language models to guide antimicrobial therapy; they had higher prescribing error rates, patient safety risks, and needed precise prompts to ensure accurate responses. Conclusions AI, particularly machine learning integrated into CDSS, holds promise in enhancing clinical decision-making and improving antimicrobial management. However, large language models currently lack the reliability required for complex clinical applications. The indispensable role of infectious disease specialists remains critical for ensuring accurate, personalized, and safe treatment strategies. Rigorous validation and regular updates are essential before the successful integration of AI into clinical practice.
Affiliation(s)
- Sulwan AlGain
- King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
- Division of Infectious Diseases & Geographic Medicine, Stanford University, Stanford, CA, USA
- Alexandre R. Marra
- Hospital Israelita Albert Einstein, São Paulo, SP, Brazil
- University of Iowa Hospitals and Clinics, Iowa City, IA, USA
- Takaaki Kobayashi
- University of Iowa Hospitals and Clinics, Iowa City, IA, USA
- Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Pedro S. Marra
- School of Medicine, University of California, San Francisco, San Francisco, CA, USA
- Samiyah Althagafi
- Pediatric Infectious Diseases, King Abdullah Specialized Children’s Hospital, MNGHA, Jeddah, Saudi Arabia
- Maria Alayed
- King Faisal Specialist Hospital and Research Center, Riyadh, Saudi Arabia
- Jamila I Ranavaya
- Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Nicole A. Boodhoo
- Department of Epidemiology, University of Iowa College of Public Health, Iowa City, IA, USA
- Nicholas O. Meade
- Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Daniel Fu
- Pritzker School of Medicine, University of Chicago, Chicago, IL, USA
- Mindy Marie Sampson
- Division of Infectious Diseases & Geographic Medicine, Stanford University, Stanford, CA, USA
- Alex N. Zimmet
- Division of Infectious Diseases & Geographic Medicine, Stanford University, Stanford, CA, USA
- David Ha
- Division of Infectious Diseases & Geographic Medicine, Stanford University, Stanford, CA, USA
- Jorge L. Salinas
- Division of Infectious Diseases & Geographic Medicine, Stanford University, Stanford, CA, USA

10. Murthy AB, Palaniappan V, Radhakrishnan S, Rajaa S, Karthikeyan K. A Comparative Analysis of the Performance of Large Language Models and Human Respondents in Dermatology. Indian Dermatol Online J 2025; 16:241-247. [PMID: 40125046] [PMCID: PMC11927985] [DOI: 10.4103/idoj.idoj_221_24]
Abstract
Background With the growing interest in generative artificial intelligence (AI), the scientific community is witnessing the vast utility of large language models (LLMs) with chat interfaces such as ChatGPT and Microsoft Bing Chat in the medical field and research. This study aimed to investigate the accuracy of ChatGPT and Microsoft Bing Chat to answer questions on Dermatology, Venereology, and Leprosy, the frequency of artificial hallucinations, and to compare their performance with human respondents. Aim and Objectives The primary objective of the study was to compare the knowledge and interpretation abilities of LLMs (ChatGPT v3.5 and Microsoft Bing Chat) with human respondents (12 final-year postgraduates) and the secondary objective was to assess the incidence of artificial hallucinations with 60 questions prepared by the authors, including multiple choice questions (MCQs), fill-in-the-blanks and scenario-based questions. Materials and Methods The authors accessed two commercially available large language models (LLMs) with chat interfaces namely ChatGPT version 3.5 (OpenAI; San Francisco, CA) and Microsoft Bing Chat from August 10th to August 23rd, 2023. Results In our testing set of 60 questions, Bing Chat outperformed ChatGPT and human respondents with a mean correct response score of 46.9 ± 0.7. The mean correct responses by ChatGPT and human respondents were 35.9 ± 0.5 and 25.8 ± 11.0, respectively. The overall accuracy of human respondents, ChatGPT and Bing Chat was observed to be 43%, 59.8%, and 78.2%, respectively. Of the MCQs, fill-in-the-blanks, and scenario-based questions, Bing Chat had the highest accuracy in all types of questions with statistical significance (P < 0.001 by ANOVA test). Topic-wise assessment of the performance of LLMs showed that Bing Chat performed better in all topics except vascular disorders, inflammatory disorders, and leprosy. Bing Chat performed better in answering easy and medium-difficulty questions with accuracies of 85.7% and 78%, respectively. In comparison, ChatGPT performed well on hard questions with an accuracy of 55% with statistical significance (P < 0.001 by ANOVA test). The mean number of questions answered by the human respondents among the 10 questions with multiple correct responses was 3 ± 1.4. The accuracy of LLMs in answering questions with multiple correct responses was assessed by employing two prompts. ChatGPT and Bing Chat could answer 3.1 ± 0.3 and 4 ± 0 questions respectively without prompting. On evaluating the ability of logical reasoning by the LLMs, it was found that ChatGPT gave logical reasoning in 47 ± 0.4 questions and Bing Chat in 53.9 ± 0.5 questions, irrespective of the correctness of the responses. ChatGPT exhibited artificial hallucination in 4 questions, even with 12 repeated inputs, which was not observed in Bing chat. Limitations Variability in respondent accuracy, a small question set, and exclusion of newer AI models and image-based assessments. Conclusion This study showed an overall better performance of LLMs compared to human respondents. However, the LLMs were less accurate than respondents in topics like inflammatory disorders and leprosy. Proper regulations concerning the use of LLMs are the need of the hour to avoid potential misuse.
Affiliation(s)
- Aravind Baskar Murthy
- Department of Dermatology, Venereology and Leprosy, SRM Medical College Hospital and Research Center, Chengalpet, Tamil Nadu, India
- Vijayasankar Palaniappan
- Department of Dermatology, Venereology and Leprosy, Sri Manakula Vinayagar Medical College and Hospital, Pondicherry, India
- Suganya Radhakrishnan
- Department of Dermatology, Venereology and Leprosy, Sri Manakula Vinayagar Medical College and Hospital, Pondicherry, India
- Sathish Rajaa
- Department of Community Medicine, ESIC Medical College and Hospital, Chennai, Tamil Nadu, India
- Kaliaperumal Karthikeyan
- Department of Dermatology, Venereology and Leprosy, Sri Manakula Vinayagar Medical College and Hospital, Pondicherry, India

11. Seifen C, Huppertz T, Gouveris H, Bahr-Hamm K, Pordzik J, Eckrich J, Smith H, Kelsey T, Blaikie A, Matthias C, Kuhn S, Buhr CR. Chasing sleep physicians: ChatGPT-4o on the interpretation of polysomnographic results. Eur Arch Otorhinolaryngol 2025; 282:1631-1639. [PMID: 39427271] [PMCID: PMC11890353] [DOI: 10.1007/s00405-024-08985-3]
Abstract
BACKGROUND From a healthcare professional's perspective, the use of ChatGPT (OpenAI), a large language model (LLM), offers huge potential as a practical and economic digital assistant. However, ChatGPT has not yet been evaluated for the interpretation of polysomnographic results in patients with suspected obstructive sleep apnea (OSA). AIMS/OBJECTIVES To evaluate the agreement of polysomnographic result interpretation between ChatGPT-4o and a board-certified sleep physician and to shed light on the role of ChatGPT-4o in medical decision-making in sleep medicine. MATERIAL AND METHODS For this proof-of-concept study, 40 comprehensive patient profiles were designed, representing a broad and typical spectrum of cases and ensuring a balanced distribution of demographics and clinical characteristics. After various prompts were tested, one prompt was used for the initial diagnosis of OSA and a further prompt for patients with positive airway pressure (PAP) therapy intolerance. Each polysomnographic result was independently evaluated by ChatGPT-4o and a board-certified sleep physician. Diagnosis and therapy suggestions were analyzed for agreement. RESULTS ChatGPT-4o and the sleep physician showed 97% (29/30) concordance in the diagnosis of the simple cases. For the same cases, the two assessment instances showed 100% (30/30) concordance regarding therapy suggestions. For cases with PAP therapy intolerance, ChatGPT-4o and the sleep physician revealed 70% (7/10) concordance in the diagnosis and 44% (22/50) concordance for therapy suggestions. CONCLUSION AND SIGNIFICANCE Precise prompting improves the output of ChatGPT-4o and provides sleep physician-like polysomnographic result interpretation. Although ChatGPT shows some shortcomings in offering treatment advice, our results provide evidence for AI-assisted automation and economization of polysomnographic interpretation by LLMs. Further research should explore data protection issues and demonstrate reproducibility with real patient data on a larger scale.
Affiliation(s)
- Christopher Seifen
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Tilman Huppertz
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Haralampos Gouveris
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Katharina Bahr-Hamm
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Johannes Pordzik
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Jonas Eckrich
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Harry Smith
- School of Computer Science, University of St Andrews, St Andrews, UK
- Tom Kelsey
- School of Computer Science, University of St Andrews, St Andrews, UK
- Andrew Blaikie
- School of Medicine, University of St Andrews, St Andrews, UK
- Christoph Matthias
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany
- Sebastian Kuhn
- Institute for Digital Medicine, University Hospital of Giessen and Marburg, Philipps-University Marburg, Marburg, Germany
- Christoph Raphael Buhr
- Sleep Medicine Center & Department of Otolaryngology, Head and Neck Surgery, University Medical Center Mainz, Mainz, Germany.
- School of Medicine, University of St Andrews, St Andrews, UK.

12. Nieves-Lopez B, Bechtle AR, Traverse J, Klifto C, Schoch BS, Aziz KT. Evaluating the Evolution of ChatGPT as an Information Resource in Shoulder and Elbow Surgery. Orthopedics 2025; 48:e69-e74. [PMID: 39879624] [DOI: 10.3928/01477447-20250123-03]
Abstract
BACKGROUND The purpose of this study was to evaluate the performance and evolution of Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI) as a resource for shoulder and elbow surgery information by assessing its accuracy on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. We hypothesized that both ChatGPT models would demonstrate proficiency and that there would be significant improvement with progressive iterations. MATERIALS AND METHODS A total of 200 questions were selected from the 2019 and 2021 American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. ChatGPT 3.5 and 4 were used to evaluate all questions. Questions with non-text data were excluded (114 questions). Remaining questions were input into ChatGPT and categorized as follows: anatomy, arthroplasty, basic science, instability, miscellaneous, nonoperative, and trauma. ChatGPT's performances were quantified and compared across categories with chi-square tests. The continuing medical education credit threshold of 50% was used to determine proficiency. Statistical significance was set at P<.05. RESULTS ChatGPT 3.5 and 4 answered 52.3% and 73.3% of the questions correctly, respectively (P=.003). ChatGPT 3.5 performed significantly better in the instability category (P=.037). ChatGPT 4's performance did not significantly differ across categories (P=.841). ChatGPT 4 performed significantly better than ChatGPT 3.5 in all categories except instability and miscellaneous. CONCLUSION ChatGPT 3.5 and 4 exceeded the proficiency threshold. ChatGPT 4 performed better than ChatGPT 3.5, showing an increased capability to correctly answer shoulder and elbow-focused questions. Further refinement of ChatGPT's training may improve its performance and utility as a resource. Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making. [Orthopedics. 2025;48(2):e69-e74.].

13. Sismanoglu S, Capan BS. Performance of artificial intelligence on Turkish dental specialization exam: can ChatGPT-4.0 and Gemini Advanced achieve comparable results to humans? BMC Medical Education 2025; 25:214. [PMID: 39930399] [PMCID: PMC11809121] [DOI: 10.1186/s12909-024-06389-9]
Abstract
BACKGROUND AI-powered chatbots have spread to various fields, including dental education and clinical assistance in treatment planning. The aim of this study was to assess and compare the performance of leading AI-powered chatbots on the dental specialization exam (DUS) administered in Turkey and to compare it with that year's best performer. METHODS DUS questions for 2020 and 2021 were directed to ChatGPT-4.0 and Gemini Advanced individually. DUS questions were manually entered into each AI-powered chatbot in their original form, in Turkish. The results obtained were compared with each other and with the year's best performers. Candidates who score at least 45 points on this centralized exam are deemed to have passed and are eligible to select their preferred department and institution. The data was statistically analyzed using Pearson's chi-squared test (p < 0.05). RESULTS ChatGPT-4.0 achieved a correct response rate of 83.3% on the 2020 exam, while Gemini Advanced achieved 65%. On the 2021 exam, ChatGPT-4.0 achieved a correct response rate of 80.5%, whereas Gemini Advanced achieved 60.2%. ChatGPT-4.0 outperformed Gemini Advanced in both exams (p < 0.05). The AI-powered chatbots had lower overall scores (for 2020: ChatGPT-4.0, 65.5 and Gemini Advanced, 50.1; for 2021: ChatGPT-4.0, 65.6 and Gemini Advanced, 48.6) than the best performer of each year (68.5 points for 2020 and 72.3 points for 2021). This weaker performance also extended to the basic sciences and clinical sciences sections (p < 0.001). Additionally, periodontology was the clinical specialty in which both AI-powered chatbots achieved the best results, while the lowest performance was observed in endodontics and orthodontics. CONCLUSION AI-powered chatbots, namely ChatGPT-4.0 and Gemini Advanced, passed the DUS by exceeding the threshold score of 45. However, they still lagged behind the top performers of each year, particularly in basic sciences, clinical sciences, and overall score. Additionally, they exhibited lower performance in some clinical specialties, such as endodontics and orthodontics.
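A minimal sketch of the Pearson chi-squared comparison named in the methods, applied to illustrative correct/incorrect counts for the two chatbots (not the study's actual counts):

```python
# Hypothetical sketch of a Pearson chi-squared comparison of two chatbots'
# correct/incorrect counts on one exam (counts are illustrative).
from scipy.stats import chi2_contingency

n_questions = 120                      # illustrative exam size
#        correct         incorrect
table = [[100, n_questions - 100],     # ChatGPT-4.0 (~83% correct)
         [78,  n_questions - 78]]      # Gemini Advanced (~65% correct)

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```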
Affiliation(s)
- Soner Sismanoglu
- Faculty of Dentistry, Department of Restorative Dentistry, Istanbul University-Cerrahpasa, Istanbul, Turkey
- Belen Sirinoglu Capan
- Faculty of Dentistry, Department of Pediatric Dentistry, Istanbul University-Cerrahpasa, Istanbul, Turkey.

14. Merlino DJ, Brufau SR, Saieed G, Van Abel KM, Price DL, Archibald DJ, Ator GA, Carlson ML. Comparative Assessment of Otolaryngology Knowledge Among Large Language Models. Laryngoscope 2025; 135:629-634. [PMID: 39305216] [DOI: 10.1002/lary.31781]
Abstract
OBJECTIVE The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open source model from Meta (Llama3:70b) in answering clinical test multiple choice questions in the field of otolaryngology-head and neck surgery. METHODS A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers. RESULTS GPT4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while llama3:70b, GPT3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT4 changed from incorrect to correct answer 31% of the time, while GPT3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively. CONCLUSION Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding in this field makes it well-suited to serve in roles related to head and neck surgery education provided that the appropriate precautions are taken and potential limitations are understood. LEVEL OF EVIDENCE NA Laryngoscope, 135:629-634, 2025.
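A minimal sketch of the two-step prompting pattern suggested by the reported gains from reasoning prompts (answer directly, then re-ask with a request for step-by-step reasoning); the model name and prompt wording are illustrative assumptions:

```python
# Hypothetical sketch: re-prompting a model to reason through a question it
# initially answered, as suggested by the reported accuracy gains. Model name
# and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"

def chat(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

question = "An otolaryngology multiple-choice question with options A-D goes here."

# Pass 1: direct answer.
first_answer = chat(f"{question}\nReply with the single best option letter only.")

# Pass 2: ask for step-by-step reasoning before committing to a letter.
reasoned_answer = chat(
    f"{question}\nThink through the anatomy and clinical reasoning step by step, "
    "then state the single best option letter on the last line."
)
print(first_answer, reasoned_answer, sep="\n---\n")
```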
Affiliation(s)
- Dante J Merlino
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Santiago R Brufau
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- George Saieed
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Kathryn M Van Abel
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Daniel L Price
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- David J Archibald
- The Center for Plastic Surgery at Castle Rock, Castle Rock, Colorado, U.S.A
- Gregory A Ator
- Department of Otolaryngology-Head and Neck Surgery, University of Kansas Medical Center, Kansas City, Kansas, U.S.A
- Matthew L Carlson
- Department of Otolaryngology-Head and Neck Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A
- Department of Neurologic Surgery, Mayo Clinic, Rochester, Minnesota, U.S.A

15. Aster A, Laupichler MC, Rockwell-Kollmann T, Masala G, Bala E, Raupach T. ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review. Medical Science Educator 2025; 35:555-567. [PMID: 40144083] [PMCID: PMC11933646] [DOI: 10.1007/s40670-024-02206-6]
Abstract
This review aims to provide a summary of all scientific publications on the use of large language models (LLMs) in medical education over the first year of their availability. A scoping literature review was conducted in accordance with the PRISMA recommendations for scoping reviews. Five scientific literature databases were searched using predefined search terms. The search yielded 1509 initial results, of which 145 studies were ultimately included. Most studies assessed LLMs' capabilities in passing medical exams. Some studies discussed advantages, disadvantages, and potential use cases of LLMs. Very few studies conducted empirical research. Many published studies lack methodological rigor. We therefore propose a research agenda to improve the quality of studies on LLMs.
Affiliation(s)
- Alexandra Aster
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Matthias Carl Laupichler
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tamina Rockwell-Kollmann
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Gilda Masala
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Ebru Bala
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tobias Raupach
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
Collapse
|
16
|
Tangsrivimol JA, Darzidehkalani E, Virk HUH, Wang Z, Egger J, Wang M, Hacking S, Glicksberg BS, Strauss M, Krittanawong C. Benefits, limits, and risks of ChatGPT in medicine. Front Artif Intell 2025; 8:1518049. [PMID: 39949509 PMCID: PMC11821943 DOI: 10.3389/frai.2025.1518049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2024] [Accepted: 01/15/2025] [Indexed: 02/16/2025] Open
Abstract
ChatGPT represents a transformative technology in healthcare, with demonstrated impacts across clinical practice, medical education, and research. Studies show significant efficiency gains, including 70% reduction in administrative time for discharge summaries and achievement of medical professional-level performance on standardized tests (60% accuracy on USMLE, 78.2% on PubMedQA). ChatGPT offers personalized learning platforms, automated scoring, and instant access to vast medical knowledge in medical education, addressing resource limitations and enhancing training efficiency. It streamlines clinical workflows by supporting triage processes, generating discharge summaries, and alleviating administrative burdens, allowing healthcare professionals to focus more on patient care. Additionally, ChatGPT facilitates remote monitoring and chronic disease management, providing personalized advice, medication reminders, and emotional support, thus bridging gaps between clinical visits. Its ability to process and synthesize vast amounts of data accelerates research workflows, aiding in literature reviews, hypothesis generation, and clinical trial designs. This paper aims to gather and analyze published studies involving ChatGPT, focusing on exploring its advantages and disadvantages within the healthcare context. To aid in understanding and progress, our analysis is organized into six key areas: (1) Information and Education, (2) Triage and Symptom Assessment, (3) Remote Monitoring and Support, (4) Mental Healthcare Assistance, (5) Research and Decision Support, and (6) Language Translation. Realizing ChatGPT's full potential in healthcare requires addressing key limitations, such as its lack of clinical experience, inability to process visual data, and absence of emotional intelligence. Ethical, privacy, and regulatory challenges further complicate its integration. Future improvements should focus on enhancing accuracy, developing multimodal AI models, improving empathy through sentiment analysis, and safeguarding against artificial hallucination. While not a replacement for healthcare professionals, ChatGPT can serve as a powerful assistant, augmenting their expertise to improve efficiency, accessibility, and quality of care. This collaboration ensures responsible adoption of AI in transforming healthcare delivery. While ChatGPT demonstrates significant potential in healthcare transformation, systematic evaluation of its implementation across different healthcare settings reveals varying levels of evidence quality-from robust randomized trials in medical education to preliminary observational studies in clinical practice. This heterogeneity in evidence quality necessitates a structured approach to future research and implementation.
Collapse
Affiliation(s)
- Jonathan A. Tangsrivimol
- Department of Neurosurgery, and Neuroscience, Weill Cornell Medicine, NewYork-Presbyterian Hospital, New York, NY, United States
- Department of Neurosurgery, Chulabhorn Hospital, Chulabhorn Royal Academy, Bangkok, Thailand
| | - Erfan Darzidehkalani
- MIT Computer Science & Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, United States
| | - Hafeez Ul Hassan Virk
- Harrington Heart & Vascular Institute, University Hospitals Cleveland Medical Center, Case Western Reserve University, Cleveland, OH, United States
| | - Zhen Wang
- Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, United States
- Division of Health Care Policy and Research, Department of Health Sciences Research, Mayo Clinic, Rochester, MN, United States
| | - Jan Egger
- Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Essen, Germany
| | - Michelle Wang
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, United States
| | - Sean Hacking
- Department of Pathology, NYU Grossman School of Medicine, New York, NY, United States
| | - Benjamin S. Glicksberg
- Hasso Plattner Institute for Digital Health, Icahn School of Medicine at Mount Sinai, New York, NY, United States
| | - Markus Strauss
- Department of Cardiology I, Coronary and Peripheral Vascular Disease, Heart Failure Medicine, University Hospital Muenster, Muenster, Germany
- Department of Cardiology, Sector Preventive Medicine, Health Promotion, Faculty of Health, School of Medicine, University Witten/Herdecke, Hagen, Germany
| | - Chayakrit Krittanawong
- Cardiology Division, New York University Langone Health, New York University School of Medicine, New York, NY, United States
- HumanX, Delaware, DE, United States
| |
Collapse
|
17
|
Goshtasbi K, Best C, Powers B, Ching H, Pastorek NJ, Altman D, Adamson P, Krugman M, Wong BJF. Comparative Performance of the Leading Large Language Models in Answering Complex Rhinoplasty Consultation Questions. Facial Plast Surg Aesthet Med 2025. [PMID: 39812574 DOI: 10.1089/fpsam.2024.0206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/16/2025] Open
Abstract
Background: Various large language models (LLMs) can provide human-level medical discussions, but they have not been compared regarding rhinoplasty knowledge. Objective: To compare the leading LLMs in answering complex rhinoplasty consultation questions as evaluated by plastic surgeons. Methods: Ten open-ended rhinoplasty consultation questions were presented to ChatGPT-4o, Google Gemini, Claude, and Meta-AI LLMs. The responses were randomized and ranked for quality by seven rhinoplasty-specializing plastic surgeons (1 = worst, 4 = best). Textual readability was analyzed via Flesch Reading Ease (FRE) and Flesch-Kincaid Grade (FKG). Results: Claude provided the top answers for seven questions, while ChatGPT provided the top answers for three questions. In overall collective scoring, Claude provided the best answers with 224 points, followed by ChatGPT with 200 points and Meta and Gemini with 138 points each. Claude (mean score/question 3.20 ± 1.00) significantly outperformed all the other models (p < 0.05), while ChatGPT (mean score/question 2.86 ± 0.94) outperformed Meta and Gemini. Meta and Gemini performed similarly. Meta had a significantly lower FKG than Claude and ChatGPT and a significantly lower FRE than ChatGPT. Conclusion: According to ratings by seven rhinoplasty-specializing surgeons, Claude provided the best answers for a set of complex rhinoplasty consultation questions, followed by ChatGPT. Future studies are warranted to continue comparing these models as they evolve.
Collapse
Affiliation(s)
- Khodayar Goshtasbi
- Department of Otolaryngology-Head and Neck Surgery, University of California, Irvine, California, USA
- Beckman Laser Institute, University of California, Irvine, California, USA
| | - Corliss Best
- Department of Otolaryngology, University of Ottawa, Ottawa, Canada
| | | | - Harry Ching
- Department of Otolaryngology-Head and Neck Surgery, University of Nevada, Las Vegas, Nevada, USA
| | | | - Donald Altman
- Irvine Plastic Surgery Center, Irvine, California, USA
| | - Peter Adamson
- Department of Otolaryngology-Head and Neck Surgery, University of Toronto, Toronto, Canada
| | - Mark Krugman
- Department of Otolaryngology-Head and Neck Surgery, University of California, Irvine, California, USA
| | - Brian J F Wong
- Department of Otolaryngology-Head and Neck Surgery, University of California, Irvine, California, USA
- Beckman Laser Institute, University of California, Irvine, California, USA
| |
Collapse
|
18
|
Zhang Y, Lu X, Luo Y, Zhu Y, Ling W. Performance of Artificial Intelligence Chatbots on Ultrasound Examinations: Cross-Sectional Comparative Analysis. JMIR Med Inform 2025; 13:e63924. [PMID: 39814698 PMCID: PMC11737282 DOI: 10.2196/63924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 10/23/2024] [Accepted: 11/19/2024] [Indexed: 01/18/2025] Open
Abstract
Background Artificial intelligence chatbots are being increasingly used for medical inquiries, particularly in the field of ultrasound medicine. However, their performance varies and is influenced by factors such as language, question type, and topic. Objective This study aimed to evaluate the performance of ChatGPT and ERNIE Bot in answering ultrasound-related medical examination questions, providing insights for users and developers. Methods We curated 554 questions from ultrasound medicine examinations, covering various question types and topics. The questions were posed in both English and Chinese. Objective questions were scored based on accuracy rates, whereas subjective questions were rated by 5 experienced doctors using a Likert scale. The data were analyzed in Excel. Results Of the 554 questions included in this study, single-choice questions comprised the largest share (354/554, 64%), followed by short answers (69/554, 12%) and noun explanations (63/554, 11%). The accuracy rates for objective questions ranged from 8.33% to 80%, with true or false questions scoring highest. Subjective questions received acceptability rates ranging from 47.62% to 75.36%. ERNIE Bot was superior to ChatGPT in many aspects (P<.05). Both models showed a performance decline in English, but ERNIE Bot's decline was less significant. The models performed better in terms of basic knowledge, ultrasound methods, and diseases than in terms of ultrasound signs and diagnosis. Conclusions Chatbots can provide valuable ultrasound-related answers, but performance differs by model and is influenced by language, question type, and topic. In general, ERNIE Bot outperforms ChatGPT. Users and developers should understand model performance characteristics and select appropriate models for different questions and languages to optimize chatbot use.
Collapse
Affiliation(s)
- Yong Zhang
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Xiao Lu
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Yan Luo
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| | - Ying Zhu
- Department of Thoracic Surgery, West China Hospital of Sichuan University, Chengdu, China
| | - Wenwu Ling
- Department of Medical Ultrasound, West China Hospital of Sichuan University, 37 Guoxue Alley, Chengdu, 610041, China, 86 18980605569
| |
Collapse
|
19
|
Li CP, Jakob J, Menge F, Reißfelder C, Hohenberger P, Yang C. Comparing ChatGPT-3.5 and ChatGPT-4's alignments with the German evidence-based S3 guideline for adult soft tissue sarcoma. iScience 2024; 27:111493. [PMID: 39759026 PMCID: PMC11699281 DOI: 10.1016/j.isci.2024.111493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2024] [Revised: 10/02/2024] [Accepted: 11/26/2024] [Indexed: 01/07/2025] Open
Abstract
Clinical reliability assessment of large language models is necessary due to their increasing use in healthcare. This study assessed the performance of ChatGPT-3.5 and ChatGPT-4 in answering questions derived from the German evidence-based S3 guideline for adult soft tissue sarcoma (STS). Responses to 80 complex clinical questions covering diagnosis, treatment, and surveillance aspects were independently scored by two sarcoma experts for accuracy and adequacy. ChatGPT-4 outperformed ChatGPT-3.5 overall, with higher median scores in both accuracy (5.5 vs. 5.0) and adequacy (5.0 vs. 4.0). While both versions performed similarly on questions about retroperitoneal/visceral sarcoma and gastrointestinal stromal tumor (GIST)-specific treatment as well as questions about surveillance, ChatGPT-4 performed better on questions about general STS treatment and extremity/trunk sarcomas. Despite their potential as a supportive tool, both models occasionally offered misleading and potentially life-threatening information. This underscores the importance of cautious adoption and human oversight in clinical settings.
Collapse
Affiliation(s)
- Cheng-Peng Li
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Sarcoma Center, Peking University Cancer Hospital & Institute, Beijing, China
- Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
| | - Jens Jakob
- Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
| | - Franka Menge
- Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
| | - Christoph Reißfelder
- Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
- DKFZ-Hector Cancer Institute, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
| | - Peter Hohenberger
- Division of Surgical Oncology and Thoracic Surgery, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
| | - Cui Yang
- Department of Surgery, University Medical Center Mannheim, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
- AI Health Innovation Cluster, German Cancer Research Center (DKFZ), Heidelberg, Germany
| |
Collapse
|
20
|
Bortoli M, Fiore M, Tedeschi S, Oliveira V, Sousa R, Bruschi A, Campanacci DA, Viale P, De Paolis M, Sambri A. GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections. Musculoskelet Surg 2024; 108:459-466. [PMID: 38954323 PMCID: PMC11582126 DOI: 10.1007/s12306-024-00846-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Accepted: 06/21/2024] [Indexed: 07/04/2024]
Abstract
BACKGROUND Artificial intelligence chatbot tools might discern patterns and correlations that may elude human observation, leading to more accurate and timely interventions. However, their reliability in answering healthcare-related questions is still debated. This study aimed to assess the performance of three GPT-based chatbots in answering questions about prosthetic joint infections (PJI). METHODS Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed by intraclass correlation statistics. RESULTS Responses averaged "good-to-very good" for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor. CONCLUSIONS On average, the quality of responses was rated positively by the experts, but individual ratings frequently varied widely. This currently suggests that AI chatbot tools are still unreliable in the management of PJI.
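For readers unfamiliar with the inter-rater statistic mentioned above, the following minimal sketch (editor-added, with invented ratings and assuming the pingouin package) shows how an intraclass correlation coefficient can be computed from Likert-style scores given by several raters.

```python
# Minimal sketch with invented data: estimating inter-rater agreement for
# Likert-style quality ratings via an intraclass correlation coefficient (ICC).
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater":    ["A", "B", "C"] * 3,
    "score":    [4, 5, 4, 2, 3, 2, 5, 5, 4],   # 1-5 Likert-like scores
})

icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # low ICC values indicate poor agreement
```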
Collapse
Affiliation(s)
- M Bortoli
- Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| | - M Fiore
- Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy.
- Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy.
| | - S Tedeschi
- Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy
- Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| | - V Oliveira
- Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
| | - R Sousa
- Department of Orthopedics, Centro Hospitalar Universitário de Santo António, 4099-001, Porto, Portugal
| | - A Bruschi
- Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| | - D A Campanacci
- Orthopedic Oncology Unit, Azienda Ospedaliera Universitaria Careggi, 50134, Florence, Italy
| | - P Viale
- Department of Medical and Surgical Sciences, Alma Mater Studiorum University of Bologna, 40138, Bologna, Italy
- Infectious Disease Unit, Department for Integrated Infectious Risk Management, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| | - M De Paolis
- Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| | - A Sambri
- Orthopedic and Traumatology Unit, IRCCS Azienda Ospedaliero-Universitaria Di Bologna, 40138, Bologna, Italy
| |
Collapse
|
21
|
Abhari S, Afshari Y, Fatehi F, Salmani H, Garavand A, Chumachenko D, Zakerabasali S, Morita PP. Exploring ChatGPT in clinical inquiry: a scoping review of characteristics, applications, challenges, and evaluation. Ann Med Surg (Lond) 2024; 86:7094-7104. [PMID: 39649918 PMCID: PMC11623824 DOI: 10.1097/ms9.0000000000002716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/19/2024] [Accepted: 10/25/2024] [Indexed: 12/11/2024] Open
Abstract
Introduction Recent advancements in generative AI, exemplified by ChatGPT, hold promise for healthcare applications such as decision-making support, education, and patient engagement. However, rigorous evaluation is crucial to ensure reliability and safety in clinical contexts. This scoping review explores ChatGPT's role in clinical inquiry, focusing on its characteristics, applications, challenges, and evaluation. Methods This review, conducted in 2023, followed PRISMA-ScR guidelines (Supplemental Digital Content 1, http://links.lww.com/MS9/A636). Searches were performed across PubMed, Scopus, IEEE, Web of Science, Cochrane, and Google Scholar using relevant keywords. The review explored ChatGPT's effectiveness in various medical domains, evaluation methods, target users, and comparisons with other AI models. Data synthesis and analysis incorporated both quantitative and qualitative approaches. Results Analysis of 41 academic studies highlights ChatGPT's potential in medical education, patient care, and decision support, though performance varies by medical specialty and linguistic context. GPT-3.5, frequently referenced in 26 studies, demonstrated adaptability across diverse scenarios. Challenges include limited access to official answer keys and inconsistent performance, underscoring the need for ongoing refinement. Evaluation methods, including expert comparisons and statistical analyses, provided significant insights into ChatGPT's efficacy. The identification of target users, such as medical educators and nonexpert clinicians, illustrates its broad applicability. Conclusion ChatGPT shows significant potential in enhancing clinical practice and medical education. Nevertheless, continuous refinement is essential for its successful integration into healthcare, aiming to improve patient care outcomes, and address the evolving needs of the medical community.
Collapse
Affiliation(s)
- Shahabeddin Abhari
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
| | - Yasna Afshari
- Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center Rotterdam, Rotterdam
- Department of Epidemiology, Erasmus MC University Medical Center Rotterdam, Rotterdam, The Netherlands
| | - Farhad Fatehi
- Business School, The University of Queensland, Brisbane, Australia
| | - Hosna Salmani
- Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
| | - Ali Garavand
- Department of Health Information Technology, School of Allied Medical Sciences, Lorestan University of Medical Sciences, Khorramabad, Iran
| | - Dmytro Chumachenko
- Department of Mathematical Modeling and Artificial Intelligence, National Aerospace University ‘Kharkiv Aviation Institute’, Kharkiv, Ukraine
| | - Somayyeh Zakerabasali
- Department of Health Information Management, Clinical Education Research Center, Health Human Resources Research Center, School of Health Management and Information Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
| | - Plinio P. Morita
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
- Department of Systems Design Engineering, University of Waterloo
- Research Institute for Aging, University of Waterloo, Waterloo, Ontario, Canada
- Centre for Digital Therapeutics, Techna Institute, University Health Network, Toronto
- Dalla Lana School of Public Health, Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
22
|
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
| |
Collapse
|
23
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton E, Malin B, Yin Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024; 26:e22769. [PMID: 39509695 PMCID: PMC11582494 DOI: 10.2196/22769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2024] [Revised: 09/19/2024] [Accepted: 10/03/2024] [Indexed: 11/15/2024] Open
Abstract
BACKGROUND The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. OBJECTIVE This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. METHODS We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. RESULTS Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with a relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. CONCLUSIONS Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Ellen Clayton
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Law, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Bradley Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| |
Collapse
|
24
|
Bicknell BT, Butler D, Whalen S, Ricks J, Dixon CJ, Clark AB, Spaedy O, Skelton A, Edupuganti N, Dzubinski L, Tate H, Dyess G, Lindeman B, Lehmann LS. ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis. JMIR MEDICAL EDUCATION 2024; 10:e63430. [PMID: 39504445 PMCID: PMC11611793 DOI: 10.2196/63430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/19/2024] [Revised: 09/02/2024] [Accepted: 09/14/2024] [Indexed: 09/16/2024]
Abstract
Background Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. Methods This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances. Results GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3). Conclusions GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.
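As a small, editor-added illustration of how an accuracy figure and its uncertainty can be reported, the sketch below computes a 95% confidence interval for a proportion with statsmodels; the count of 678/750 is simply the 90.4% reported for GPT-4o re-expressed as a fraction, not an additional result.

```python
# Sketch: accuracy of 678/750 (≈90.4%) with a Wilson 95% confidence interval.
from statsmodels.stats.proportion import proportion_confint

correct, total = 678, 750
accuracy = correct / total
low, high = proportion_confint(correct, total, alpha=0.05, method="wilson")
print(f"accuracy = {accuracy:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```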
Collapse
Affiliation(s)
- Brenton T Bicknell
- UAB Heersink School of Medicine, 1670 University Blvd, Birmingham, AL, 35233, United States, 1 2566539498
| | - Danner Butler
- University of South Alabama Whiddon College of Medicine, Mobile, AL, United States
| | - Sydney Whalen
- University of Illinois College of Medicine, Chicago, IL, United States
| | - James Ricks
- Harvard Medical School, Boston, MA, United States
| | - Cory J Dixon
- Alabama College of Osteopathic Medicine, Dothan, AL, United States
| | | | - Olivia Spaedy
- Saint Louis University School of Medicine, St. Louis, MO, United States
| | - Adam Skelton
- UAB Heersink School of Medicine, 1670 University Blvd, Birmingham, AL, 35233, United States, 1 2566539498
| | - Neel Edupuganti
- Medical College of Georgia, Augusta University, Augusta, GA, United States
| | - Lance Dzubinski
- University of Colorado Anschutz Medical Campus School of Medicine, Aurora, CO, United States
| | - Hudson Tate
- UAB Heersink School of Medicine, 1670 University Blvd, Birmingham, AL, 35233, United States, 1 2566539498
| | - Garrett Dyess
- University of South Alabama Whiddon College of Medicine, Mobile, AL, United States
| | - Brenessa Lindeman
- UAB Heersink School of Medicine, 1670 University Blvd, Birmingham, AL, 35233, United States, 1 2566539498
| | - Lisa Soleymani Lehmann
- Harvard Medical School, Boston, MA, United States
- Mass General Brigham, Boston, MA, United States
| |
Collapse
|
25
|
Gritti MN, Prajapati R, Yissar D, Morgan CT. Precision of artificial intelligence in paediatric cardiology multimodal image interpretation. Cardiol Young 2024; 34:2349-2354. [PMID: 39526423 DOI: 10.1017/s1047951124036035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 11/16/2024]
Abstract
Multimodal imaging is crucial for diagnosis and treatment in paediatric cardiology. However, the proficiency of artificial intelligence chatbots, like ChatGPT-4, in interpreting these images has not been assessed. This cross-sectional study evaluates the precision of ChatGPT-4 in interpreting multimodal images for paediatric cardiology knowledge assessment, including echocardiograms, angiograms, X-rays, and electrocardiograms. One hundred multiple-choice questions with accompanying images from the textbook Pediatric Cardiology Board Review were randomly selected. The chatbot was prompted to answer these questions with and without the accompanying images. Statistical analysis was done using chi-square (χ2), Fisher's exact, and McNemar tests. Results showed that ChatGPT-4 answered 41% of questions with images correctly, performing best on those with electrocardiograms (54%) and worst on those with angiograms (29%). Without the images, ChatGPT-4's performance was similar at 37% (difference = 4%, 95% confidence interval (CI) -9.4% to 17.2%, p = 0.56). The chatbot performed significantly better when provided the image of an electrocardiogram than without (difference = 18%, 95% CI 4.0% to 31.9%, p < 0.04). In cases of incorrect answers, ChatGPT-4 was more inconsistent with an image than without (difference = 21%, 95% CI 3.5% to 36.9%, p < 0.02). In conclusion, ChatGPT-4 performed poorly in answering image-based multiple-choice questions in paediatric cardiology. Its accuracy in answering questions with images was similar to without, indicating limited multimodal image interpretation capabilities. Substantial training is required before clinical integration can be considered. Further research is needed to assess the clinical reasoning skills and progression of ChatGPT in paediatric cardiology for clinical and academic utility.
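The paired with-image versus without-image comparison described above is the natural setting for McNemar's test. The editor-added sketch below uses invented 2x2 counts (not the study's data) and statsmodels to show the calculation.

```python
# Sketch with invented counts: McNemar's test on paired correct/incorrect
# outcomes for the same questions answered with and without the image.
from statsmodels.stats.contingency_tables import mcnemar

#        without-image correct | without-image wrong
table = [[30, 11],   # with-image correct
         [ 7, 52]]   # with-image wrong
result = mcnemar(table, exact=True)   # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
```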
Collapse
Affiliation(s)
- Michael N Gritti
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Paediatrics, University of Toronto, Toronto, Ontario, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Rahil Prajapati
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Dolev Yissar
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, Toronto, Ontario, Canada
| | - Conall T Morgan
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, Toronto, Ontario, Canada
- Department of Paediatrics, University of Toronto, Toronto, Ontario, Canada
- Temerty Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
26
|
Ostrowska M, Kacała P, Onolememen D, Vaughan-Lane K, Sisily Joseph A, Ostrowski A, Pietruszewska W, Banaszewski J, Wróbel MJ. To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries. Eur Arch Otorhinolaryngol 2024; 281:6069-6081. [PMID: 38652298 PMCID: PMC11512842 DOI: 10.1007/s00405-024-08643-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2024] [Accepted: 03/26/2024] [Indexed: 04/25/2024]
Abstract
PURPOSE As online health information-seeking surges, concerns mount over the quality and safety of accessible content, potentially leading to patient harm through misinformation. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could help prevent such harm; on the other hand, questions arise regarding the quality and safety of the medical information provided. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer. METHODS A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The reviewers consisted of 3 groups, namely ENT specialists, junior physicians, and non-medical reviewers, who graded the responses. Each physician evaluated each question twice for each model, while non-medical reviewers evaluated each only once. Everyone was blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations. RESULTS Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length. CONCLUSIONS LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.
Collapse
Affiliation(s)
- Magdalena Ostrowska
- Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Paulina Kacała
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Deborah Onolememen
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Katie Vaughan-Lane
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland.
| | - Anitta Sisily Joseph
- ENT Scientific Club, Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Adam Ostrowski
- Department of Urology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| | - Wioletta Pietruszewska
- Department of Otolaryngology, Laryngological Oncology, Audiology and Phoniatrics, Medical University of Lodz, ul Żeromskiego 113, 90-549, Lodz, Poland
| | - Jacek Banaszewski
- Department of Otolaryngology, Head and Neck Oncology, Poznan University of Medical Science, ul Przybyszewskiego 49, 60-355, Poznań, Poland
| | - Maciej J Wróbel
- Department of Otolaryngology and Laryngological Oncology, Collegium Medicum, Nicolaus Copernicus University in Torun, ul.Marie Sklodowskiej-Curie 9, 85-094, Bydgoszcz, Poland
| |
Collapse
|
27
|
Oliva AD, Pasick LJ, Hoffer ME, Rosow DE. Improving readability and comprehension levels of otolaryngology patient education materials using ChatGPT. Am J Otolaryngol 2024; 45:104502. [PMID: 39197330 DOI: 10.1016/j.amjoto.2024.104502] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/12/2024] [Accepted: 08/24/2024] [Indexed: 09/01/2024]
Abstract
OBJECTIVE A publicly available large language model platform may help determine current readability levels of otolaryngology patient education materials, as well as translate these materials to the recommended 6th-grade and 8th-grade reading levels. STUDY DESIGN Cross-sectional analysis. SETTING Online, using the large language model ChatGPT. METHODS The Patient Education pages of the American Laryngological Association (ALA) and American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) websites were accessed. Materials were input into ChatGPT (OpenAI, San Francisco, CA; version 3.5) and Microsoft Word (Microsoft, Redmond, WA; version 16.74). Programs calculated Flesch Reading Ease (FRE) scores, with higher scores indicating easier readability, and Flesch-Kincaid (FK) grade levels, estimating the U.S. grade level required to understand text. ChatGPT was prompted to "translate to a 5th-grade reading level" and provide new scores. Scores were compared for statistical differences, as well as differences between ChatGPT and Word gradings. RESULTS Patient education materials were reviewed and 37 ALA and 72 AAO-HNS topics were translated. Overall FRE scores and FK grades demonstrated significant improvements following translation of materials, as scored by ChatGPT (p < 0.001). Word also scored significant improvements in FRE and FK following translation by ChatGPT for AAO-HNS materials overall (p < 0.001) but not for individual topics or for subspecialty-specific categories. Compared with Word, ChatGPT significantly exaggerated the change in FRE scores and FK grades (p < 0.001). CONCLUSION Otolaryngology patient education materials were found to be written at higher reading levels than recommended. Artificial intelligence may prove to be a useful resource to simplify content and make it more accessible to patients.
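For orientation, the editor-added sketch below shows how the two readability metrics named above can be computed in Python with the textstat package; the example sentences are invented, and in practice the simplified text would come from prompting an LLM as described.

```python
# Sketch: Flesch Reading Ease and Flesch-Kincaid grade before and after
# simplification of a patient-education sentence (invented example text).
import textstat

original = ("Laryngopharyngeal reflux may manifest as chronic dysphonia, "
            "globus sensation, and excessive throat clearing.")
simplified = ("Acid coming up from your stomach can make your voice hoarse, "
              "make your throat feel like it has a lump, and make you clear "
              "your throat a lot.")

for label, text in [("original", original), ("simplified", simplified)]:
    fre = textstat.flesch_reading_ease(text)    # higher = easier to read
    fkg = textstat.flesch_kincaid_grade(text)   # approximate U.S. grade level
    print(f"{label}: FRE = {fre:.1f}, FK grade = {fkg:.1f}")
```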
Collapse
Affiliation(s)
- Allison D Oliva
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
| | - Luke J Pasick
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
| | - Michael E Hoffer
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America
| | - David E Rosow
- Department of Otolaryngology-Head and Neck Surgery, University of Miami Miller School of Medicine, United States of America.
| |
Collapse
|
28
|
Rothka AJ, Lorenz FJ, Hearn M, Meci A, LaBarge B, Walen SG, Slonimsky G, McGinn J, Chung T, Goyal N. Utilizing Artificial Intelligence to Increase the Readability of Patient Education Materials in Pediatric Otolaryngology. EAR, NOSE & THROAT JOURNAL 2024:1455613241289647. [PMID: 39467826 DOI: 10.1177/01455613241289647] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/30/2024] Open
Abstract
Objectives: To identify the reading levels of existing patient education materials in pediatric otolaryngology and to utilize natural language processing artificial intelligence (AI) to reduce the reading level of patient education materials. Methods: Patient education materials for pediatric conditions were identified from the American Academy of Otolaryngology-Head and Neck Surgery (AAO-HNS) website. Patient education materials about the same conditions, if available, were identified and selected from the websites of 7 children's hospitals. The readability of the patient materials was scored with the Flesch-Kincaid calculator before and after AI conversion. ChatGPT version 3.5 was used to convert the materials to a fifth-grade reading level. Results: On average, AAO-HNS pediatric education material was written at a 10.71 ± 0.71 grade level. After requesting the reduction of those materials to a fifth-grade reading level, ChatGPT converted the same materials to an average grade level of 7.9 ± 1.18 (P < .01). When comparing the published materials from AAO-HNS and the 7 institutions, the average grade level was 9.32 ± 1.82, and ChatGPT was able to reduce the average level to 7.68 ± 1.12 (P = .0598). Of the 7 children's hospitals, only 1 institution had an average grade level below the recommended sixth-grade level. Conclusions: Patient education materials in pediatric otolaryngology were consistently above recommended reading levels. In its current state, AI can reduce the reading levels of education materials. However, it could not reduce all materials below the recommended reading level.
Collapse
Affiliation(s)
| | - F Jeffrey Lorenz
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | | | - Andrew Meci
- Penn State College of Medicine, Hershey, PA, USA
| | - Brandon LaBarge
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | - Scott G Walen
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | - Guy Slonimsky
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | - Johnathan McGinn
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | - Thomas Chung
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| | - Neerav Goyal
- Penn State Health Department of Otolaryngology-Head and Neck Surgery, Hershey, PA, USA
| |
Collapse
|
29
|
Mete U. Evaluating the Performance of ChatGPT, Gemini, and Bing Compared with Resident Surgeons in the Otorhinolaryngology In-service Training Examination. Turk Arch Otorhinolaryngol 2024; 62:48-57. [PMID: 39463066 PMCID: PMC11572338 DOI: 10.4274/tao.2024.3.5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2024] [Accepted: 05/07/2024] [Indexed: 10/29/2024] Open
Abstract
Objective Large language models (LLMs) are used in various fields for their ability to produce human-like text. They are particularly useful in medical education, aiding clinical management skills and exam preparation for residents. This study aimed to evaluate and compare the performance of ChatGPT (GPT-4), Gemini, and Bing with each other and with otorhinolaryngology residents in answering in-service training exam questions, and to provide insights into the usefulness of these models in medical education and healthcare. Methods Eight otorhinolaryngology in-service training exams were used for comparison. A total of 316 questions were prepared from the Resident Training Textbook of the Turkish Society of Otorhinolaryngology Head and Neck Surgery. These questions were presented to the three artificial intelligence models. The exam results were evaluated to determine the accuracy of the models and the residents. Results GPT-4 achieved the highest accuracy among the LLMs at 54.75% (GPT-4 vs. Gemini p=0.002, GPT-4 vs. Bing p<0.001), followed by Gemini at 40.50% and Bing at 37.00% (Gemini vs. Bing p=0.327). However, senior residents outperformed all LLMs and other residents with an accuracy rate of 75.5% (p<0.001). The LLMs could only compete with junior residents. GPT-4 and Gemini performed similarly to juniors, whose accuracy level was 46.90% (p=0.058 and p=0.120, respectively). However, juniors still outperformed Bing (p=0.019). Conclusion The LLMs currently have limitations in achieving the same medical accuracy as senior and mid-level residents. However, they perform comparatively well in specific subspecialties, indicating potential usefulness in certain medical fields.
Collapse
Affiliation(s)
- Utku Mete
- Bursa Uludağ University Faculty of Medicine Department of Otorhinolaryngology, Bursa, Türkiye
| |
Collapse
|
30
|
Künzle P, Paris S. Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments. Clin Oral Investig 2024; 28:575. [PMID: 39373739 PMCID: PMC11458639 DOI: 10.1007/s00784-024-05968-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Accepted: 09/24/2024] [Indexed: 10/08/2024]
Abstract
OBJECTIVES The advent of artificial intelligence (AI) and large language model (LLM)-based AI applications (LLMAs) has tremendous implications for our society. This study analyzed the performance of LLMAs on solving restorative dentistry and endodontics (RDE) student assessment questions. MATERIALS AND METHODS A total of 151 questions from an RDE question pool were prepared for prompting using LLMAs from OpenAI (ChatGPT-3.5, -4.0, and -4.0o) and Google (Gemini 1.0). Multiple-choice questions were sorted into four question subcategories, entered into the LLMAs, and answers were recorded for analysis. P-value and chi-square statistical analyses were performed using Python 3.9.16. RESULTS The total answer accuracy of ChatGPT-4.0o was the highest, followed by ChatGPT-4.0, Gemini 1.0, and ChatGPT-3.5 (72%, 62%, 44%, and 25%, respectively), with significant differences between all LLMAs except the GPT-4.0 models. Performance was highest on the subcategories of direct restorations and caries, followed by indirect restorations and endodontics. CONCLUSIONS Overall, there are large performance differences among LLMAs. Only the ChatGPT-4 models achieved a success ratio that could be used with caution to support the dental academic curriculum. CLINICAL RELEVANCE While LLMAs could support clinicians in answering dental field-related questions, this capacity depends strongly on the employed model. The most performant model, ChatGPT-4.0o, achieved acceptable accuracy rates in some of the subject subcategories analyzed.
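Since the abstract notes that the chi-square analyses were run in Python, a minimal editor-added sketch of that kind of comparison is shown below; the counts are reconstructed from the reported percentages of 151 questions and are therefore approximate, and the paper's exact analysis may differ.

```python
# Sketch: chi-square test comparing correct/incorrect counts of two models
# (approximate counts back-calculated from ~72% and ~25% of 151 questions).
from scipy.stats import chi2_contingency

# rows: ChatGPT-4.0o, ChatGPT-3.5; columns: correct, incorrect
observed = [[109, 42],
            [ 38, 113]]
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```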
Collapse
Affiliation(s)
- Paul Künzle
- Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany.
| | - Sebastian Paris
- Department of Operative, Preventive and Pediatric Dentistry, Charité - Universitätsmedizin Berlin, Aßmannshauser Str. 4-6, Berlin, 14197, Germany
| |
Collapse
|
31
|
Wu Z, Gan W, Xue Z, Ni Z, Zheng X, Zhang Y. Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study. JMIR MEDICAL EDUCATION 2024; 10:e52746. [PMID: 39363539 PMCID: PMC11466054 DOI: 10.2196/52746] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 06/12/2024] [Accepted: 06/15/2024] [Indexed: 10/05/2024]
Abstract
Background The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT's performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) across various language inputs, to evaluate whether LLMs can be used as multilingual learning assistants for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between different language inputs were compared. Results The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. English accuracy was higher when compared with ChatGPT 3.5's Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making.
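As an editor-added illustration of one way the English-versus-Chinese accuracy contrast quoted above (133/150 vs. 119/150) could be tested, the sketch below runs a two-proportion z-test with statsmodels; the study's own statistical method may differ.

```python
# Sketch: two-proportion z-test on 133/150 vs. 119/150 correct answers.
from statsmodels.stats.proportion import proportions_ztest

counts = [133, 119]   # correct answers per language condition
nobs = [150, 150]     # questions per condition
z, p = proportions_ztest(counts, nobs)
print(f"z = {z:.2f}, p = {p:.3f}")   # p is roughly 0.03, in line with the abstract
```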
Collapse
Affiliation(s)
- Zelin Wu
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, Guangzhou, China
| | - Wenyi Gan
- Department of Joint Surgery and Sports Medicine, Zhuhai People’s Hospital, Zhuhai City, China
| | - Zhaowen Xue
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, Guangzhou, China
| | - Zhengxin Ni
- School of Nursing, Yangzhou University, Yangzhou, China
| | - Xiaofei Zheng
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, Guangzhou, China
| | - Yiyi Zhang
- Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, Guangzhou, China
| |
Collapse
|
32
|
Lechien JR. Generative AI and Otolaryngology-Head & Neck Surgery. Otolaryngol Clin North Am 2024; 57:753-765. [PMID: 38839556 DOI: 10.1016/j.otc.2024.04.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/07/2024]
Abstract
The increasing development of artificial intelligence (AI) generative models in otolaryngology-head and neck surgery will progressively change our practice. Practitioners and patients have access to AI resources, improving information, knowledge, and practice of patient care. This article summarizes the currently investigated applications of AI generative models, particularly Chatbot Generative Pre-trained Transformer, in otolaryngology-head and neck surgery.
Collapse
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France; Division of Laryngology and Broncho-esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium; Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Paris Saclay University, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris, France; Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium.
| |
Collapse
|
33
|
Patel J, Robinson P, Illing E, Anthony B. Is ChatGPT 3.5 smarter than Otolaryngology trainees? A comparison study of board style exam questions. PLoS One 2024; 19:e0306233. [PMID: 39325705 PMCID: PMC11426521 DOI: 10.1371/journal.pone.0306233] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2024] [Accepted: 09/01/2024] [Indexed: 09/28/2024] Open
Abstract
OBJECTIVES This study compares the performance of the artificial intelligence (AI) platform Chat Generative Pre-Trained Transformer (ChatGPT) with that of Otolaryngology trainees on board-style exam questions. METHODS We administered a set of 30 Otolaryngology board-style questions to medical students (MS) and Otolaryngology residents (OR); 31 MSs and 17 ORs completed the questionnaire. The same test was administered to ChatGPT version 3.5 five times. Comparisons of performance were made using a one-way ANOVA with Tukey post hoc test, along with a regression analysis to explore the relationship between education level and performance. RESULTS The average scores increased each year from MS1 to PGY5. A one-way ANOVA revealed that ChatGPT outperformed trainee years MS1, MS2, and MS3 (p < 0.001, 0.003, and 0.019, respectively). PGY4 and PGY5 otolaryngology residents outperformed ChatGPT (p = 0.033 and 0.002, respectively). For years MS4, PGY1, PGY2, and PGY3, there was no statistically significant difference between trainee scores and ChatGPT (p = .104, .996, and 1.000, respectively). CONCLUSION ChatGPT can outperform lower-level medical trainees on Otolaryngology board-style exam questions but still cannot outperform higher-level trainees. These questions primarily test rote memorization of medical facts; in contrast, the art of practicing medicine is predicated on the synthesis of complex presentations of disease and multilayered application of knowledge of the healing process. Given that upper-level trainees outperform ChatGPT, it is unlikely that ChatGPT, in its current form, will provide significant clinical utility over an Otolaryngologist.
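The methods name a one-way ANOVA with Tukey post hoc comparisons across training levels and ChatGPT. Below is a minimal sketch of that kind of analysis; the per-group score vectors are invented placeholders, not the study's data.

```python
# Minimal sketch of a one-way ANOVA with Tukey HSD post hoc test, the analysis
# named in the methods. The per-group scores below are illustrative only.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = {
    "MS1": [40, 43, 50, 47, 45],        # hypothetical percent-correct scores
    "PGY5": [80, 83, 87, 85, 90],
    "ChatGPT": [60, 63, 67, 63, 70],    # five repeated runs, as in the methods
}

f_stat, p_value = stats.f_oneway(*scores.values())
print(f"one-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")

values = np.concatenate([np.asarray(v, dtype=float) for v in scores.values()])
groups = np.concatenate([[name] * len(v) for name, v in scores.items()])
print(pairwise_tukeyhsd(values, groups))  # pairwise group differences
```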
Collapse
Affiliation(s)
- Jaimin Patel
- Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America
| | - Peyton Robinson
- Indiana University School of Medicine, Indianapolis, IN, United States of America
| | - Elisa Illing
- Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America
| | - Benjamin Anthony
- Department of Otolaryngology-Head and Neck Surgery, Indiana University School of Medicine, Indianapolis, IN, United States of America
| |
Collapse
|
34
|
Bellamkonda N, Farlow JL, Haring CT, Sim MW, Seim NB, Cannon RB, Monroe MM, Agrawal A, Rocco JW, McCrary HC. Evaluating the Accuracy of ChatGPT in Common Patient Questions Regarding HPV+ Oropharyngeal Carcinoma. Ann Otol Rhinol Laryngol 2024; 133:814-819. [PMID: 39075853 DOI: 10.1177/00034894241259137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/31/2024]
Abstract
OBJECTIVES Large language model (LLM)-based chatbots such as ChatGPT have been publicly available and increasingly utilized by the general public since late 2022. This study sought to investigate ChatGPT responses to common patient questions regarding Human Papilloma Virus (HPV) positive oropharyngeal cancer (OPC). METHODS This was a prospective, multi-institutional study, with data collected from high-volume institutions that perform >50 transoral robotic surgery cases per year. The 100 most recent discussion threads including the term "HPV" on the American Cancer Society's Cancer Survivors Network's Head and Neck Cancer public discussion board were reviewed. The 11 most common questions were serially queried to ChatGPT 3.5; answers were recorded. A survey was distributed to fellowship-trained head and neck oncologic surgeons at 3 institutions to evaluate the responses. RESULTS A total of 8 surgeons participated in the study. For questions regarding HPV contraction and transmission, ChatGPT answers were scored as clinically accurate and aligned with consensus in the head and neck surgical oncology community 84.4% and 90.6% of the time, respectively. For questions involving treatment of HPV+ OPC, ChatGPT was clinically accurate and aligned with consensus 87.5% and 91.7% of the time, respectively. For questions regarding the HPV vaccine, ChatGPT was clinically accurate and aligned with consensus 62.5% and 75% of the time, respectively. When asked about circulating tumor DNA testing, only 12.5% of surgeons thought the responses were accurate or consistent with consensus. CONCLUSION ChatGPT 3.5 performed poorly on questions involving evolving therapies and diagnostics; thus, caution should be used when relying on a platform like ChatGPT 3.5 for questions about advanced technology. Patients should be counseled on the importance of consulting their surgeons to receive accurate and up-to-date recommendations, and to use LLMs to augment their understanding of these important health-related topics.
Collapse
Affiliation(s)
- Nikhil Bellamkonda
- Department of Otolaryngology-Head and Neck Surgery, University of Utah, Salt Lake City, UT, USA
| | - Janice L Farlow
- Department of Otolaryngology-Head and Neck Surgery, Indiana University, Indianapolis, IN, USA
| | - Catherine T Haring
- Department of Otolaryngology-Head and Neck Surgery, The Ohio State University Wexner Medical Center, Columbus, OH, USA
| | - Michael W Sim
- Department of Otolaryngology-Head and Neck Surgery, Indiana University, Indianapolis, IN, USA
| | - Nolan B Seim
- Department of Otolaryngology-Head and Neck Surgery, The Ohio State University Wexner Medical Center, Columbus, OH, USA
| | - Richard B Cannon
- Department of Otolaryngology-Head and Neck Surgery, University of Utah, Salt Lake City, UT, USA
| | - Marcus M Monroe
- Department of Otolaryngology-Head and Neck Surgery, University of Utah, Salt Lake City, UT, USA
| | - Amit Agrawal
- Department of Otolaryngology-Head and Neck Surgery, The Ohio State University Wexner Medical Center, Columbus, OH, USA
| | - James W Rocco
- Department of Otolaryngology-Head and Neck Surgery, The Ohio State University Wexner Medical Center, Columbus, OH, USA
| | - Hilary C McCrary
- Department of Otolaryngology-Head and Neck Surgery, University of Utah, Salt Lake City, UT, USA
| |
Collapse
|
35
|
Qin S, Chislett B, Ischia J, Ranasinghe W, de Silva D, Coles‐Black J, Woon D, Bolton D. ChatGPT and generative AI in urology and surgery-A narrative review. BJUI COMPASS 2024; 5:813-821. [PMID: 39323919 PMCID: PMC11420103 DOI: 10.1002/bco2.390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2023] [Revised: 04/27/2024] [Accepted: 05/12/2024] [Indexed: 09/27/2024] Open
Abstract
Introduction ChatGPT (generative pre-trained transformer [GPT]), developed by OpenAI, is a type of generative artificial intelligence (AI) that has been widely utilised since its public release. It orchestrates an advanced conversational intelligence, producing sophisticated responses to questions. ChatGPT has been successfully demonstrated across several applications in healthcare, including patient management, academic research and clinical trials. We aim to evaluate the different ways ChatGPT has been utilised in urology and more broadly in surgery. Methods We conducted a literature search of the PubMed and Embase electronic databases for the purpose of writing a narrative review and identified relevant articles on ChatGPT in surgery from the years 2000 to 2023. A PRISMA flow chart was created to highlight the article selection process. The search terms 'ChatGPT' and 'surgery' were intentionally kept broad given the nascency of the field. Studies unrelated to these terms were excluded and duplicates were removed. Results Multiple papers have been published about novel uses of ChatGPT in surgery, ranging from assisting in administrative tasks (answering frequently asked questions, surgical consent, writing operation reports, discharge summaries, grants and journal article drafts, and reviewing journal articles) to medical education. AI and machine learning have also been extensively researched in surgery with respect to patient diagnosis and prediction of outcomes. There are also several limitations of the software, including artificial hallucination, bias, out-of-date information and patient confidentiality concerns. Conclusion The potential of ChatGPT and related generative AI models is vast, heralding the beginning of a new era in which AI may eventually become integrated seamlessly into surgical practice. Concerns with this new technology must not be disregarded in the urge to hasten progress, and potential risks impacting patients' interests must be considered. Appropriate regulation and governance of this technology will be key to optimising the benefits and addressing the intricate challenges of healthcare delivery and equity.
Collapse
Affiliation(s)
- Shane Qin
- Department of Urology, Austin Health, Heidelberg, Victoria, Australia
| | - Bodie Chislett
- Department of Urology, Austin Health, Heidelberg, Victoria, Australia
| | - Joseph Ischia
- Department of Urology, Austin Health, Heidelberg, Victoria, Australia
- Department of Surgery, University of Melbourne, Austin Health, Melbourne, Victoria, Australia
| | - Weranja Ranasinghe
- Department of Anatomy and Developmental Biology, Monash University, Melbourne, Victoria, Australia
- Department of Urology, Monash Health, Melbourne, Victoria, Australia
| | - Daswin de Silva
- Research Centre for Data Analytics and Cognition, La Trobe University, Melbourne, Victoria, Australia
| | | | - Dixon Woon
- Department of Urology, Austin Health, Heidelberg, Victoria, Australia
- Department of Surgery, University of Melbourne, Austin Health, Melbourne, Victoria, Australia
| | - Damien Bolton
- Department of Urology, Austin Health, Heidelberg, Victoria, Australia
- Department of Surgery, University of Melbourne, Austin Health, Melbourne, Victoria, Australia
| |
Collapse
|
36
|
Hubany SS, Scala FD, Hashemi K, Kapoor S, Fedorova JR, Vaccaro MJ, Ridout RP, Hedman CC, Kellogg BC, Leto Barone AA. ChatGPT-4 Surpasses Residents: A Study of Artificial Intelligence Competency in Plastic Surgery In-service Examinations and Its Advancements from ChatGPT-3.5. PLASTIC AND RECONSTRUCTIVE SURGERY-GLOBAL OPEN 2024; 12:e6136. [PMID: 39239234 PMCID: PMC11377087 DOI: 10.1097/gox.0000000000006136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/08/2024] [Accepted: 07/09/2024] [Indexed: 09/07/2024]
Abstract
Background ChatGPT, launched in 2022 and updated to Generative Pre-trained Transformer 4 (GPT-4) in 2023, is a large language model trained on extensive data, including medical information. This study compares ChatGPT's performance on Plastic Surgery In-Service Examinations with that of medical residents nationally, as well as with its earlier version, ChatGPT-3.5. Methods This study reviewed 1500 questions from the Plastic Surgery In-Service Examinations from 2018 to 2023. After excluding image-based, unscored, and inconclusive questions, 1292 were analyzed. The question stem and each multiple-choice answer were input verbatim into ChatGPT-4. Results ChatGPT-4 correctly answered 961 (74.4%) of the included questions. Performance by section was best in core surgical principles (79.1% correct) and lowest in craniomaxillofacial (69.1%). ChatGPT-4 ranked between the 61st and 97th percentiles compared with all residents. ChatGPT-4 also significantly outperformed ChatGPT-3.5 on the 2018-2022 examinations (P < 0.001): ChatGPT-3.5 averaged 55.5% correct, whereas ChatGPT-4 averaged 74%, a mean difference of 18.54 percentage points. In 2021, ChatGPT-3.5 ranked in the 23rd percentile of all residents, whereas ChatGPT-4 ranked in the 97th percentile. ChatGPT-4 outperformed 80.7% of residents on average and scored above the 97th percentile among first-year residents. Its performance was comparable with that of sixth-year integrated residents, ranking in the 55.7th percentile on average. These results show significant improvements in ChatGPT-4's application of medical knowledge within six months of ChatGPT-3.5's release. Conclusion This study reveals ChatGPT-4's rapid development, advancing from a first-year medical resident's level to surpassing independent residents and matching a sixth-year resident's proficiency.
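The results above are expressed as percentile ranks within the resident score distribution. The snippet below illustrates that kind of percentile placement; the resident score distribution is simulated and only the 74% average comes from the abstract, so this is not the examination's actual norming procedure.

```python
# Illustrative sketch of percentile placement: where a model's percent-correct
# score falls within a cohort of resident scores. The resident distribution is
# simulated; only the 74% figure is taken from the abstract.
import numpy as np
from scipy.stats import percentileofscore

rng = np.random.default_rng(1)
resident_scores = rng.normal(loc=62, scale=9, size=500)  # hypothetical cohort
model_score = 74.0                                       # ChatGPT-4 average

pct = percentileofscore(resident_scores, model_score, kind="rank")
print(f"the model's score falls at the {pct:.0f}th percentile of this cohort")
```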
Collapse
Affiliation(s)
- Shannon S Hubany
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Fernanda D Scala
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Kiana Hashemi
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Saumya Kapoor
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Julia R Fedorova
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Matthew J Vaccaro
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Rees P Ridout
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Casey C Hedman
- From the University of Central Florida College of Medicine, Orlando, Fla
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Brian C Kellogg
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| | - Angelo A Leto Barone
- Division of Craniofacial and Pediatric Plastic Surgery, Nemours Children's Hospital, Orlando, Fla
| |
Collapse
|
37
|
Alami K, Willemse E, Quiriny M, Lipski S, Laurent C, Donquier V, Digonnet A. Evaluation of ChatGPT-4's Performance in Therapeutic Decision-Making During Multidisciplinary Oncology Meetings for Head and Neck Squamous Cell Carcinoma. Cureus 2024; 16:e68808. [PMID: 39376890 PMCID: PMC11456411 DOI: 10.7759/cureus.68808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/04/2024] [Indexed: 10/09/2024] Open
Abstract
Objectives First reports suggest that artificial intelligence (AI) models such as ChatGPT-4 (OpenAI, San Francisco, USA) might represent reliable tools for therapeutic decision-making in some medical conditions. This study aims to assess the decisional capacity of ChatGPT-4 in patients with head and neck carcinomas, using the multidisciplinary oncology meeting (MOM) and the National Comprehensive Cancer Network (NCCN) decisions as references. Methods This retrospective study included 263 patients with squamous cell carcinoma of the oral cavity, oropharynx, hypopharynx, or larynx who were followed at our institution between January 1, 2016, and December 31, 2021. The recommendation of ChatGPT-4 for first- and second-line treatments was compared with the MOM decision and the NCCN guidelines. Degrees of agreement were calculated using the kappa statistic, which measures agreement between two evaluators. Results ChatGPT-4 demonstrated moderate agreement with the MOM decisions for first-line treatment recommendations (kappa = 0.48) and substantial agreement for second-line treatment recommendations (kappa = 0.78). Substantial agreement with the NCCN guidelines was observed for both first- and second-line treatments (kappa = 0.72 and 0.66, respectively). The degree of agreement decreased when the decision included gastrostomy, patients over 70, or those with comorbidities. Conclusions The study illustrates that while ChatGPT-4 can significantly support clinical decision-making in oncology by aligning closely with expert recommendations and established guidelines, ongoing enhancement and training are crucial. The findings advocate for the continued evolution of AI tools to better handle the nuanced aspects of patient health profiles, thus broadening their applicability and reliability in clinical practice.
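Agreement in this study is summarized with the kappa statistic. The short sketch below shows how such a kappa is computed between the model's recommendation and the tumour board decision; the treatment labels are invented placeholders, not study data.

```python
# Hedged sketch: Cohen's kappa between the model's treatment recommendation
# and the multidisciplinary oncology meeting (MOM) decision.
# The label lists below are invented placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

mom_decision = ["surgery", "chemoradiation", "surgery", "radiotherapy", "surgery"]
chatgpt_reco = ["surgery", "chemoradiation", "radiotherapy", "radiotherapy", "surgery"]

kappa = cohen_kappa_score(mom_decision, chatgpt_reco)
print(f"Cohen's kappa = {kappa:.2f}")  # 0.41-0.60 moderate, 0.61-0.80 substantial
```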
Collapse
Affiliation(s)
- Kenza Alami
- Otolaryngology, Jules Bordet Institute, Bruxelles, BEL
| | | | - Marie Quiriny
- Surgical Oncology, Jules Bordet Institute, Bruxelles, BEL
| | - Samuel Lipski
- Surgical Oncology, Jules Bordet Institute, Bruxelles, BEL
| | - Celine Laurent
- Otolaryngology - Head and Neck Surgery, Hôpital Ambroise-Paré, Mons, BEL
- Otolaryngology - Head and Neck Surgery, Hôpital Universitaire de Bruxelles (HUB) Erasme Hospital, Bruxelles, BEL
| | | | | |
Collapse
|
38
|
Lechien JR, Rameau A. Applications of ChatGPT in Otolaryngology-Head Neck Surgery: A State of the Art Review. Otolaryngol Head Neck Surg 2024; 171:667-677. [PMID: 38716790 DOI: 10.1002/ohn.807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 04/01/2024] [Accepted: 04/19/2024] [Indexed: 08/28/2024]
Abstract
OBJECTIVE To review the current literature on the application, accuracy, and performance of Chatbot Generative Pre-Trained Transformer (ChatGPT) in Otolaryngology-Head and Neck Surgery. DATA SOURCES PubMed, Cochrane Library, and Scopus. REVIEW METHODS A comprehensive review of the literature on the applications of ChatGPT in otolaryngology was conducted according to the Preferred Reporting Items for Systematic Reviews and Meta-analyses statement. CONCLUSIONS ChatGPT provides imperfect patient information or general knowledge related to diseases encountered in Otolaryngology-Head and Neck Surgery. In clinical practice, despite suboptimal performance, studies reported that the model is more accurate in providing diagnoses than in suggesting the most adequate additional examinations and treatments for clinical vignettes or real clinical cases. ChatGPT has been used as an adjunct tool to improve scientific reports (referencing, spelling correction), to elaborate study protocols, or to take student or resident exams, with varying levels of accuracy reported. The stability of ChatGPT responses across repeated questions appeared high, but many studies reported hallucination events, particularly when providing scientific references. IMPLICATIONS FOR PRACTICE To date, most applications of ChatGPT are limited to generating disease or treatment information and to supporting the management of clinical cases. The lack of comparison of ChatGPT's performance with that of other large language models is the main limitation of the current research. Its ability to analyze clinical images has not yet been investigated in otolaryngology, although images of the upper airway or ear are an important step in the diagnosis of most common ear, nose, and throat conditions. This review may help otolaryngologists conceive new applications for further research.
Collapse
Affiliation(s)
- Jérôme R Lechien
- Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France
- Division of Laryngology and Broncho-Esophagology, Department of Otolaryngology-Head Neck Surgery, EpiCURA Hospital, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), Mons, Belgium
- Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Paris Saclay University, Paris, France
- Department of Otorhinolaryngology and Head and Neck Surgery, CHU Saint-Pierre, Brussels, Belgium
| | - Anais Rameau
- Department of Otolaryngology-Head and Neck Surgery, Sean Parker Institute for the Voice, Weill Cornell Medicine, New York City, New York, USA
| |
Collapse
|
39
|
Mayo-Yáñez M, Lechien JR, Maria-Saibene A, Vaira LA, Maniaci A, Chiesa-Estomba CM. Examining the Performance of ChatGPT 3.5 and Microsoft Copilot in Otolaryngology: A Comparative Study with Otolaryngologists' Evaluation. Indian J Otolaryngol Head Neck Surg 2024; 76:3465-3469. [PMID: 39130248 PMCID: PMC11306834 DOI: 10.1007/s12070-024-04729-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Accepted: 04/20/2024] [Indexed: 08/13/2024] Open
Abstract
To evaluate the response capabilities of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) on a public healthcare system otolaryngology job competition examination, using the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions, divided into theoretical and practical parts, were input into ChatGPT 3.5 and the internet-connected GPT-4. The accuracy of the AI responses was compared with the official results of the otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5, scoring 88.5 points versus 60 points. The two AIs differed in the questions they answered incorrectly. Despite ChatGPT's proficiency, Copilot displayed superior performance, achieving the second-best score among the 108 otolaryngologists who took the exam, whereas ChatGPT placed 83rd. A chatbot powered by GPT-4 with internet access (Copilot) demonstrates superior performance in answering multiple-choice medical questions compared with ChatGPT 3.5.
Collapse
Affiliation(s)
- Miguel Mayo-Yáñez
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Otorhinolaryngology – Head and Neck Surgery Department, Complexo Hospitalario Universitario A Coruña (CHUAC), 15006 A Coruña, Galicia, Spain
- Otorhinolaryngology—Head and Neck Surgery Department, Hospital San Rafael (HSR) de A Coruña, 15006 A Coruña, Spain
- Otorhinolaryngology Research Group, Institute of Biomedical Research of A Coruña, (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), Universidade da Coruña (UDC), 15006 A Coruña, Spain
| | - Jerome R. Lechien
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Department of Otolaryngology, Polyclinique de Poitiers, Elsan Hospital, 86000 Poitiers, France
- Department of Otolaryngology—Head & Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), 91190 Paris, France
- Department of Human Anatomy and Experimental Oncology, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), 7000 Mons, Belgium
- Department of Otolaryngology—Head & Neck Surgery, CHU Saint-Pierre (CHU de Bruxelles), 1000 Brussels, Belgium
| | - Alberto Maria-Saibene
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
| | - Luigi A. Vaira
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, 07100 Sassari, Italy
| | - Antonino Maniaci
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Faculty of Medicine and Surgery, University of Enna “Kore”, 94100 Enna, Italy
| | - Carlos M. Chiesa-Estomba
- Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France
- Otorhinolaryngology—Head and Neck Surgery Department, Hospital Universitario Donostia—Biodonostia Research Institute, 20014 Donostia, Spain
| |
Collapse
|
40
|
Knoedler L, Vogt A, Alfertshofer M, Camacho JM, Najafali D, Kehrer A, Prantl L, Iske J, Dean J, Hoefer S, Knoedler C, Knoedler S. The law code of ChatGPT and artificial intelligence-how to shield plastic surgeons and reconstructive surgeons against Justitia's sword. Front Surg 2024; 11:1390684. [PMID: 39132668 PMCID: PMC11312379 DOI: 10.3389/fsurg.2024.1390684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2024] [Accepted: 07/02/2024] [Indexed: 08/13/2024] Open
Abstract
Large Language Models (LLMs) like ChatGPT 4 (OpenAI), Claude 2 (Anthropic), and Llama 2 (Meta AI) have emerged as novel technologies for integrating artificial intelligence (AI) into everyday work. LLMs in particular, and AI in general, carry enormous potential to streamline clinical workflows, outsource resource-intensive tasks, and disburden the healthcare system. While a plethora of trials is elucidating the untapped capabilities of this technology, the sheer pace of scientific progress also takes its toll. Legal guidelines hold a key role in regulating upcoming technologies, safeguarding patients, and determining individual and institutional liabilities. To date, there is a paucity of research delineating the legal regulation of language models and AI in clinical scenarios in plastic and reconstructive surgery (PRS). This knowledge gap poses the risk of lawsuits and penalties against plastic surgeons. We therefore aim to provide the first overview of legal guidelines and pitfalls of LLMs and AI for plastic surgeons. Our analysis encompasses models like ChatGPT, Claude 2, and Llama 2, among others, regardless of their closed- or open-source nature. Ultimately, this line of research may help clarify the legal responsibilities of plastic surgeons and seamlessly integrate such cutting-edge technologies into the field of PRS.
Collapse
Affiliation(s)
- Leonard Knoedler
- Department of Plastic, Hand, and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany
| | - Alexander Vogt
- Corporate/M&A Department, Dentons Europe (Germany) GmbH & Co. KG, Munich, Germany
- UC Law San Francisco (Formerly UC Hastings), San Francisco, CA, United States
| | - Michael Alfertshofer
- Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians-University Munich, Munich, Germany
| | - Justin M. Camacho
- College of Medicine, Drexel University, Philadelphia, PA, United States
| | - Daniel Najafali
- Carle Illinois College of Medicine, University of Illinois Urbana-Champaign, Urbana, IL, United States
| | - Andreas Kehrer
- Department of Plastic, Hand, and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany
| | - Lukas Prantl
- Department of Plastic, Hand, and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany
| | - Jasper Iske
- Department of Cardiothoracic and Vascular Surgery, Deutsches Herzzentrum der Charité (DHZC), Berlin, Germany
| | - Jillian Dean
- School of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
| | | | - Christoph Knoedler
- Faculty of Applied Social and Health Sciences, Regensburg University of Applied Sciences, Regensburg, Germany
| | - Samuel Knoedler
- Department of Plastic, Hand, and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany
| |
Collapse
|
41
|
Lotto C, Sheppard SC, Anschuetz W, Stricker D, Molinari G, Huwendiek S, Anschuetz L. ChatGPT Generated Otorhinolaryngology Multiple-Choice Questions: Quality, Psychometric Properties, and Suitability for Assessments. OTO Open 2024; 8:e70018. [PMID: 39328276 PMCID: PMC11424880 DOI: 10.1002/oto2.70018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2024] [Revised: 08/23/2024] [Accepted: 09/07/2024] [Indexed: 09/28/2024] Open
Abstract
Objective To explore Chat Generative Pretrained Transformer's (ChatGPT's) capability to create multiple-choice questions about otorhinolaryngology (ORL). Study Design Experimental question generation and exam simulation. Setting Tertiary academic center. Methods ChatGPT 3.5 was prompted: "Can you please create a challenging 20-question multiple-choice questionnaire about clinical cases in otolaryngology, offering five answer options?" The generated questionnaire was sent to medical students, residents, and consultants. The questions were assessed against standard quality criteria. Answers were anonymized, and the resulting data were analyzed in terms of difficulty and internal consistency. Results ChatGPT 3.5 generated 20 exam questions, of which 1 was considered off-topic, 3 had a false answer, and 3 had multiple correct answers. Subspecialty distribution was as follows: 5 questions on otology, 5 on rhinology, and 10 on head and neck. Focus and relevance were good, while vignette and distractor quality were low. The level of difficulty was suitable for undergraduate medical students (n = 24) but too easy for residents (n = 30) or consultants (n = 10) in ORL. Cronbach's α was highest (0.69) for a subset of 15 selected questions, based on the students' results. Conclusion ChatGPT 3.5 is able to generate grammatically correct, simple ORL multiple-choice questions at a medical student level. However, the overall quality of the questions was average, requiring thorough review and revision by a medical expert to ensure suitability for future exams.
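Internal consistency here is reported as Cronbach's α over the retained items. Below is a minimal sketch of that computation from an examinee-by-item correct/incorrect matrix; the matrix is random placeholder data, so the printed α will be near zero rather than the 0.69 reported.

```python
# Minimal sketch of Cronbach's alpha from an examinee-by-item 0/1 matrix.
# The response matrix is random placeholder data (alpha near zero); real exam
# responses are correlated across items and give higher values.
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: shape (n_examinees, n_items), e.g. 0/1 correctness."""
    item_scores = np.asarray(item_scores, dtype=float)
    n_items = item_scores.shape[1]
    sum_item_var = item_scores.var(axis=0, ddof=1).sum()  # sum of item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)       # variance of total scores
    return (n_items / (n_items - 1)) * (1.0 - sum_item_var / total_var)

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(24, 15))  # 24 students, 15 retained items
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```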
Collapse
Affiliation(s)
- Cecilia Lotto
- Department of Otorhinolaryngology, Head and Neck Surgery, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Department of Otolaryngology, Head and Neck Surgery, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Department of Medical and Surgical Sciences, Alma Mater Studiorum-University of Bologna, Bologna, Italy
| | - Sean C. Sheppard
- Department of Otorhinolaryngology, Head and Neck Surgery, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
| | - Wilma Anschuetz
- Institute for Medical Education, University of Bern, Bern, Switzerland
| | - Daniel Stricker
- Institute for Medical Education, University of Bern, Bern, Switzerland
| | - Giulia Molinari
- Department of Otolaryngology, Head and Neck Surgery, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Department of Medical and Surgical Sciences, Alma Mater Studiorum-University of Bologna, Bologna, Italy
| | - Sören Huwendiek
- Institute for Medical Education, University of Bern, Bern, Switzerland
| | - Lukas Anschuetz
- Department of Otorhinolaryngology, Head and Neck Surgery, Inselspital, Bern University Hospital, University of Bern, Bern, Switzerland
- Department of Otorhinolaryngology, Head and Neck Surgery, CHUV, University of Lausanne, Lausanne, Switzerland
- The Sense Innovation and Research Center, Lausanne, Switzerland
| |
Collapse
|
42
|
Terwilliger E, Bcharah G, Bcharah H, Bcharah E, Richardson C, Scheffler P. Advancing Medical Education: Performance of Generative Artificial Intelligence Models on Otolaryngology Board Preparation Questions With Image Analysis Insights. Cureus 2024; 16:e64204. [PMID: 39130878 PMCID: PMC11315421 DOI: 10.7759/cureus.64204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/09/2024] [Indexed: 08/13/2024] Open
Abstract
Objective To evaluate and compare the performance of Chat Generative Pre-Trained Transformer (ChatGPT), GPT-4, and Google Bard on United States otolaryngology board-style questions, in order to gauge their ability to act as an adjunctive study tool and resource for students and doctors. Methods A total of 1077 text-based questions and 60 image-based questions from the otolaryngology board exam preparation tool BoardVitals were input into ChatGPT, GPT-4, and Google Bard. Each question was scored as true or false depending on whether the artificial intelligence (AI) model provided the correct response. Data analysis was performed in RStudio. Results GPT-4 scored the highest at 78.7%, compared with ChatGPT and Bard at 55.3% and 61.7%, respectively (p<0.001). In terms of question difficulty, all three AI models performed best on easy questions (ChatGPT: 69.7%, GPT-4: 92.5%, and Bard: 76.4%) and worst on hard questions (ChatGPT: 42.3%, GPT-4: 61.3%, and Bard: 45.6%). Across all difficulty levels, GPT-4 did better than Bard and ChatGPT (p<0.0001). GPT-4 outperformed ChatGPT and Bard in all subspecialty sections, with significantly higher scores (p<0.05) on all sections except allergy (p>0.05). On image-based questions, GPT-4 performed better than Bard (56.7% vs 46.4%, p=0.368) and had better overall image interpretation capabilities. Conclusion This study showed that the GPT-4 model performed better than both ChatGPT and Bard on United States otolaryngology board practice questions. Although the GPT-4 results were promising, AI should still be used with caution when implemented in medical education or patient care settings.
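The abstract compares the three models' overall percent-correct on the 1077 text questions but does not state the omnibus test used. One hedged way to sketch such a comparison is a chi-square test on reconstructed correct/incorrect counts; the counts below are approximations derived from the reported percentages, not the study's raw data.

```python
# Hedged sketch: chi-square comparison of correct-response proportions across
# the three models on the 1077 text questions. Counts are reconstructed
# approximately from the reported 55.3%, 78.7%, and 61.7%; the abstract does
# not state which omnibus test was actually used.
import numpy as np
from scipy.stats import chi2_contingency

n = 1077
rates = np.array([0.553, 0.787, 0.617])          # ChatGPT, GPT-4, Bard
correct = np.round(rates * n).astype(int)
table = np.column_stack([correct, n - correct])  # rows: models; cols: correct/incorrect

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```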
Collapse
Affiliation(s)
- Emma Terwilliger
- Otolaryngology, Mayo Clinic Alix School of Medicine, Scottsdale, USA
| | - George Bcharah
- Otolaryngology, Mayo Clinic Alix School of Medicine, Scottsdale, USA
| | - Hend Bcharah
- Otolaryngology, Andrew Taylor Still University School of Osteopathic Medicine, Mesa, USA
| | | | | | | |
Collapse
|
43
|
Tessler I, Wolfovitz A, Alon EE, Gecel NA, Livneh N, Zimlichman E, Klang E. ChatGPT's adherence to otolaryngology clinical practice guidelines. Eur Arch Otorhinolaryngol 2024; 281:3829-3834. [PMID: 38647684 DOI: 10.1007/s00405-024-08634-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 03/22/2024] [Indexed: 04/25/2024]
Abstract
OBJECTIVES Large language models, including ChatGPT, have the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT's performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. METHODS We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model's consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen's kappa was used to measure evaluator agreement, and Cronbach's alpha assessed the consistency of ChatGPT's responses. RESULTS The study revealed mixed results; 59.7% (43/72) of ChatGPT's responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model's responses were consistent for 17/24 questions (70.8%), with a Cronbach's alpha of 0.87, indicating reasonable consistency across tests. CONCLUSIONS Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.
Collapse
Affiliation(s)
- Idit Tessler
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel.
- School of Medicine, Tel Aviv University, Tel Aviv, Israel.
- ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel.
| | - Amit Wolfovitz
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eran E Alon
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Nir A Gecel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Nir Livneh
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eyal Zimlichman
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
- ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel
- The Sheba Talpiot Medical Leadership Program, Ramat Gan, Israel
- Hospital Management, Sheba Medical Center, Ramat Gan, Israel
| | - Eyal Klang
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, USA
| |
Collapse
|
44
|
Longwell JB, Hirsch I, Binder F, Gonzalez Conchas GA, Mau D, Jang R, Krishnan RG, Grant RC. Performance of Large Language Models on Medical Oncology Examination Questions. JAMA Netw Open 2024; 7:e2417641. [PMID: 38888919 PMCID: PMC11185976 DOI: 10.1001/jamanetworkopen.2024.17641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 04/18/2024] [Indexed: 06/20/2024] Open
Abstract
Importance Large language models (LLMs) recently developed an unprecedented ability to answer questions. Studies of LLMs from other fields may not generalize to medical oncology, a high-stakes clinical setting requiring rapid integration of new information. Objective To evaluate the accuracy and safety of LLM answers on medical oncology examination questions. Design, Setting, and Participants This cross-sectional study was conducted between May 28 and October 11, 2023. The American Society of Clinical Oncology (ASCO) Oncology Self-Assessment Series on ASCO Connection, the European Society of Medical Oncology (ESMO) Examination Trial questions, and an original set of board-style medical oncology multiple-choice questions were presented to 8 LLMs. Main Outcomes and Measures The primary outcome was the percentage of correct answers. Medical oncologists evaluated the explanations provided by the best LLM for accuracy, classified the types of errors, and estimated the likelihood and extent of potential clinical harm. Results Proprietary LLM 2 correctly answered 125 of 147 questions (85.0%; 95% CI, 78.2%-90.4%; P < .001 vs random answering). Proprietary LLM 2 outperformed an earlier version, proprietary LLM 1, which correctly answered 89 of 147 questions (60.5%; 95% CI, 52.2%-68.5%; P < .001), and the best open-source LLM, Mixtral-8x7B-v0.1, which correctly answered 87 of 147 questions (59.2%; 95% CI, 50.0%-66.4%; P < .001). The explanations provided by proprietary LLM 2 contained no or minor errors for 138 of 147 questions (93.9%; 95% CI, 88.7%-97.2%). Incorrect responses were most commonly associated with errors in information retrieval, particularly with recent publications, followed by erroneous reasoning and reading comprehension. If acted upon in clinical practice, 18 of 22 incorrect answers (81.8%; 95% CI, 59.7%-94.8%) would have a medium or high likelihood of moderate to severe harm. Conclusions and Relevance In this cross-sectional study of the performance of LLMs on medical oncology examination questions, the best LLM answered questions with remarkable performance, although errors raised safety concerns. These results demonstrated an opportunity to develop and evaluate LLMs to improve health care clinician experiences and patient care, considering the potential impact on capabilities and safety.
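The headline result (125 of 147 correct; 95% CI, 78.2%-90.4%; P < .001 versus random answering) can be approximated with an exact binomial test and a Clopper-Pearson interval. The chance level of 0.25 for four-option questions is an assumption for illustration, not something stated in the abstract.

```python
# Hedged sketch of the headline statistics: an exact binomial test against an
# assumed chance level of 0.25 (four-option MCQs) and a Clopper-Pearson 95% CI
# for the proportion correct. Counts (125/147) are taken from the abstract.
from scipy.stats import binomtest

result = binomtest(k=125, n=147, p=0.25, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95)

print(f"accuracy = {125 / 147:.1%}, p vs chance = {result.pvalue:.2e}")
print(f"95% CI: {ci.low:.1%} to {ci.high:.1%}")  # roughly 78% to 90%
```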
Collapse
Affiliation(s)
- Jack B. Longwell
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
| | - Ian Hirsch
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Fernando Binder
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | | | - Daniel Mau
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
| | - Raymond Jang
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
| | - Rahul G. Krishnan
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
| | - Robert C. Grant
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Institute of Medical Science, University of Toronto, Toronto, Ontario, Canada
- ICES, Toronto, Ontario, Canada
| |
Collapse
|
45
|
Dallari V, Liberale C, De Cecco F, Nocini R, Arietti V, Monzani D, Sacchetto L. The role of artificial intelligence in training ENT residents: a survey on ChatGPT, a new method of investigation. ACTA OTORHINOLARYNGOLOGICA ITALICA : ORGANO UFFICIALE DELLA SOCIETA ITALIANA DI OTORINOLARINGOLOGIA E CHIRURGIA CERVICO-FACCIALE 2024; 44:161-168. [PMID: 38712520 PMCID: PMC11166211 DOI: 10.14639/0392-100x-n2806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 01/02/2024] [Indexed: 05/08/2024]
Abstract
Objective The primary focus of this study was to analyze the adoption of ChatGPT among Ear, Nose, and Throat (ENT) trainees, encompassing its role in scientific research and personal study. We examined in which year ENT trainees become involved in clinical research and how many scientific investigations they have been engaged in. Methods An online survey was distributed to ENT residents employed in Italian University Hospitals. Results Out of 609 Italian ENT trainees, 181 (29.7%) responded to the survey. Among these, 67.4% were familiar with ChatGPT, and 18.9% of them used artificial intelligence as a tool for research and study. In all, 32.6% were not familiar with ChatGPT and its functions. Within our sample, there was an increasing trend of participation by ENT trainees in scientific publications throughout their training. Conclusions ChatGPT remains relatively unfamiliar and underutilised in Italy, even though it could be a valuable and efficient tool for ENT trainees, providing quick access for study and research through both personal computers and smartphones.
Collapse
Affiliation(s)
- Virginia Dallari
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
- Member of the Young Confederation of European ORL-HNS
| | - Carlotta Liberale
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
| | - Francesca De Cecco
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
| | - Riccardo Nocini
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
- Member of the Young Confederation of European ORL-HNS
| | - Valerio Arietti
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
| | - Daniele Monzani
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
| | - Luca Sacchetto
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Verona, Italy
| |
Collapse
|
46
|
Alfertshofer M, Hoch CC, Funk PF, Hollmann K, Wollenberg B, Knoedler S, Knoedler L. Sailing the Seven Seas: A Multinational Comparison of ChatGPT's Performance on Medical Licensing Examinations. Ann Biomed Eng 2024; 52:1542-1545. [PMID: 37553555 PMCID: PMC11082010 DOI: 10.1007/s10439-023-03338-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 07/28/2023] [Indexed: 08/10/2023]
Abstract
PURPOSE The use of AI-powered technology, particularly OpenAI's ChatGPT, holds significant potential to reshape healthcare and medical education. Despite existing studies on the performance of ChatGPT in medical licensing examinations across different nations, a comprehensive, multinational analysis using rigorous methodology is currently lacking. Our study sought to address this gap by evaluating the performance of ChatGPT on six different national medical licensing exams and investigating the relationship between test question length and ChatGPT's accuracy. METHODS We manually inputted a total of 1,800 test questions (300 each from the US, Italian, French, Spanish, UK, and Indian medical licensing examinations) into ChatGPT and recorded the accuracy of its responses. RESULTS We found significant variance in ChatGPT's test accuracy across countries, with the highest accuracy on the Italian examination (73% correct answers) and the lowest on the French examination (22% correct answers). Interestingly, question length correlated with ChatGPT's performance only in the Italian and French state examinations. In addition, the study revealed that questions requiring multiple correct answers, as seen in the French examination, posed a greater challenge to ChatGPT. CONCLUSION Our findings underscore the need for future research to further delineate ChatGPT's strengths and limitations in medical test-taking across additional countries and to develop guidelines to prevent AI-assisted cheating in medical examinations.
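The abstract reports a correlation between question length and ChatGPT's performance without naming the statistic. A point-biserial correlation between word count and a correctness indicator is one natural choice; the sketch below uses invented data purely for illustration.

```python
# Illustrative sketch of a length-vs-accuracy association: point-biserial
# correlation between question word count and a 0/1 correctness indicator.
# The data are invented placeholders; the abstract does not name the statistic.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(42)
question_length = rng.integers(20, 200, size=300)   # words per question
correct = (rng.random(300) < 0.7).astype(int)       # 1 = answered correctly

r, p = pointbiserialr(correct, question_length)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```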
Collapse
Affiliation(s)
- Michael Alfertshofer
- Division of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians University Munich, Ziemssenstrasse 5, 80336, Munich, Germany.
| | - Cosima C Hoch
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Paul F Funk
- Department of Otolaryngology, Head and Neck Surgery, University Hospital Jena, Friedrich Schiller University Jena, Am Klinikum 1, 07747, Jena, Germany
| | - Katharina Hollmann
- Department of Pathology, Massachusetts General Hospital, Harvard Medical School, 55 Fruit St, Boston, MA, 02114, USA
| | - Barbara Wollenberg
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675, Munich, Germany
| | - Samuel Knoedler
- Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Franz-Josef-Strauss-Allee 11, 93053, Regensburg, Germany
| | - Leonard Knoedler
- Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Franz-Josef-Strauss-Allee 11, 93053, Regensburg, Germany
| |
Collapse
|
47
|
Igarashi Y, Nakahara K, Norii T, Miyake N, Tagami T, Yokobori S. Performance of a Large Language Model on Japanese Emergency Medicine Board Certification Examinations. J NIPPON MED SCH 2024; 91:155-161. [PMID: 38432929 DOI: 10.1272/jnms.jnms.2024_91-205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/05/2024]
Abstract
BACKGROUND Emergency physicians need a broad range of knowledge and skills to address critical medical, traumatic, and environmental conditions. Artificial intelligence (AI), including large language models (LLMs), has potential applications in healthcare settings; however, the performance of LLMs in emergency medicine remains unclear. METHODS To evaluate the reliability of the information provided by ChatGPT, the LLM was given the questions set by the Japanese Association of Acute Medicine in its board certification examinations over a period of 5 years (2018-2022) and was prompted to answer each of them twice. Statistical analysis was used to assess agreement between the two sets of responses. RESULTS The LLM provided answers to 465 of the 475 text-based questions, achieving an overall correct response rate of 62.3%. For questions without images, the rate of correct answers was 65.9%. For questions with images that were not explained to the LLM, the rate of correct answers was only 52.0%. The annual rates of correct answers to questions without images ranged from 56.3% to 78.8%. Accuracy was better for scenario-based questions (69.1%) than for stand-alone questions (62.1%). Agreement between the two sets of responses was substantial (kappa = 0.70). Factual errors accounted for 82% of the incorrectly answered questions. CONCLUSION The LLM performed satisfactorily on a Japanese-language emergency medicine board certification examination when images were not involved. However, factual errors in the responses highlight the need for physician oversight when using LLMs.
Collapse
Affiliation(s)
- Yutaka Igarashi
- Department of Emergency and Critical Care Medicine, Nippon Medical School
| | - Kyoichi Nakahara
- Department of Emergency and Critical Care Medicine, Nippon Medical School
| | - Tatsuya Norii
- Department of Emergency Medicine, University of New Mexico, NM, United States of America
| | - Nodoka Miyake
- Department of Emergency and Critical Care Medicine, Nippon Medical School
| | - Takashi Tagami
- Department of Emergency and Critical Care Medicine, Nippon Medical School Musashi Kosugi Hospital
| | - Shoji Yokobori
- Department of Emergency and Critical Care Medicine, Nippon Medical School
| |
Collapse
|
48
|
Kochanek K, Skarzynski H, Jedrzejczak WW. Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing. Cureus 2024; 16:e59857. [PMID: 38854312 PMCID: PMC11157293 DOI: 10.7759/cureus.59857] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/07/2024] [Indexed: 06/11/2024] Open
Abstract
INTRODUCTION ChatGPT has been tested in many disciplines, but only a few studies have involved hearing diagnosis and none have addressed physiology or audiology more generally. The consistency of the chatbot's responses to the same question posed multiple times has not been well investigated either. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and 4 on test questions concerning objective measures of hearing. Of particular interest was the short-term repeatability of responses, which was tested here on four separate days spread over one week. METHODS We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of testing hearing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version) on each of four days (two days in one week and two days in the following week). The accuracy of the responses was evaluated against a response key. To evaluate the repeatability of the responses over time, percent agreement and Cohen's kappa were calculated. RESULTS The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. ChatGPT 3.5 consistently failed to reach the threshold of 50% correct responses. Within a single day, the percent agreement was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's kappa 0.67-0.71 and 0.81-0.84, respectively). The percent agreement between responses from different days was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4 (Cohen's kappa 0.65-0.69 and 0.80-0.85, respectively). CONCLUSION ChatGPT 4 outperforms ChatGPT 3.5 both in accuracy and in repeatability over time. However, the considerable variability of the responses casts doubt on possible professional applications of both versions.
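Repeatability in this study is quantified with percent agreement and Cohen's kappa between answer sets from different sessions. A small sketch of both metrics, using placeholder answer strings rather than the study's data:

```python
# Illustrative sketch of the repeatability metrics: percent agreement and
# Cohen's kappa between the answers given in two different sessions.
# The 30-item answer strings are placeholders, not the study's data.
from sklearn.metrics import cohen_kappa_score

day1 = list("ABCDABCDABCDABCDABCDABCDABCDAB")  # answers on day 1
day2 = list("ABCDABCDABCAABCDABCDABCDABCDBB")  # same questions, later session

percent_agreement = sum(a == b for a, b in zip(day1, day2)) / len(day1)
kappa = cohen_kappa_score(day1, day2)
print(f"percent agreement = {percent_agreement:.0%}, Cohen's kappa = {kappa:.2f}")
```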
Collapse
Affiliation(s)
- Krzysztof Kochanek
- Department of Experimental Audiology, Institute of Physiology and Pathology of Hearing, Warsaw, POL
| | - Henryk Skarzynski
- Otorhinolaryngosurgery Clinic, Institute of Physiology and Pathology of Hearing, Warsaw, POL
| | - Wiktor W Jedrzejczak
- Department of Experimental Audiology, Institute of Physiology and Pathology of Hearing, Warsaw, POL
| |
Collapse
|
49
|
Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol 2024; 281:2717-2721. [PMID: 38365990 DOI: 10.1007/s00405-024-08509-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 01/24/2024] [Indexed: 02/18/2024]
Abstract
PURPOSE With recent advances in artificial intelligence (AI), it has become crucial to thoroughly evaluate its applicability in healthcare. This study aimed to assess the accuracy of ChatGPT in diagnosing ear, nose, and throat (ENT) pathology and to compare its performance with that of medical experts. METHODS We conducted a cross-sectional comparative study in which 32 ENT cases were presented to ChatGPT 3.5, ENT physicians, ENT residents, family medicine (FM) specialists, second-year medical students (Med2), and third-year medical students (Med3). Each participant provided three differential diagnoses. The study analyzed diagnostic accuracy rates and inter-rater agreement within and between participant groups and ChatGPT. RESULTS The accuracy rate of ChatGPT was 70.8%, which was not significantly different from that of ENT physicians or ENT residents. However, a significant difference in correctness rate existed between ChatGPT and FM specialists (49.8%, p < 0.001), and between ChatGPT and medical students (Med2 47.5%, p < 0.001; Med3 47%, p < 0.001). Inter-rater agreement for the differential diagnosis between ChatGPT and each participant group was either poor or fair. In 68.75% of cases, ChatGPT failed to mention the most critical diagnosis. CONCLUSIONS ChatGPT demonstrated accuracy comparable to that of ENT physicians and ENT residents in diagnosing ENT pathology, outperforming FM specialists, Med2, and Med3. However, it showed limitations in identifying the most critical diagnosis.
Collapse
Affiliation(s)
- Mikhael Makhoul
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon.
| | - Antoine E Melkane
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Patrick El Khoury
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Christopher El Hadi
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| | - Nayla Matar
- Department of Otolaryngology-Head and Neck Surgery, Hotel Dieu de France Hospital, Saint Joseph University, Alfred Naccache Boulevard, Ashrafieh, PO Box: 166830, Beirut, Lebanon
| |
Collapse
|
50
|
Dallari V, Liberale C, De Cecco F, Monzani D. Can ChatGPT be a valuable study tool for ENT residents? Eur Ann Otorhinolaryngol Head Neck Dis 2024; 141:189-190. [PMID: 37993361 DOI: 10.1016/j.anorl.2023.10.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/25/2023] [Accepted: 10/08/2023] [Indexed: 11/24/2023]
Affiliation(s)
- V Dallari
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Piazzale L.A. Scuro 10, 37134 Verona, Italy; Young Confederation of European ORL-HNS
| | - C Liberale
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Piazzale L.A. Scuro 10, 37134 Verona, Italy
| | - F De Cecco
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Piazzale L.A. Scuro 10, 37134 Verona, Italy.
| | - D Monzani
- Unit of Otorhinolaryngology, Head & Neck Department, University of Verona, Piazzale L.A. Scuro 10, 37134 Verona, Italy
| |
Collapse
|