1. Tran M, Balasooriya C, Jonnagaddala J, Leung GKK, Mahboobani N, Ramani S, Rhee J, Schuwirth L, Najafzadeh-Tabrizi NS, Semmler C, Wong ZS. Situating governance and regulatory concerns for generative artificial intelligence and large language models in medical education. NPJ Digit Med 2025;8:315. PMID: 40425695; PMCID: PMC12116760; DOI: 10.1038/s41746-025-01721-z.
Abstract
Generative artificial intelligence (GenAI) and large language models offer potential gains in educational efficiency and the personalisation of learning. These gains must be balanced against considerations of the learning process, authentic assessment, and academic integrity. A pedagogical approach helps situate these concerns and informs various types of governance and regulatory approaches. In this review, we identify current and emerging issues regarding GenAI in medical education, including pedagogical considerations, emerging roles, and trustworthiness. Potential measures to address specific regulatory concerns are also explored.
Affiliation(s)
- Michael Tran: University of New South Wales, Kensington, NSW, Australia
- Neeraj Mahboobani: Department of Imaging and Interventional Radiology, Faculty of Medicine, The Chinese University of Hong Kong (CUHK), Hong Kong, PR China
- Subha Ramani: Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Joel Rhee: University of New South Wales, Kensington, NSW, Australia
- Zoie Sy Wong: University of New South Wales, Kensington, NSW, Australia; The University of Hong Kong, Hong Kong, PR China; St Luke's International University, Chuo, Japan
2. Buhl LK. The answer may vary: large language model response patterns challenge their use in test item analysis. Medical Teacher 2025:1-6. PMID: 40319392; DOI: 10.1080/0142159x.2025.2497891.
Abstract
INTRODUCTION The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population. METHODS Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows. RESULTS Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28-0.29), while the two highest scoring models showed no correlation. No LLM predicted the point biserial indices. DISCUSSION These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM's response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.
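The item statistics named above are straightforward to compute. The sketch below is a minimal illustration, not the study's own code: it derives difficulty indices, point-biserial indices, and a Spearman comparison between two sets of difficulty indices from a hypothetical 0/1 response matrix.

```python
# Minimal sketch (not the authors' pipeline): difficulty and point-biserial indices
# for a 0/1 response matrix, plus a Spearman comparison of two difficulty vectors.
import numpy as np
from scipy.stats import pointbiserialr, spearmanr

rng = np.random.default_rng(0)
# Hypothetical data: rows = examinees (or LLM trials), columns = items, 1 = correct.
responses = rng.integers(0, 2, size=(100, 60))

total_scores = responses.sum(axis=1)
difficulty = responses.mean(axis=0)          # proportion correct per item

# Point-biserial: correlation of each item's 0/1 column with the total test score.
point_biserial = np.array([
    pointbiserialr(responses[:, i], total_scores)[0] for i in range(responses.shape[1])
])

# Compare item difficulty derived from an LLM with that derived from fellows.
fellow_difficulty = rng.uniform(0.2, 0.9, size=60)   # placeholder values
rho, p = spearmanr(difficulty, fellow_difficulty)
print(f"mean difficulty={difficulty.mean():.2f}, Spearman rho={rho:.2f} (p={p:.3f})")
```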
Affiliation(s)
- Lauren K Buhl: Department of Anesthesiology, Dartmouth Hitchcock Medical Center, Lebanon, NH, USA
3. Tseng LW, Lu YC, Tseng LC, Chen YC, Chen HY. Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study. JMIR Medical Education 2025;11:e58897. PMID: 40106227; PMCID: PMC11939018; DOI: 10.2196/58897.
Abstract
Background The integration of artificial intelligence (AI), notably ChatGPT, into medical education has shown promising results in various medical fields. Nevertheless, its efficacy in traditional Chinese medicine (TCM) examinations remains understudied. Objective This study aims to (1) assess the performance of ChatGPT on the TCM licensing examination in Taiwan and (2) evaluate the model's explainability in answering TCM-related questions to determine its suitability as a TCM learning tool. Methods We used the GPT-4 model to respond to 480 questions from the 2022 TCM licensing examination. This study compared the performance of the model against that of licensed TCM doctors using 2 approaches, namely direct answer selection and provision of explanations before answer selection. The accuracy and consistency of AI-generated responses were analyzed. Moreover, a breakdown of question characteristics was performed based on the cognitive level, depth of knowledge, types of questions, vignette style, and polarity of questions. Results ChatGPT achieved an overall accuracy of 43.9%, which was lower than that of 2 human participants (70% and 78.4%). The analysis did not reveal a significant correlation between the accuracy of the model and the characteristics of the questions. An in-depth examination indicated that errors predominantly resulted from a misunderstanding of TCM concepts (55.3%), emphasizing the limitations of the model with regard to its TCM knowledge base and reasoning capability. Conclusions Although ChatGPT shows promise as an educational tool, its current performance on TCM licensing examinations is lacking. This highlights the need to enhance AI models with specialized TCM training and suggests a cautious approach to using AI for TCM education. Future research should focus on model improvement and the development of tailored educational applications to support TCM learning.
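The abstract reports checking whether model accuracy was related to question characteristics but does not name the test used; a chi-square test of independence on a correctness-by-characteristic table is one plausible way to run such a check. The sketch below uses made-up counts purely for illustration.

```python
# Hedged sketch: association between answer correctness and a question characteristic
# (e.g., cognitive level), using a chi-square test of independence.
# The contingency counts are illustrative, not the study's data.
from scipy.stats import chi2_contingency

#                 correct  incorrect
table = [[35, 45],    # e.g., recall-type questions
         [48, 72],    # e.g., application-type questions
         [28, 52]]    # e.g., vignette-style questions

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")  # p > .05 would mirror "no significant association"
```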
Affiliation(s)
- Liang-Wei Tseng: Division of Chinese Acupuncture and Traumatology, Center of Traditional Chinese Medicine, Chang Gung Memorial Hospital, Taoyuan, Taiwan
- Yi-Chin Lu: Division of Chinese Internal Medicine, Center for Traditional Chinese Medicine, Chang Gung Memorial Hospital, No. 123, Dinghu Rd, Gueishan Dist, Taoyuan 33378, Taiwan
- Yu-Chun Chen: School of Medicine, Faculty of Medicine, National Yang-Ming Chiao Tung University, Taipei, Taiwan; Taipei Veterans General Hospital, Yuli Branch, Taipei, Taiwan; Institute of Hospital and Health Care Administration, National Yang-Ming Chiao Tung University, Taipei, Taiwan
- Hsing-Yu Chen: Division of Chinese Internal Medicine, Center for Traditional Chinese Medicine, Chang Gung Memorial Hospital, No. 123, Dinghu Rd, Gueishan Dist, Taoyuan 33378, Taiwan; School of Traditional Chinese Medicine, College of Medicine, Chang Gung University, Taoyuan, Taiwan
4. Prazeres F. ChatGPT's Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini. JMIR Medical Education 2025;11:e65108. PMID: 40043219; PMCID: PMC11902880; DOI: 10.2196/65108.
Abstract
Background Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness. Objective This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates. Methods ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, "Are you sure?" after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models' performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05. Results ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance. Conclusions This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research.
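The paired first-versus-second-answer consistency check described above (McNemar test with continuity correction) can be reproduced along these lines; the 2x2 counts below are hypothetical, not the study's data.

```python
# Hedged sketch of the paired first-vs-second answer comparison using the
# McNemar test with continuity correction (statsmodels). Counts are illustrative.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: first answer correct / incorrect; columns: second answer correct / incorrect.
table = np.array([[44, 4],    # correct -> correct, correct -> incorrect
                  [6, 20]])   # incorrect -> correct, incorrect -> incorrect

result = mcnemar(table, exact=False, correction=True)  # chi-square form with continuity correction
print(f"statistic={result.statistic:.3f}, p={result.pvalue:.3f}")
```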
Affiliation(s)
- Filipe Prazeres: Faculty of Health Sciences, University of Beira Interior, Av. Infante D. Henrique, Covilhã 6201-506, Portugal; Family Health Unit Beira Ria, Gafanha da Nazaré, Portugal; CINTESIS@RISE, Department of Community Medicine, Information and Health Decision Sciences, Faculty of Medicine of the University of Porto, Porto, Portugal
5. Aster A, Laupichler MC, Rockwell-Kollmann T, Masala G, Bala E, Raupach T. ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review. Medical Science Educator 2025;35:555-567. PMID: 40144083; PMCID: PMC11933646; DOI: 10.1007/s40670-024-02206-6.
Abstract
This review aims to provide a summary of all scientific publications on the use of large language models (LLMs) in medical education over the first year of their availability. A scoping literature review was conducted in accordance with the PRISMA recommendations for scoping reviews. Five scientific literature databases were searched using predefined search terms. The search yielded 1509 initial results, of which 145 studies were ultimately included. Most studies assessed LLMs' capabilities in passing medical exams. Some studies discussed advantages, disadvantages, and potential use cases of LLMs. Very few studies conducted empirical research, and many published studies lack methodological rigor. We therefore propose a research agenda to improve the quality of studies on LLMs.
Affiliation(s)
- Alexandra Aster: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Matthias Carl Laupichler: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tamina Rockwell-Kollmann: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Gilda Masala: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Ebru Bala: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tobias Raupach: Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
6. Coşkun Ö, Kıyak YS, Budakoğlu Iİ. ChatGPT to generate clinical vignettes for teaching and multiple-choice questions for assessment: A randomized controlled experiment. Medical Teacher 2025;47:268-274. PMID: 38478902; DOI: 10.1080/0142159x.2024.2327477.
Abstract
AIM This study aimed to evaluate the real-life performance of clinical vignettes and multiple-choice questions generated using ChatGPT. METHODS This was a randomized controlled study in an evidence-based medicine training program. We randomly assigned seventy-four medical students to two groups. The ChatGPT group received ill-defined cases generated by ChatGPT, while the control group received human-written cases. At the end of the training, they evaluated the cases by rating 10 statements using a Likert scale. They also answered 15 multiple-choice questions (MCQs) generated by ChatGPT. The case evaluations of the two groups were compared. Some psychometric characteristics (item difficulty and point-biserial correlations) of the test were also reported. RESULTS None of the scores on the 10 statements regarding the cases showed a significant difference between the ChatGPT group and the control group (p > .05). In the test, only six MCQs had acceptable levels (higher than 0.30) of point-biserial correlation, and five items could be considered acceptable in classroom settings. CONCLUSIONS The results showed that the quality of the vignettes is comparable to that of those created by human authors, and some multiple-choice questions have acceptable psychometric characteristics. ChatGPT has potential for generating clinical vignettes for teaching and MCQs for assessment in medical education.
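The abstract does not name the test used to compare the two groups' Likert ratings; a Mann-Whitney U test is one common choice for ordinal ratings and is used in the hypothetical sketch below purely for illustration.

```python
# Hedged sketch: comparing per-statement Likert ratings between the ChatGPT-case
# group and the human-written-case group. The abstract does not name the test used;
# a Mann-Whitney U test is one common choice for ordinal ratings. Data are made up.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
chatgpt_group = rng.integers(3, 6, size=37)   # hypothetical 1-5 Likert ratings for one statement
control_group = rng.integers(3, 6, size=37)

u, p = mannwhitneyu(chatgpt_group, control_group, alternative="two-sided")
print(f"U={u:.1f}, p={p:.3f}")  # p > .05 would mirror the reported lack of group differences
```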
Affiliation(s)
- Özlem Coşkun: Department of Medical Education and Informatics, Gazi University, Ankara, Turkey
- Yavuz Selim Kıyak: Department of Medical Education and Informatics, Gazi University, Ankara, Turkey
- Işıl İrem Budakoğlu: Department of Medical Education and Informatics, Gazi University, Ankara, Turkey
7. Jin HK, Kim E. Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study. JMIR Medical Education 2024;10:e57451. PMID: 39630413; PMCID: PMC11633516; DOI: 10.2196/57451.
Abstract
Background ChatGPT, a recently developed artificial intelligence chatbot and a notable large language model, has demonstrated improved performance on medical field examinations. However, there is currently little research on its efficacy in languages other than English or in pharmacy-related examinations. Objective This study aimed to evaluate the performance of GPT models on the Korean Pharmacist Licensing Examination (KPLE). Methods We evaluated the percentage of correct answers provided by 2 different versions of ChatGPT (GPT-3.5 and GPT-4) for all multiple-choice single-answer KPLE questions, excluding image-based questions. In total, 320, 317, and 323 questions from the 2021, 2022, and 2023 KPLEs, respectively, were included in the final analysis, which consisted of 4 units: Biopharmacy, Industrial Pharmacy, Clinical and Practical Pharmacy, and Medical Health Legislation. Results The 3-year average percentage of correct answers was 86.5% (830/960) for GPT-4 and 60.7% (583/960) for GPT-3.5. GPT model accuracy was highest in Biopharmacy (GPT-3.5 77/96, 80.2% in 2022; GPT-4 87/90, 96.7% in 2021) and lowest in Medical Health Legislation (GPT-3.5 8/20, 40% in 2022; GPT-4 12/20, 60% in 2022). Additionally, when comparing the performance of artificial intelligence with that of human participants, pharmacy students outperformed GPT-3.5 but not GPT-4. Conclusions In the last 3 years, GPT models have performed very close to or exceeded the passing threshold for the KPLE. This study demonstrates the potential of large language models in the pharmacy domain; however, extensive research is needed to evaluate their reliability and ensure their secure application in pharmacy contexts due to several inherent challenges. Addressing these limitations could make GPT models more effective auxiliary tools for pharmacy education.
Affiliation(s)
- Hye Kyung Jin: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea
- EunYoung Kim: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, Seoul, Republic of Korea; Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-Ro, Dongjak-gu, Seoul 06974, Republic of Korea
8. Yu H, Fan L, Li L, Zhou J, Ma Z, Xian L, Hua W, He S, Jin M, Zhang Y, Gandhi A, Ma X. Large Language Models in Biomedical and Health Informatics: A Review with Bibliometric Analysis. Journal of Healthcare Informatics Research 2024;8:658-711. PMID: 39463859; PMCID: PMC11499577; DOI: 10.1007/s41666-024-00171-8.
Abstract
Large language models (LLMs) have rapidly become important tools in Biomedical and Health Informatics (BHI), potentially enabling new ways to analyze data, treat patients, and conduct research. This study aims to provide a comprehensive overview of LLM applications in BHI, highlighting their transformative potential and addressing the associated ethical and practical challenges. We reviewed 1698 research articles from January 2022 to December 2023, categorizing them by research themes and diagnostic categories. Additionally, we conducted network analysis to map scholarly collaborations and research dynamics. Our findings reveal a substantial increase in the potential applications of LLMs to a variety of BHI tasks, including clinical decision support, patient interaction, and medical document analysis. Notably, LLMs are expected to be instrumental in enhancing the accuracy of diagnostic tools and patient care protocols. The network analysis highlights dense and dynamically evolving collaborations across institutions, underscoring the interdisciplinary nature of LLM research in BHI. A significant trend was the application of LLMs in managing specific disease categories, such as mental health and neurological disorders, demonstrating their potential to influence personalized medicine and public health strategies. LLMs hold promising potential to further transform biomedical research and healthcare delivery. While promising, the ethical implications and challenges of model validation call for rigorous scrutiny to optimize their benefits in clinical settings. This survey serves as a resource for stakeholders in healthcare, including researchers, clinicians, and policymakers, to understand the current state and future potential of LLMs in BHI.
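The collaboration mapping mentioned above amounts to building a co-authorship graph and examining its structure. A minimal networkx sketch with placeholder author lists might look like the following; the review's actual bibliometric pipeline is not described in the abstract.

```python
# Hedged sketch of a co-authorship network: nodes are authors, edges link authors
# who appear on the same paper. Author lists here are placeholders.
import itertools
import networkx as nx

papers = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author D"],
    ["Author A", "Author D", "Author E"],
]

G = nx.Graph()
for authors in papers:
    for a, b in itertools.combinations(authors, 2):
        # Increment edge weight for repeated collaborations.
        w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

centrality = nx.degree_centrality(G)
print(sorted(centrality.items(), key=lambda kv: -kv[1]))  # most-connected authors first
```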
Affiliation(s)
- Huizi Yu: University of Michigan, Ann Arbor, MI, USA
- Lizhou Fan: University of Michigan, Ann Arbor, MI, USA
- Lingyao Li: University of Michigan, Ann Arbor, MI, USA
- Zihui Ma: University of Maryland, College Park, MD, USA
- Lu Xian: University of Michigan, Ann Arbor, MI, USA
- Sijia He: University of Michigan, Ann Arbor, MI, USA
- Ashvin Gandhi: University of California, Los Angeles, Los Angeles, CA, USA
- Xin Ma: Shandong University, Jinan, Shandong, China
9. Ramgopal S, Varma S, Gorski JK, Kester KM, Shieh A, Suresh S. Evaluation of a Large Language Model on the American Academy of Pediatrics' PREP Emergency Medicine Question Bank. Pediatr Emerg Care 2024;40:871-875. PMID: 39591396; DOI: 10.1097/pec.0000000000003271.
Abstract
BACKGROUND Large language models (LLMs), including ChatGPT (Chat Generative Pretrained Transformer), a popular, publicly available LLM, represent an important innovation in the application of artificial intelligence. These systems generate relevant content by identifying patterns in large text datasets based on user input across various topics. We sought to evaluate the performance of ChatGPT on practice test questions designed to assess knowledge competency for pediatric emergency medicine (PEM). METHODS We evaluated the performance of ChatGPT using questions from a popular PEM board-certification question bank published between 2022 and 2024. Clinicians assessed the performance of ChatGPT by inputting prompts and recording the software's responses, asking each question over 3 separate iterations. We calculated correct answer percentages (defined as correct in at least 2/3 iterations) and assessed agreement between the iterations using Fleiss' κ. RESULTS We included 215 questions over the 3 study years. ChatGPT responded correctly to 161 of the PREP EM questions over 3 years (74.5%; 95% confidence interval, 68.5%-80.5%), which was similar within each study year (75.0%, 71.8%, and 77.8% for study years 2022, 2023, and 2024, respectively). Among correct responses, most were answered correctly on all 3 iterations (137/161, 85.1%). Performance varied by topic, with the highest scores in research and medical specialties and lower scores in procedures and toxicology. Fleiss' κ across the 3 iterations was 0.71, indicating substantial agreement. CONCLUSION ChatGPT provided correct answers to PEM questions in three-quarters of cases, exceeding the recommended minimum passing threshold of 65% provided by the question publisher. Responses by ChatGPT included detailed explanations, suggesting potential use for medical education. We identified limitations in specific topics and in image interpretation. These results demonstrate opportunities for LLMs to enhance both the education and clinical practice of PEM.
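The "correct in at least 2 of 3 iterations" rule and the Fleiss' κ agreement statistic can be computed as in the sketch below; the simulated answers and answer key are placeholders, not the study's data.

```python
# Hedged sketch: majority-vote scoring over three iterations plus Fleiss' kappa
# for agreement across iterations (statsmodels). Responses are made-up labels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(2)
n_questions, n_iterations = 215, 3
# Hypothetical chosen options per question per iteration (A-D encoded as 0-3).
answers = rng.integers(0, 4, size=(n_questions, n_iterations))
key = rng.integers(0, 4, size=n_questions)           # hypothetical answer key

correct = (answers == key[:, None])
majority_correct = correct.sum(axis=1) >= 2           # correct in at least 2 of 3 runs
print(f"percent correct (majority rule): {100 * majority_correct.mean():.1f}%")

table, _ = aggregate_raters(answers)                  # questions x categories counts
print(f"Fleiss' kappa across iterations: {fleiss_kappa(table):.2f}")
```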
Affiliation(s)
- Andrew Shieh: Department of Emergency Medicine, University of Michigan, Ann Arbor, MI
- Srinivasan Suresh: Divisions of Health Informatics and Emergency Medicine, Department of Pediatrics, University of Pittsburgh School of Medicine and UPMC Children's Hospital of Pittsburgh, Pittsburgh, PA
10. Patel S, Patel R. Embracing Large Language Models for Adult Life Support Learning. Cureus 2024;16:e75961. PMID: 39698196; PMCID: PMC11654997; DOI: 10.7759/cureus.75961.
Abstract
Background It is recognised that large language models (LLMs) may aid medical education by supporting the understanding of explanations behind answers to multiple-choice questions. This study aimed to evaluate the efficacy of the LLM chatbots ChatGPT and Bard in answering an Intermediate Life Support pre-course multiple-choice question (MCQ) test developed by the Resuscitation Council UK, focused on managing deteriorating patients and on identifying the causes of, and treating, cardiac arrest. We assessed the accuracy of responses and the quality of explanations to evaluate the utility of the chatbots. Methods The performance of the AI chatbots ChatGPT-3.5 and Bard was assessed on their ability to choose the correct answer and provide clear, comprehensive explanations in answering MCQs developed by the Resuscitation Council UK for their Intermediate Life Support Course. Ten MCQs were tested with a total score of 40, with one point scored for each accurate response to each statement a-d. In a separate scoring, questions were scored out of 1 if all sub-statements a-d were correct, to give a total score out of 10 for the test. The explanations provided by the AI chatbots were evaluated by three qualified physicians using a rating scale from 0-3 for each overall question, and median rater scores were calculated and compared. The Fleiss multi-rater kappa (κ) was used to determine the score agreement among the three raters. Results When scoring each overall question to give a total score out of 10, Bard outperformed ChatGPT, although the difference was not significant (p=0.37). Furthermore, there was no statistically significant difference in the performance of ChatGPT compared to Bard when scoring each sub-question separately to give a total score out of 40 (p=0.26). The quality of the explanations was similar for both LLMs. Importantly, despite answering certain questions incorrectly, both AI chatbots provided some useful correct information in their explanations of the answers to these questions. The Fleiss multi-rater kappa was 0.899 (p<0.001) for ChatGPT and 0.801 (p<0.001) for Bard. Conclusions Bard and ChatGPT performed similarly in answering the MCQs, with similar scores achieved. Notably, despite having access to data across the web, neither of the LLMs answered all questions accurately, suggesting that these AI models still require further refinement before use in medical education.
Affiliation(s)
- Serena Patel: General Surgery, Imperial College NHS Trust, Ilford, GBR
- Rohit Patel: Oral and Maxillofacial Surgery, Kings College Hospital, London, GBR
11. Solmonovich RL, Kouba I, Quezada O, Rodriguez-Ayala G, Rojas V, Bonilla K, Espino K, Bracero LA. Artificial intelligence generates proficient Spanish obstetrics and gynecology counseling templates. AJOG Global Reports 2024;4:100400. PMID: 39507462; PMCID: PMC11539139; DOI: 10.1016/j.xagr.2024.100400.
Abstract
Background Effective patient counseling in obstetrics and gynecology is vital. Existing language barriers between Spanish-speaking patients and English-speaking providers may negatively impact patient understanding and adherence to medical recommendations, as language discordance between provider and patient has been associated with medication noncompliance, adverse drug events, and underuse of preventative care. Artificial intelligence large language models may be a helpful adjunct to patient care by generating counseling templates in Spanish. Objectives The primary objective was to determine if large language models can generate proficient counseling templates in Spanish on obstetrics and gynecology topics. Secondary objectives were to (1) compare the content, quality, and comprehensiveness of generated templates between different large language models, (2) compare the proficiency ratings among the large language model generated templates, and (3) assess which generated templates had potential for integration into clinical practice. Study design Cross-sectional study using free open-access large language models to generate counseling templates in Spanish on select obstetrics and gynecology topics. Native Spanish-speaking practicing obstetricians and gynecologists, who were blinded to the source large language model for each template, reviewed and subjectively scored each template on its content, quality, and comprehensiveness and considered it for integration into clinical practice. Proficiency ratings were calculated as a composite score of content, quality, and comprehensiveness. A score of >4 was considered proficient. Basic inferential statistics were performed. Results All artificial intelligence large language models generated proficient obstetrics and gynecology counseling templates in Spanish, with Google Bard generating the most proficient template (P<.0001) and outperforming the others in comprehensiveness (P=.03), quality (P=.04), and content (P=.01). Microsoft Bing received the lowest scores in these domains. Physicians were generally willing to incorporate the templates into clinical practice, with no significant difference in the likelihood of integration based on the source large language model (P=.45). Conclusions Large language models have the potential to generate proficient obstetrics and gynecology counseling templates in Spanish, which physicians would integrate into their clinical practice. Google Bard scored the highest across all attributes. There is an opportunity to use large language models to help mitigate language barriers in health care. Future studies should assess patient satisfaction, understanding, and adherence to clinical plans following receipt of these counseling templates.
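The proficiency rating is described as a composite of content, quality, and comprehensiveness with a >4 cut-off, but the abstract does not state how the composite is formed; the sketch below simply averages the three domain scores per template, with invented ratings.

```python
# Hedged sketch: composite proficiency score per counseling template, assuming the
# composite is the mean of content, quality, and comprehensiveness ratings
# (the abstract does not specify the formula). All scores are invented.
import pandas as pd

ratings = pd.DataFrame({
    "model":             ["Bard", "Bard", "ChatGPT", "ChatGPT", "Bing", "Bing"],
    "content":           [5, 5, 4, 5, 3, 4],
    "quality":           [5, 4, 4, 4, 3, 3],
    "comprehensiveness": [5, 5, 4, 4, 4, 3],
})

ratings["composite"] = ratings[["content", "quality", "comprehensiveness"]].mean(axis=1)
ratings["proficient"] = ratings["composite"] > 4          # study threshold: composite > 4

print(ratings.groupby("model")[["composite", "proficient"]].mean())
```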
Affiliation(s)
- Rachel L. Solmonovich: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY
- Insaf Kouba: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY
- Oscar Quezada: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, Peconic Bay Medical Center, Riverhead, NY
- Gianni Rodriguez-Ayala: Northwell, New Hyde Park, NY; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY; Department of Obstetrics and Gynecology, Huntington Hospital, Huntington, NY
- Veronica Rojas: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY
- Kevin Bonilla: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY
- Kevin Espino: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY
- Luis A. Bracero: Northwell, New Hyde Park, NY; Department of Obstetrics and Gynecology, South Shore University Hospital, Bay Shore, NY; Zucker School of Medicine at Hofstra/Northwell, Hempstead, NY
12. Sallam M, Al-Salahat K, Eid H, Egger J, Puladi B. Human versus Artificial Intelligence: ChatGPT-4 Outperforming Bing, Bard, ChatGPT-3.5 and Humans in Clinical Chemistry Multiple-Choice Questions. Advances in Medical Education and Practice 2024;15:857-871. PMID: 39319062; PMCID: PMC11421444; DOI: 10.2147/amep.s479801.
Abstract
Introduction Artificial intelligence (AI) chatbots excel in language understanding and generation. These models can transform healthcare education and practice. However, it is important to assess the performance of such AI models across various topics to highlight their strengths and possible limitations. This study aimed to evaluate the performance of ChatGPT (GPT-3.5 and GPT-4), Bing, and Bard compared to human students at a postgraduate master's level in Medical Laboratory Sciences. Methods The study design was based on the METRICS checklist for the design and reporting of AI-based studies in healthcare. The study utilized a dataset of 60 Clinical Chemistry multiple-choice questions (MCQs) initially conceived for assessing 20 MSc students. The revised Bloom's taxonomy was used as the framework for classifying the MCQs into four cognitive categories: Remember, Understand, Analyze, and Apply. A modified version of the CLEAR tool was used for the assessment of the quality of AI-generated content, with Cohen's κ for inter-rater agreement. Results Compared to the students' mean score of 0.68 ± 0.23, GPT-4 scored 0.90 ± 0.30, followed by Bing (0.77 ± 0.43), GPT-3.5 (0.73 ± 0.45), and Bard (0.67 ± 0.48). Significantly better performance was noted in the lower cognitive domains (Remember and Understand) for GPT-3.5 (P=0.041), GPT-4 (P=0.003), and Bard (P=0.017) compared to the higher cognitive domains (Apply and Analyze). The CLEAR scores indicated that ChatGPT-4's performance was "Excellent" compared to the "Above average" performance of ChatGPT-3.5, Bing, and Bard. Discussion The findings indicated that ChatGPT-4 excelled in the Clinical Chemistry exam, while ChatGPT-3.5, Bing, and Bard were above average. Given that the MCQs were directed at postgraduate students with a high degree of specialization, the performance of these AI chatbots was remarkable. Due to the risk of academic dishonesty and possible dependence on these AI models, the appropriateness of MCQs as an assessment tool in higher education should be re-evaluated.
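Inter-rater agreement on the CLEAR ratings was summarized with Cohen's κ; a minimal two-rater sketch follows. The ratings are invented, and for ordinal scales a weighted κ may be more appropriate than the unweighted statistic.

```python
# Hedged sketch: Cohen's kappa between two raters scoring AI-generated answers.
# Ratings are invented; for ordinal CLEAR-style scales a weighted kappa
# (weights="quadratic") is often preferred.
from sklearn.metrics import cohen_kappa_score

rater_1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4]
rater_2 = [5, 4, 3, 3, 5, 2, 4, 4, 3, 4]

print(f"unweighted kappa: {cohen_kappa_score(rater_1, rater_2):.2f}")
print(f"quadratic-weighted kappa: {cohen_kappa_score(rater_1, rater_2, weights='quadratic'):.2f}")
```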
Affiliation(s)
- Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan; Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Khaled Al-Salahat: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Huda Eid: Scientific Approaches to Fight Epidemics of Infectious Diseases (SAFE-ID) Research Group, The University of Jordan, Amman, Jordan
- Jan Egger: Institute for AI in Medicine (IKIM), University Medicine Essen (AöR), Essen, Germany
- Behrus Puladi: Institute of Medical Informatics, University Hospital RWTH Aachen, Aachen, Germany
13. Jin HK, Lee HE, Kim E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: a systematic review and meta-analysis. BMC Medical Education 2024;24:1013. PMID: 39285377; PMCID: PMC11406751; DOI: 10.1186/s12909-024-05944-8.
Abstract
BACKGROUND ChatGPT, a recently developed artificial intelligence (AI) chatbot, has demonstrated improved performance in examinations in the medical field. However, thus far, an overall evaluation of the potential of ChatGPT models (ChatGPT-3.5 and GPT-4) in a variety of national health licensing examinations is lacking. This study aimed to provide a comprehensive assessment of the ChatGPT models' performance in national licensing examinations for medical, pharmacy, dentistry, and nursing research through a meta-analysis. METHODS Following the PRISMA protocol, full-text articles from MEDLINE/PubMed, EMBASE, ERIC, Cochrane Library, Web of Science, and key journals were reviewed from the time of ChatGPT's introduction to February 27, 2024. Studies were eligible if they evaluated the performance of a ChatGPT model (ChatGPT-3.5 or GPT-4); related to national licensing examinations in the fields of medicine, pharmacy, dentistry, or nursing; involved multiple-choice questions; and provided data that enabled the calculation of effect size. Two reviewers independently completed data extraction, coding, and quality assessment. The JBI Critical Appraisal Tools were used to assess the quality of the selected articles. Overall effect size and 95% confidence intervals [CIs] were calculated using a random-effects model. RESULTS A total of 23 studies were considered for this review, which evaluated the accuracy of four types of national licensing examinations. The selected articles were in the fields of medicine (n = 17), pharmacy (n = 3), nursing (n = 2), and dentistry (n = 1). They reported varying accuracy levels, ranging from 36 to 77% for ChatGPT-3.5 and 64.4-100% for GPT-4. The overall effect size for the percentage of accuracy was 70.1% (95% CI, 65-74.8%), which was statistically significant (p < 0.001). Subgroup analyses revealed that GPT-4 demonstrated significantly higher accuracy in providing correct responses than its earlier version, ChatGPT-3.5. Additionally, in the context of health licensing examinations, the ChatGPT models exhibited greater proficiency in the following order: pharmacy, medicine, dentistry, and nursing. However, the lack of a broader set of questions, including open-ended and scenario-based questions, and significant heterogeneity were limitations of this meta-analysis. CONCLUSIONS This study sheds light on the accuracy of ChatGPT models in four national health licensing examinations across various countries and provides a practical basis and theoretical support for future research. Further studies are needed to explore their utilization in medical and health education by including a broader and more diverse range of questions, along with more advanced versions of AI chatbots.
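The pooled accuracy and its 95% CI come from a random-effects model; one common implementation pools logit-transformed proportions with DerSimonian-Laird weights, as sketched below. The per-study counts are placeholders, and the review's exact estimator and software may differ.

```python
# Hedged sketch of DerSimonian-Laird random-effects pooling of proportions on the
# logit scale. Study counts are placeholders; the meta-analysis' exact estimator,
# software, and corrections may differ.
import numpy as np

correct = np.array([120, 240, 75, 310, 95])    # hypothetical correct answers per study
total = np.array([180, 300, 130, 380, 150])    # hypothetical questions per study

p = correct / total
y = np.log(p / (1 - p))                        # logit-transformed proportions
v = 1 / correct + 1 / (total - correct)        # approximate logit variances

w = 1 / v                                      # fixed-effect (inverse-variance) weights
y_fixed = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - y_fixed) ** 2)             # Cochran's Q
df = len(y) - 1
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / C)                  # between-study variance

w_re = 1 / (v + tau2)                          # random-effects weights
y_re = np.sum(w_re * y) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
lo, hi = y_re - 1.96 * se_re, y_re + 1.96 * se_re

expit = lambda x: 1 / (1 + np.exp(-x))         # back-transform logits to proportions
print(f"pooled accuracy = {expit(y_re):.3f} (95% CI {expit(lo):.3f}-{expit(hi):.3f})")
```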
Affiliation(s)
- Hye Kyung Jin: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea
- Ha Eun Lee: Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea
- EunYoung Kim: Research Institute of Pharmaceutical Sciences, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea; Data Science, Evidence-Based and Clinical Research Laboratory, Department of Health, Social, and Clinical Pharmacy, College of Pharmacy, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea; Division of Licensing of Medicines and Regulatory Science, The Graduate School of Pharmaceutical Management and Regulatory Science Policy, The Graduate School of Pharmaceutical Regulatory Sciences, Chung-Ang University, 84 Heukseok-ro, Dongjak-gu, Seoul 06974, South Korea
14. Gan W, Ouyang J, Li H, Xue Z, Zhang Y, Dong Q, Huang J, Zheng X, Zhang Y. Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial. J Med Internet Res 2024;26:e57037. PMID: 39163598; PMCID: PMC11372336; DOI: 10.2196/57037.
Abstract
BACKGROUND ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. OBJECTIVE The study aimed to evaluate ChatGPT's accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. In addition, long-term effects on student performance in other subjects were measured using final examination results. METHODS We first evaluated ChatGPT's accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups' understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups' performance in other disciplines were noted through a follow-up at the end of the semester. RESULTS ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals completed the trial and all follow-up work. After the short-term intervention in the students' learning approach, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. CONCLUSIONS ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT's integration into medical education, enhancing contemporary instructional methods. TRIAL REGISTRATION Chinese Clinical Trial Registry ChiCTR2300071774; https://www.chictr.org.cn/hvshowproject.html?id=225740&v=1.0.
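The between-group comparisons are reported only as means, SDs, and P values; assuming an independent-samples t test and roughly equal group sizes (about 55 per arm of the 110 completers, which the abstract does not state), the orthopedics-test result can be approximated from the summary statistics alone:

```python
# Hedged sketch: reproducing the orthopedics-test comparison from the reported
# summary statistics, assuming an independent-samples t test and ~55 students per
# group (the exact group sizes are not stated in the abstract).
from scipy.stats import ttest_ind_from_stats

t, p = ttest_ind_from_stats(mean1=141.20, std1=26.68, nobs1=55,
                            mean2=130.80, std2=25.56, nobs2=55,
                            equal_var=True)
print(f"t={t:.2f}, p={p:.3f}")   # roughly matches the reported P=.04
```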
Affiliation(s)
- Wenyi Gan: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Jianfeng Ouyang: Department of Joint Surgery and Sports Medicine, Zhuhai People's Hospital (Zhuhai Hospital Affiliated With Jinan University), Zhuhai, Guangdong, China
- Hua Li: Department of Orthopaedics, Beijing Jishuitan Hospital, Beijing, China
- Zhaowen Xue: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Yiming Zhang: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Qiu Dong: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Jiadong Huang: Jinan University-University of Birmingham Joint Institute, Jinan University, Guangzhou, China
- Xiaofei Zheng: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
- Yiyi Zhang: The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, Guangzhou, China
15. Mayo-Yáñez M, Lechien JR, Maria-Saibene A, Vaira LA, Maniaci A, Chiesa-Estomba CM. Examining the Performance of ChatGPT 3.5 and Microsoft Copilot in Otolaryngology: A Comparative Study with Otolaryngologists' Evaluation. Indian J Otolaryngol Head Neck Surg 2024;76:3465-3469. PMID: 39130248; PMCID: PMC11306834; DOI: 10.1007/s12070-024-04729-1.
Abstract
This study evaluated the response capabilities of ChatGPT 3.5 and an internet-connected GPT-4 engine (Microsoft Copilot) on a public healthcare system otolaryngology job competition examination, using the real scores of otolaryngology specialists as the control group. In September 2023, 135 questions divided into theoretical and practical parts were input into ChatGPT 3.5 and an internet-connected GPT-4. The accuracy of AI responses was compared with the official results from otolaryngologists who took the exam, and statistical analysis was conducted using Stata 14.2. Copilot (GPT-4) outperformed ChatGPT 3.5: Copilot achieved a score of 88.5 points, while ChatGPT scored 60 points. Both AIs had discrepancies in their incorrect answers. Despite ChatGPT's proficiency, Copilot displayed superior performance, achieving the second-best score among the 108 otolaryngologists who took the exam, while ChatGPT was placed 83rd. A chat powered by GPT-4 with internet access (Copilot) demonstrates superior performance in responding to multiple-choice medical questions compared to ChatGPT 3.5.
Affiliation(s)
- Miguel Mayo-Yáñez: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Otorhinolaryngology - Head and Neck Surgery Department, Complexo Hospitalario Universitario A Coruña (CHUAC), 15006 A Coruña, Galicia, Spain; Otorhinolaryngology - Head and Neck Surgery Department, Hospital San Rafael (HSR) de A Coruña, 15006 A Coruña, Spain; Otorhinolaryngology Research Group, Institute of Biomedical Research of A Coruña (INIBIC), Complexo Hospitalario Universitario de A Coruña (CHUAC), Universidade da Coruña (UDC), 15006 A Coruña, Spain
- Jerome R. Lechien: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Department of Otolaryngology, Polyclinique de Poitiers, Elsan Hospital, 86000 Poitiers, France; Department of Otolaryngology - Head & Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), 91190 Paris, France; Department of Human Anatomy and Experimental Oncology, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), 7000 Mons, Belgium; Department of Otolaryngology - Head & Neck Surgery, CHU Saint-Pierre (CHU de Bruxelles), 1000 Brussels, Belgium
- Alberto Maria-Saibene: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Otolaryngology Unit, Santi Paolo e Carlo Hospital, Department of Health Sciences, Università degli Studi di Milano, Milan, Italy
- Luigi A. Vaira: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, 07100 Sassari, Italy
- Antonino Maniaci: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Faculty of Medicine and Surgery, University of Enna "Kore", 94100 Enna, Italy
- Carlos M. Chiesa-Estomba: Young-Otolaryngologists of the International Federation of Oto-Rhino-Laryngological Societies (YO-IFOS) Study Group, 75000 Paris, France; Otorhinolaryngology - Head and Neck Surgery Department, Hospital Universitario Donostia - Biodonostia Research Institute, 20014 Donostia, Spain
16. Ishida K, Hanada E. Potential of ChatGPT to Pass the Japanese Medical and Healthcare Professional National Licenses: A Literature Review. Cureus 2024;16:e66324. PMID: 39247019; PMCID: PMC11377128; DOI: 10.7759/cureus.66324.
Abstract
This systematic review aimed to assess the academic potential of ChatGPT (GPT-3.5, 4, and 4V) for Japanese national medical and healthcare licensing examinations, taking into account its strengths and limitations. Electronic databases such as PubMed/Medline, Google Scholar, and ICHUSHI (a Japanese medical article database) were systematically searched for relevant articles, particularly those published between January 1, 2022, and April 30, 2024. A formal narrative analysis was conducted by systematically arranging similarities and differences between individual research findings. After rigorous screening, we reviewed 22 articles. With one exception, all articles that evaluated GPT-4 showed that this tool could pass examinations consisting of text-only questions. However, some studies also reported that, despite passing, the results of GPT-4 were worse than those of the actual examinees. Moreover, the newest model, GPT-4V, recognized images insufficiently and therefore provided inadequate answers to questions that involved images and figures/tables. Therefore, the precision of these models needs to be improved to obtain better results.
Affiliation(s)
- Kai Ishida: Faculty of Engineering, Shonan Institute of Technology, Fujisawa, JPN
- Eisuke Hanada: Faculty of Science and Engineering, Saga University, Saga, JPN
17. Tessler I, Wolfovitz A, Alon EE, Gecel NA, Livneh N, Zimlichman E, Klang E. ChatGPT's adherence to otolaryngology clinical practice guidelines. Eur Arch Otorhinolaryngol 2024;281:3829-3834. PMID: 38647684; DOI: 10.1007/s00405-024-08634-9.
Abstract
OBJECTIVES Large language models, including ChatGPT, have the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT's performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. METHODS We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model's consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen's Kappa was used to measure evaluator agreement, and Cronbach's alpha assessed the consistency of ChatGPT's responses. RESULTS The study revealed mixed results; 59.7% (43/72) of ChatGPT's responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model's responses were consistent in 17/24 (70.8%), with a Cronbach's alpha value of 0.87, indicating reasonable consistency across tests. CONCLUSIONS Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.
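Cronbach's alpha, used above to summarize consistency across the three question repetitions, follows a simple formula that can be computed directly; the score matrix below is invented and treats the three repetitions as "items".

```python
# Hedged sketch: Cronbach's alpha for consistency of graded responses across three
# repeated runs (rows = questions, columns = repetitions). Scores are invented.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(3)
base = rng.integers(0, 3, size=(24, 1))                          # per-question accuracy grade (0-2)
runs = np.clip(base + rng.integers(-1, 2, size=(24, 3)), 0, 2)   # three correlated repetitions
print(f"Cronbach's alpha: {cronbach_alpha(runs.astype(float)):.2f}")
```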
Affiliation(s)
- Idit Tessler: Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel; School of Medicine, Tel Aviv University, Tel Aviv, Israel; ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel
- Amit Wolfovitz: Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel; School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Eran E Alon: Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel; School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Nir A Gecel: School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Nir Livneh: Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel; School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Eyal Zimlichman: School of Medicine, Tel Aviv University, Tel Aviv, Israel; ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel; The Sheba Talpiot Medical Leadership Program, Ramat Gan, Israel; Hospital Management, Sheba Medical Center, Ramat Gan, Israel
- Eyal Klang: The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, USA
18. Knoedler L, Knoedler S, Hoch CC, Prantl L, Frank K, Soiderer L, Cotofana S, Dorafshar AH, Schenck T, Vollbach F, Sofo G, Alfertshofer M. In-depth analysis of ChatGPT's performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions. Sci Rep 2024;14:13553. PMID: 38866891; PMCID: PMC11169536; DOI: 10.1038/s41598-024-63997-7.
Abstract
ChatGPT has garnered attention as a multifaceted AI chatbot with potential applications in medicine. Despite intriguing preliminary findings in areas such as clinical management and patient education, there remains a substantial knowledge gap in comprehensively understanding the opportunities and limitations of ChatGPT's capabilities, especially in medical test-taking and education. A total of n = 2,729 USMLE Step 1 practice questions were extracted from the Amboss question bank. After excluding 352 image-based questions, a total of 2,377 text-based questions were further categorized and entered manually into ChatGPT, and its responses were recorded. ChatGPT's overall performance was analyzed based on question difficulty, category, and content with regard to specific signal words and phrases. ChatGPT achieved an overall accuracy rate of 55.8% on the n = 2,377 USMLE Step 1 preparation questions obtained from the Amboss online question bank. It demonstrated a significant inverse correlation between question difficulty and performance (rs = -0.306; p < 0.001), while maintaining accuracy comparable to the human user peer group across different levels of question difficulty. Notably, ChatGPT outperformed the peer group on serology-related questions (61.1% vs. 53.8%; p = 0.005) but struggled with ECG-related content (42.9% vs. 55.6%; p = 0.021). ChatGPT performed significantly worse on pathophysiology-related question stems (signal phrase: "what is the most likely/probable cause"). Overall, ChatGPT performed consistently across various question categories and difficulty levels. These findings emphasize the need for further investigations to explore the potential and limitations of ChatGPT in medical examination and education.
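Topic-level contrasts such as the serology result (61.1% vs. 53.8%) are comparisons of proportions; a two-proportion z-test is one straightforward way to run them. The item counts below are hypothetical because the abstract reports only percentages, and the study's own test may differ.

```python
# Hedged sketch: comparing ChatGPT's accuracy on one topic against the peer-group
# accuracy with a two-proportion z-test (statsmodels). Item counts are hypothetical;
# the abstract reports percentages only, and the study's own test may differ.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

correct = np.array([110, 97])    # hypothetical: ChatGPT vs. human users correct on one topic
totals = np.array([180, 180])

z, p = proportions_ztest(count=correct, nobs=totals)
print(f"z={z:.2f}, p={p:.3f}")
```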
Affiliation(s)
- Leonard Knoedler
- Department of Oral and Maxillofacial Surgery, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität Zu Berlin, and Berlin Institute of Health, Berlin, Germany
| | - Samuel Knoedler
- Department of Plastic Surgery and Hand Surgery, Klinikum Rechts Der Isar, Technical University of Munich, Munich, Germany
- Division of Plastic Surgery, Department of Surgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
| | - Cosima C Hoch
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine, Technical University of Munich (TUM), Munich, Germany
| | - Lukas Prantl
- Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Regensburg, Germany
| | | | | | - Sebastian Cotofana
- Department of Dermatology, Erasmus Medical Centre, Rotterdam, The Netherlands
- Centre for Cutaneous Research, Blizard Institute, Queen Mary University of London, London, UK
- Department of Plastic and Reconstructive Surgery, Guangdong Second Provincial General Hospital, Guangzhou, Guangdong Province, China
| | - Amir H Dorafshar
- Department of Surgery, Emory University School of Medicine, Atlanta, GA, USA
| | | | - Felix Vollbach
- Department of Hand, Plastic and Aesthetic Surgery, Ludwig-Maximilians-University Munich, Munich, Germany
| | - Giuseppe Sofo
- Instituto Ivo Pitanguy, Hospital Santa Casa de Misericórdia, Pontifícia Universidade Católica Do Rio de Janeiro, Rio de Janeiro, Brazil
| | - Michael Alfertshofer
- Department of Plastic Surgery and Hand Surgery, Klinikum Rechts Der Isar, Technical University of Munich, Munich, Germany.
- Department of Oromaxillofacial Surgery, Ludwig-Maximilians-University Munich, Munich, Germany.
| |
19
Kıyak YS, Coşkun Ö, Budakoğlu Iİ, Uluoğlu C. ChatGPT for generating multiple-choice questions: Evidence on the use of artificial intelligence in automatic item generation for a rational pharmacotherapy exam. Eur J Clin Pharmacol 2024; 80:729-735. [PMID: 38353690 DOI: 10.1007/s00228-024-03649-x] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2023] [Accepted: 02/03/2024] [Indexed: 04/09/2024]
Abstract
PURPOSE Artificial intelligence, specifically large language models such as ChatGPT, offers potentially valuable support in question (item) writing. This study aimed to determine the feasibility of generating case-based multiple-choice questions using ChatGPT in terms of item difficulty and discrimination. METHODS This study involved 99 fourth-year medical students who participated in a rational pharmacotherapy clerkship based on the WHO 6-Step Model. In response to a prompt that we provided, ChatGPT generated ten case-based multiple-choice questions on hypertension. Following an expert panel review, two of these multiple-choice questions were incorporated into a medical school exam without any changes. Based on the administration of the test, we evaluated their psychometric properties, including item difficulty, item discrimination (point-biserial correlation), and the functionality of the options. RESULTS Both questions exhibited acceptable point-biserial correlations (0.41 and 0.39), above the 0.30 threshold. However, one question had three non-functional options (options chosen by fewer than 5% of exam participants), while the other had none. CONCLUSIONS The findings show that the questions can effectively differentiate between high- and low-performing students, which also points to the potential of ChatGPT as an artificial intelligence tool in test development. Future studies may use the prompt to generate items in order to enhance the external validity of the results by gathering data from diverse institutions and settings.
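The psychometric quantities named here (item difficulty, point-biserial discrimination, and option functionality) can be computed from a response matrix as in the following sketch; the function name, responses, and exam totals are illustrative assumptions, with only the 0.30 and 5% thresholds taken from the abstract.

```python
# Sketch: item statistics for one multiple-choice question.
# Data and names are illustrative; thresholds follow the abstract (rpb > 0.30, options < 5% = non-functional).
import numpy as np
from collections import Counter

def item_statistics(item_correct, total_scores, chosen_options):
    """item_correct: 0/1 per examinee; total_scores: exam totals; chosen_options: selected option labels."""
    n = len(item_correct)
    difficulty = float(np.mean(item_correct))               # proportion correct (difficulty/facility index)
    rpb = np.corrcoef(item_correct, total_scores)[0, 1]     # point-biserial = Pearson r with a dichotomous item score
    counts = Counter(chosen_options)
    nonfunctional = [opt for opt, c in counts.items() if c / n < 0.05]
    return difficulty, rpb, nonfunctional

correct = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])           # hypothetical responses to one ChatGPT-written item
totals  = np.array([78, 55, 82, 90, 48, 70, 85, 52, 88, 75]) # hypothetical total exam scores
options = list("ACAABAACAA")                                  # hypothetical option choices
print(item_statistics(correct, totals, options))
```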
Affiliation(s)
- Yavuz Selim Kıyak
- Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara, Turkey.
- Gazi Üniversitesi Hastanesi E Blok 9, Kat 06500 Beşevler, Ankara, Turkey.
| | - Özlem Coşkun
- Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara, Turkey
| | - Işıl İrem Budakoğlu
- Department of Medical Education and Informatics, Faculty of Medicine, Gazi University, Ankara, Turkey
| | - Canan Uluoğlu
- Department of Medical Pharmacology, Faculty of Medicine, Gazi University, Ankara, Turkey
| |
20
Powers AY, McCandless MG, Taussky P, Vega RA, Shutran MS, Moses ZB. Educational Limitations of ChatGPT in Neurosurgery Board Preparation. Cureus 2024; 16:e58639. [PMID: 38770467 PMCID: PMC11104278 DOI: 10.7759/cureus.58639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/20/2024] [Indexed: 05/22/2024] Open
Abstract
Objective This study evaluated the potential of Chat Generative Pre-trained Transformer (ChatGPT) as an educational tool for neurosurgery residents preparing for the American Board of Neurological Surgery (ABNS) primary examination. Methods Non-imaging questions from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank were input into ChatGPT. Accuracy was evaluated and compared to human performance across subcategories. To quantify ChatGPT's educational potential, the concordance and insight of explanations were assessed by multiple neurosurgical faculty. Associations among these metrics as well as question length were evaluated. Results ChatGPT had an accuracy of 50.4% (1,068/2,120), with the highest and lowest accuracies in the pharmacology (81.2%, 13/16) and vascular (32.9%, 91/277) subcategories, respectively. ChatGPT performed worse than humans overall, as well as in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular subcategories. There were no subjects in which ChatGPT performed better than humans and its accuracy was below that required to pass the exam. The mean concordance was 93.4% (198/212) and the mean insight score was 2.7. Accuracy was negatively associated with question length (R2=0.29, p=0.03) but positively associated with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001). Conclusions The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT's ability to pass, let alone teach, the neurosurgical boards.
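A minimal sketch of the length-accuracy association reported above, assuming per-group word counts and accuracy values; the numbers are invented for illustration and are not the study's data.

```python
# Sketch: simple linear regression of accuracy on mean question length (hypothetical values).
from scipy.stats import linregress

mean_question_length = [62, 75, 88, 94, 103, 118, 131]             # words per question
accuracy             = [0.81, 0.62, 0.58, 0.55, 0.50, 0.44, 0.33]  # proportion correct

fit = linregress(mean_question_length, accuracy)
print(f"slope = {fit.slope:.4f}, R^2 = {fit.rvalue**2:.2f}, p = {fit.pvalue:.3f}")
```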
Affiliation(s)
- Andrew Y Powers
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
| | | | - Philipp Taussky
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
| | - Rafael A Vega
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
| | - Max S Shutran
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
| | - Ziev B Moses
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
| |
21
Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, Hanson J, Haas M, Spadafore M, Grafton-Clarke C, Gasiea RY, Michie C, Corral J, Kwan B, Dolmans D, Thammasitboon S. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. MEDICAL TEACHER 2024; 46:446-470. [PMID: 38423127 DOI: 10.1080/0142159x.2024.2314198] [Citation(s) in RCA: 76] [Impact Index Per Article: 76.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/04/2023] [Accepted: 01/31/2024] [Indexed: 03/02/2024]
Abstract
BACKGROUND Artificial Intelligence (AI) is rapidly transforming healthcare, and there is a critical need for a nuanced understanding of how AI is reshaping teaching, learning, and educational practice in medical education. This review aimed to map the literature regarding AI applications in medical education, core areas of findings, potential candidates for formal systematic review, and gaps for future research. METHODS This rapid scoping review, conducted over 16 weeks, employed Arksey and O'Malley's framework and adhered to STORIES and BEME guidelines. A systematic and comprehensive search across PubMed/MEDLINE, EMBASE, and MedEdPublish was conducted without date or language restrictions. Publications included in the review spanned undergraduate, graduate, and continuing medical education, encompassing both original studies and perspective pieces. Data were charted by multiple author pairs and synthesized into various thematic maps and charts, ensuring a broad and detailed representation of the current landscape. RESULTS The review synthesized 278 publications, with a majority (68%) from North American and European regions. The studies covered diverse AI applications in medical education, such as AI for admissions, teaching, assessment, and clinical reasoning. The review highlighted AI's varied roles, from augmenting traditional educational methods to introducing innovative practices, and underscored the urgent need for ethical guidelines on AI's application in medical education. CONCLUSION The current literature has been charted. The findings underscore the need for ongoing research to explore uncharted areas and address potential risks associated with AI use in medical education. This work serves as a foundational resource for educators, policymakers, and researchers navigating AI's evolving role in medical education. A framework to support future high-utility reporting, the FACETS framework, is proposed.
Affiliation(s)
- Morris Gordon
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
- Blackpool Hospitals NHS Foundation Trust, Blackpool, UK
| | - Michelle Daniel
- School of Medicine, University of California, San Diego, San Diego, CA, USA
| | - Aderonke Ajiboye
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
| | - Hussein Uraiby
- Department of Cellular Pathology, University Hospitals of Leicester NHS Trust, Leicester, UK
| | - Nicole Y Xu
- School of Medicine, University of California, San Diego, San Diego, CA, USA
| | - Rangana Bartlett
- Department of Cognitive Science, University of California, San Diego, CA, USA
| | - Janice Hanson
- Department of Medicine and Office of Education, School of Medicine, Washington University in Saint Louis, Saint Louis, MO, USA
| | - Mary Haas
- Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
| | - Maxwell Spadafore
- Department of Emergency Medicine, University of Michigan Medical School, Ann Arbor, MI, USA
| | | | | | - Colin Michie
- School of Medicine and Dentistry, University of Central Lancashire, Preston, UK
| | - Janet Corral
- Department of Medicine, University of Nevada Reno, School of Medicine, Reno, NV, USA
| | - Brian Kwan
- School of Medicine, University of California, San Diego, San Diego, CA, USA
| | - Diana Dolmans
- School of Health Professions Education, Faculty of Health, Maastricht University, Maastricht, The Netherlands
| | - Satid Thammasitboon
- Center for Research, Innovation and Scholarship in Health Professions Education, Baylor College of Medicine, Houston, TX, USA
| |
22
Funk PF, Hoch CC, Knoedler S, Knoedler L, Cotofana S, Sofo G, Bashiri Dezfouli A, Wollenberg B, Guntinas-Lichius O, Alfertshofer M. ChatGPT's Response Consistency: A Study on Repeated Queries of Medical Examination Questions. Eur J Investig Health Psychol Educ 2024; 14:657-668. [PMID: 38534904 PMCID: PMC10969490 DOI: 10.3390/ejihpe14030043] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2024] [Revised: 03/05/2024] [Accepted: 03/07/2024] [Indexed: 05/31/2025] Open
Abstract
(1) Background: As the field of artificial intelligence (AI) evolves, tools like ChatGPT are increasingly integrated into various domains of medicine, including medical education and research. Given the critical nature of medicine, it is of paramount importance that AI tools offer a high degree of reliability in the information they provide. (2) Methods: A total of 450 medical examination questions were manually entered three times each into ChatGPT 3.5 and ChatGPT 4. The responses were collected, and their accuracy and consistency across the repeated entries were statistically analyzed. (3) Results: ChatGPT 4 displayed significantly higher accuracy, at 85.7% compared with 57.7% for ChatGPT 3.5 (p < 0.001). Furthermore, ChatGPT 4 was more consistent, answering 77.8% of questions correctly across all rounds, a significant increase over the 44.9% observed for ChatGPT 3.5 (p < 0.001). (4) Conclusions: The findings underscore the increased accuracy and dependability of ChatGPT 4 in the context of medical education and potential clinical decision making. Nonetheless, the research emphasizes the indispensable nature of human-delivered healthcare and the vital role of continuous assessment in leveraging AI in medicine.
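The consistency analysis described here can be approximated as follows; the simulated 450 x 3 response matrices and the 2 x 2 chi-square comparison are assumptions made for illustration, not the authors' procedure.

```python
# Sketch: fraction of questions answered correctly in all three rounds, compared between two models.
# Response matrices are simulated placeholders (rows = questions, columns = rounds, 1 = correct).
import numpy as np
from scipy.stats import chi2_contingency

gpt35 = np.random.default_rng(0).binomial(1, 0.58, size=(450, 3))
gpt4  = np.random.default_rng(1).binomial(1, 0.86, size=(450, 3))

c35 = int(np.all(gpt35 == 1, axis=1).sum())  # consistently correct across all rounds
c4  = int(np.all(gpt4  == 1, axis=1).sum())

table = [[c35, 450 - c35], [c4, 450 - c4]]   # 2x2: consistently correct vs. not, per model
chi2, p, dof, expected = chi2_contingency(table)
print(f"ChatGPT 3.5: {c35/450:.1%}, ChatGPT 4: {c4/450:.1%}, p = {p:.3g}")
```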
Affiliation(s)
- Paul F. Funk
- Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Jena, Friedrich Schiller University Jena, Am Klinikum 1, 07747 Jena, Germany;
| | - Cosima C. Hoch
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany (A.B.D.); (B.W.)
| | - Samuel Knoedler
- Department of Plastic Surgery and Hand Surgery, Klinikum Rechts der Isar, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
| | - Leonard Knoedler
- Division of Plastic and Reconstructive Surgery, Massachusetts General Hospital, Harvard Medical School, 55 Fruit Street, Boston, MA 02114, USA
| | - Sebastian Cotofana
- Department of Dermatology, Erasmus Medical Centre, Dr. Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Centre for Cutaneous Research, Blizard Institute, Queen Mary University of London, Mile End Road, London E1 4NS, UK
- Department of Plastic and Reconstructive Surgery, Guangdong Second Provincial General Hospital, Guangzhou 510317, China
| | - Giuseppe Sofo
- Instituto Ivo Pitanguy, Hospital Santa Casa de Misericórdia Rio de Janeiro, Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro 20020-022, Brazil;
| | - Ali Bashiri Dezfouli
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany (A.B.D.); (B.W.)
| | - Barbara Wollenberg
- Department of Otolaryngology, Head and Neck Surgery, School of Medicine and Health, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany (A.B.D.); (B.W.)
| | - Orlando Guntinas-Lichius
- Department of Otorhinolaryngology, Head and Neck Surgery, University Hospital Jena, Friedrich Schiller University Jena, Am Klinikum 1, 07747 Jena, Germany;
| | - Michael Alfertshofer
- Department of Plastic Surgery and Hand Surgery, Klinikum Rechts der Isar, Technical University of Munich (TUM), Ismaningerstrasse 22, 81675 Munich, Germany
- Department of Oromaxillofacial Surgery, Ludwig-Maximilians University Munich, Lindwurmstraße 2A, 80337 Munich, Germany
| |
23
Sallam M, Barakat M, Sallam M. A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review. Interact J Med Res 2024; 13:e54704. [PMID: 38276872 PMCID: PMC10905357 DOI: 10.2196/54704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2023] [Revised: 12/18/2023] [Accepted: 01/26/2024] [Indexed: 01/27/2024] Open
Abstract
BACKGROUND Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. OBJECTIVE This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. METHODS A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. The methodologies employed in the included records were examined carefully to identify common pertinent themes and possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters, with Cohen κ used to evaluate interrater reliability. RESULTS The final data set that formed the basis for theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). Interrater reliability was acceptable, with Cohen κ ranging from 0.558 to 0.962 (P<.001 for the 9 tested items). Classified per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and the "Individual factors" item (classified as satisfactory). CONCLUSIONS The METRICS checklist can facilitate the design of studies, guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, given the variability observed in methodologies and reporting. The proposed METRICS checklist could be a helpful preliminary base for establishing a universally accepted approach to standardizing the design and reporting of generative AI-based studies in health care, a swiftly evolving research topic.
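The interrater-reliability step can be sketched with Cohen κ on two raters' item scores; the ratings below are hypothetical, and a weighted κ could be substituted for ordinal scores.

```python
# Sketch: Cohen kappa between two raters scoring a checklist item (hypothetical ratings).
from sklearn.metrics import cohen_kappa_score

rater1 = [5, 4, 3, 4, 5, 2, 3, 4, 5, 3, 4, 2]   # scores assigned by rater 1
rater2 = [5, 4, 3, 3, 5, 2, 3, 4, 4, 3, 4, 2]   # scores assigned by rater 2

kappa = cohen_kappa_score(rater1, rater2)        # weights="quadratic" would treat scores as ordinal
print(f"Cohen kappa = {kappa:.3f}")
```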
Affiliation(s)
- Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan
- Department of Translational Medicine, Faculty of Medicine, Lund University, Malmo, Sweden
| | - Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, Jordan
| | - Mohammed Sallam
- Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
| |
24
Boyd CJ, Hemal K, Sorenson TJ, Patel PA, Bekisz JM, Choi M, Karp NS. Artificial Intelligence as a Triage Tool during the Perioperative Period: Pilot Study of Accuracy and Accessibility for Clinical Application. PLASTIC AND RECONSTRUCTIVE SURGERY-GLOBAL OPEN 2024; 12:e5580. [PMID: 38313585 PMCID: PMC10836902 DOI: 10.1097/gox.0000000000005580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2023] [Accepted: 12/05/2023] [Indexed: 02/06/2024]
Abstract
Background Given the dialogistic properties of ChatGPT, we hypothesized that this artificial intelligence (AI) function can be used as a self-service tool in which clinical questions are answered directly by AI. Our objective was to assess the content, accuracy, and accessibility of AI-generated content regarding common perioperative questions for reduction mammaplasty. Methods ChatGPT (OpenAI, February Version, San Francisco, Calif.) was used to query 20 common patient concerns that arise in the perioperative period of a reduction mammaplasty. Searches were performed in duplicate for both a general term and a specific clinical question. Query outputs were analyzed both objectively and subjectively. Descriptive statistics, t tests, and chi-square tests were performed where appropriate, with a predetermined significance level of P less than 0.05. Results Across a total of 40 AI-generated outputs, the mean response length was 191.8 words. Readability was at the thirteenth grade level. Regarding content, 97.5% of all query outputs were on the appropriate topic. Medical advice was deemed reasonable in 100% of cases. General queries more frequently returned overarching background information, whereas specific queries more frequently returned prescriptive information (P < 0.0001). AI outputs specifically recommended following surgeon-provided postoperative instructions in 82.5% of instances. Conclusions Currently available AI tools, in their nascent form, can provide recommendations for common perioperative questions and concerns for reduction mammaplasty. With further calibration, AI interfaces may serve as a tool for fielding patient queries in the future; however, patients must always retain the ability to bypass technology and contact their surgeon.
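The grade-level readability mentioned here is commonly estimated with the Flesch-Kincaid grade formula; the sketch below uses a crude syllable heuristic and an invented sample response, so it only approximates what a dedicated readability library would report.

```python
# Sketch: Flesch-Kincaid grade level with a rough syllable heuristic (illustrative only).
import re

def count_syllables(word):
    # crude heuristic: count groups of consecutive vowels
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Swelling after reduction mammaplasty is expected in the first weeks. "
          "Contact your surgeon promptly if you notice fever, spreading redness, or drainage.")
print(f"Approximate grade level: {flesch_kincaid_grade(sample):.1f}")
```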
Affiliation(s)
- Carter J Boyd
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| | - Kshipra Hemal
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| | - Thomas J Sorenson
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| | | | - Jonathan M Bekisz
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| | - Mihye Choi
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| | - Nolan S Karp
- From the Hansjörg Wyss Department of Plastic Surgery, NYU Langone, New York, N.Y
| |
25
Abdaljaleel M, Barakat M, Alsanafi M, Salim NA, Abazid H, Malaeb D, Mohammed AH, Hassan BAR, Wayyes AM, Farhan SS, Khatib SE, Rahal M, Sahban A, Abdelaziz DH, Mansour NO, AlZayer R, Khalil R, Fekih-Romdhane F, Hallit R, Hallit S, Sallam M. A multinational study on the factors influencing university students' attitudes and usage of ChatGPT. Sci Rep 2024; 14:1983. [PMID: 38263214 PMCID: PMC10806219 DOI: 10.1038/s41598-024-52549-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Accepted: 01/19/2024] [Indexed: 01/25/2024] Open
Abstract
Artificial intelligence models, like ChatGPT, have the potential to revolutionize higher education when implemented properly. This study aimed to investigate the factors influencing university students' attitudes towards and usage of ChatGPT in Arab countries. The survey instrument "TAME-ChatGPT" was administered to 2240 participants from Iraq, Kuwait, Egypt, Lebanon, and Jordan. Of those, 46.8% had heard of ChatGPT, and 52.6% had used it before the study. The results indicated that a positive attitude towards and usage of ChatGPT were determined by factors such as ease of use, a positive attitude towards technology, social influence, perceived usefulness, behavioral/cognitive influences, low perceived risks, and low anxiety. Confirmatory factor analysis indicated the adequacy of the "TAME-ChatGPT" constructs. Multivariate analysis demonstrated that attitude towards ChatGPT usage was significantly influenced by country of residence, age, university type, and recent academic performance. This study validated "TAME-ChatGPT" as a useful tool for assessing ChatGPT adoption among university students. The successful integration of ChatGPT in higher education relies on perceived ease of use, perceived usefulness, a positive attitude towards technology, social influence, behavioral/cognitive elements, low anxiety, and minimal perceived risks. Policies for ChatGPT adoption in higher education should be tailored to individual contexts, considering the variations in student attitudes observed in this study.
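The multivariate step can be illustrated with an ordinary least-squares model of attitude scores on the reported predictors; this is a simplified stand-in on simulated data, not the authors' exact analysis, and every column name below is an assumption.

```python
# Sketch: OLS regression of a composite attitude score on demographic predictors (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 300
df = pd.DataFrame({
    "attitude": rng.normal(3.5, 0.6, n),                                  # composite attitude score
    "country": rng.choice(["Jordan", "Iraq", "Kuwait", "Egypt", "Lebanon"], n),
    "age": rng.integers(18, 30, n),
    "university_type": rng.choice(["public", "private"], n),
    "gpa": rng.choice(["high", "average", "low"], n),                     # recent academic performance
})

fit = smf.ols("attitude ~ C(country) + age + C(university_type) + C(gpa)", data=df).fit()
print(fit.params.round(3))
```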
Affiliation(s)
- Maram Abdaljaleel
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, 11942, Jordan
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, 11942, Jordan
| | - Muna Barakat
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, 11931, Jordan
| | - Mariam Alsanafi
- Department of Pharmacy Practice, Faculty of Pharmacy, Kuwait University, Kuwait City, Kuwait
- Department of Pharmaceutical Sciences, Public Authority for Applied Education and Training, College of Health Sciences, Safat, Kuwait
| | - Nesreen A Salim
- Prosthodontic Department, School of Dentistry, The University of Jordan, Amman, 11942, Jordan
- Prosthodontic Department, Jordan University Hospital, Amman, 11942, Jordan
| | - Husam Abazid
- Department of Clinical Pharmacy and Therapeutics, Faculty of Pharmacy, Applied Science Private University, Amman, 11931, Jordan
| | - Diana Malaeb
- College of Pharmacy, Gulf Medical University, P.O. Box 4184, Ajman, United Arab Emirates
| | - Ali Haider Mohammed
- School of Pharmacy, Monash University Malaysia, Jalan Lagoon Selatan, 47500, Bandar Sunway, Selangor Darul Ehsan, Malaysia
| | | | | | - Sinan Subhi Farhan
- Department of Anesthesia, Al Rafidain University College, Baghdad, 10001, Iraq
| | - Sami El Khatib
- Department of Biomedical Sciences, School of Arts and Sciences, Lebanese International University, Bekaa, Lebanon
- Center for Applied Mathematics and Bioinformatics (CAMB), Gulf University for Science and Technology (GUST), 32093, Hawally, Kuwait
| | - Mohamad Rahal
- School of Pharmacy, Lebanese International University, Beirut, 961, Lebanon
| | - Ali Sahban
- School of Dentistry, The University of Jordan, Amman, 11942, Jordan
| | - Doaa H Abdelaziz
- Pharmacy Practice and Clinical Pharmacy Department, Faculty of Pharmacy, Future University in Egypt, Cairo, 11835, Egypt
- Department of Clinical Pharmacy, Faculty of Pharmacy, Al-Baha University, Al-Baha, Saudi Arabia
| | - Noha O Mansour
- Clinical Pharmacy and Pharmacy Practice Department, Faculty of Pharmacy, Mansoura University, Mansoura, 35516, Egypt
- Clinical Pharmacy and Pharmacy Practice Department, Faculty of Pharmacy, Mansoura National University, Dakahlia Governorate, 7723730, Egypt
| | - Reem AlZayer
- Clinical Pharmacy Practice, Department of Pharmacy, Mohammed Al-Mana College for Medical Sciences, 34222, Dammam, Saudi Arabia
| | - Roaa Khalil
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, 11942, Jordan
| | - Feten Fekih-Romdhane
- The Tunisian Center of Early Intervention in Psychosis, Department of Psychiatry "Ibn Omrane", Razi Hospital, 2010, Manouba, Tunisia
- Faculty of Medicine of Tunis, Tunis El Manar University, Tunis, Tunisia
| | - Rabih Hallit
- School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, Jounieh, Lebanon
- Department of Infectious Disease, Bellevue Medical Center, Mansourieh, Lebanon
- Department of Infectious Disease, Notre Dame des Secours, University Hospital Center, Byblos, Lebanon
| | - Souheil Hallit
- School of Medicine and Medical Sciences, Holy Spirit University of Kaslik, Jounieh, Lebanon
- Research Department, Psychiatric Hospital of the Cross, Jal Eddib, Lebanon
| | - Malik Sallam
- Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, 11942, Jordan.
- Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, 11942, Jordan.
| |
26
Odabashian R, Bastin D, Jones G, Manzoor M, Tangestaniapour S, Assad M, Lakhani S, Odabashian M, McGee S. Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks. JMIR AI 2024; 3:e50442. [PMID: 38875575 PMCID: PMC11041475 DOI: 10.2196/50442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/02/2023] [Revised: 10/05/2023] [Accepted: 11/19/2023] [Indexed: 06/16/2024]
Abstract
BACKGROUND ChatGPT (OpenAI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple-choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain whether the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. OBJECTIVE This study aims to evaluate whether ChatGPT-3.5's knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making. METHODS We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple-choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP. RESULTS Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) correct answers. The program demonstrated varying levels of accuracy across cancer types and disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program's performance across the predefined subcategories of diagnosis, treatment, and other (P=.16). CONCLUSIONS This study evaluated ChatGPT-3.5's oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance on ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT's clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings.
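The subcategory comparison reported here resembles a chi-square test of independence between subcategory and correctness; the contingency counts below are hypothetical.

```python
# Sketch: chi-square test of independence between question subcategory and correctness (hypothetical counts).
from scipy.stats import chi2_contingency

#            correct  incorrect
observed = [[210,     165],      # diagnosis
            [250,     195],      # treatment
            [123,      97]]      # other

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.2f}")
```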
Affiliation(s)
- Roupen Odabashian
- Department of Oncology, Barbara Ann Karmanos Cancer Institute, Wayne State University, Detroit, MI, United States
| | - Donald Bastin
- Department of Medicine, Division of Internal Medicine, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
| | - Georden Jones
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
| | | | | | - Malke Assad
- Department of Plastic Surgery, University of Pittsburgh Medical Center, Pittsburgh, PA, United States
| | - Sunita Lakhani
- Department of Medicine, Division of Internal Medicine, Jefferson Abington Hospital, Philadelphia, PA, United States
| | - Maritsa Odabashian
- Mary A Rackham Institute, University of Michigan, Ann Arbor, MI, United States
- The Ottawa Hospital Research Institute, Ottawa, ON, Canada
| | - Sharon McGee
- Department of Medicine, Division of Medical Oncology, The Ottawa Hospital and the University of Ottawa, Ottawa, ON, Canada
- Cancer Therapeutics Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada
| |
27
Kollitsch L, Eredics K, Marszalek M, Rauchenwald M, Brookman-May SD, Burger M, Körner-Riffard K, May M. How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology. World J Urol 2024; 42:20. [PMID: 38197996 DOI: 10.1007/s00345-023-04749-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Accepted: 11/02/2023] [Indexed: 01/11/2024] Open
Abstract
PURPOSE This study is a comparative analysis of three Large Language Models (LLMs), evaluating their rate of correct answers (RoCA) and the reliability of their answers on a set of urological knowledge questions spanning different levels of complexity. METHODS ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and corresponding improvement in RoCA. RESULTS Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed across the three LLMs. CONCLUSIONS The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, their limited response reliability adds to existing concerns about their current utility for educational purposes.
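The between-round agreement reported here can be reproduced in outline as raw agreement plus Cohen κ over repeated answer sets; the simulated answer strings and the 80% stability assumption are illustrative only.

```python
# Sketch: raw agreement and Cohen kappa between two testing rounds of one model.
# Answers are simulated option letters for 100 questions; stability of 0.8 is an assumption.
import random
from sklearn.metrics import cohen_kappa_score

random.seed(42)
round1 = [random.choice("ABCD") for _ in range(100)]
round2 = [a if random.random() < 0.8 else random.choice("ABCD") for a in round1]

agreement = sum(a == b for a, b in zip(round1, round2)) / len(round1)
kappa = cohen_kappa_score(round1, round2)
print(f"raw agreement = {agreement:.0%}, Cohen kappa = {kappa:.2f}")
```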
Affiliation(s)
- Lisa Kollitsch
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
| | - Klaus Eredics
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Department of Urology, Paracelsus Medical University, Salzburg, Austria
| | - Martin Marszalek
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
| | - Michael Rauchenwald
- Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- European Board of Urology, Arnhem, The Netherlands
| | - Sabine D Brookman-May
- Department of Urology, University of Munich, LMU, Munich, Germany
- Johnson and Johnson Innovative Medicine, Research and Development, Spring House, PA, USA
| | - Maximilian Burger
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
| | - Katharina Körner-Riffard
- Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
| | - Matthias May
- Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany.
| |
28
Sallam M, Al-Salahat K. Below average ChatGPT performance in medical microbiology exam compared to university students. FRONTIERS IN EDUCATION 2023; 8. [DOI: 10.3389/feduc.2023.1333415] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/01/2024]
Abstract
Background: The transformative potential of artificial intelligence (AI) in higher education is evident, with conversational models like ChatGPT poised to reshape teaching and assessment methods. The rapid evolution of AI models requires continuous evaluation. AI-based models can offer personalized learning experiences but raise accuracy concerns. Multiple-choice questions (MCQs) are widely used for competency assessment. The aim of this study was to evaluate ChatGPT performance on medical microbiology MCQs compared with student performance. Methods: The study employed an 80-MCQ dataset from a 2021 medical microbiology exam in the University of Jordan Doctor of Dental Surgery (DDS) Medical Microbiology 2 course. The exam contained 40 midterm and 40 final MCQs, authored by a single instructor without copyright issues. The MCQs were categorized based on the revised Bloom's Taxonomy into four categories: Remember, Understand, Analyze, or Evaluate. Metrics, including the facility index and discriminative efficiency, were derived from the performance of 153 DDS students on the midterm and 154 on the final exam. ChatGPT 3.5 was used to answer the questions, and responses were assessed for correctness and clarity by two independent raters. Results: ChatGPT 3.5 correctly answered 64 of 80 medical microbiology MCQs (80%) but scored below the student average (80.5/100 vs. 86.21/100). Incorrect ChatGPT responses were more common in MCQs with longer answer choices (p = 0.025). ChatGPT 3.5 performance varied across cognitive domains: Remember (88.5% correct), Understand (82.4% correct), Analyze (75% correct), and Evaluate (72% correct), with no statistically significant differences (p = 0.492). Correct ChatGPT responses received significantly higher average clarity and correctness scores than incorrect responses. Conclusion: The study findings emphasize the need for ongoing refinement and evaluation of ChatGPT performance. ChatGPT 3.5 showed the potential to correctly and clearly answer medical microbiology MCQs; nevertheless, its performance was below par compared with the students'. Variability in ChatGPT performance across cognitive domains should be considered in future studies. These insights could contribute to the ongoing evaluation of the role of AI-based models in educational assessment and their use to augment traditional methods in higher education.
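A sketch of tabulating ChatGPT accuracy against the student facility index by Bloom level follows; the per-item data below are invented to show the shape of the analysis, not the study's results.

```python
# Sketch: per-Bloom-level accuracy versus student facility index (hypothetical per-item data).
import pandas as pd

items = pd.DataFrame({
    "bloom":           ["Remember"]*26 + ["Understand"]*34 + ["Analyze"]*12 + ["Evaluate"]*8,
    "chatgpt_correct": [1]*23 + [0]*3 + [1]*28 + [0]*6 + [1]*9 + [0]*3 + [1]*6 + [0]*2,
    "facility_index":  [0.86]*26 + [0.84]*34 + [0.80]*12 + [0.78]*8,   # student proportion correct
})

summary = items.groupby("bloom").agg(chatgpt_accuracy=("chatgpt_correct", "mean"),
                                     student_facility=("facility_index", "mean"))
print(summary)
```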
29
Talyshinskii A, Naik N, Hameed BMZ, Zhanbyrbekuly U, Khairli G, Guliev B, Juilebø-Jones P, Tzelves L, Somani BK. Expanding horizons and navigating challenges for enhanced clinical workflows: ChatGPT in urology. Front Surg 2023; 10:1257191. [PMID: 37744723 PMCID: PMC10512827 DOI: 10.3389/fsurg.2023.1257191] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Accepted: 08/28/2023] [Indexed: 09/26/2023] Open
Abstract
Purpose of review: ChatGPT has emerged as a potential tool for facilitating doctors' workflows. However, few studies have examined its application in a urological context. Our objective was therefore to analyze the pros and cons of ChatGPT use and how it can be exploited by urologists. Recent findings: ChatGPT can facilitate clinical documentation and note-taking, patient communication and support, medical education, and research. In urology, ChatGPT has shown potential as a virtual healthcare aide for benign prostatic hyperplasia, an educational and prevention tool for prostate cancer, educational support for urological residents, and an assistant in writing urological papers and academic work. However, several concerns about its use are raised, such as the lack of web crawling, the risk of accidental plagiarism, and concerns about patient data privacy. Summary: The existing limitations point to the need for further improvement of ChatGPT, such as ensuring the privacy of patient data, expanding the learning dataset to include medical databases, and developing guidance on its appropriate use. Urologists can also help by conducting studies to determine the effectiveness of ChatGPT in clinical scenarios and nosologies other than those previously listed.
Affiliation(s)
- Ali Talyshinskii
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Nithesh Naik
- Department of Mechanical and Industrial Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
| | | | | | - Gafur Khairli
- Department of Urology, Astana Medical University, Astana, Kazakhstan
| | - Bakhman Guliev
- Department of Urology, Mariinsky Hospital, St Petersburg, Russia
| | | | - Lazaros Tzelves
- Department of Urology, National and Kapodistrian University of Athens, Sismanogleion Hospital, Athens, Marousi, Greece
| | - Bhaskar Kumar Somani
- Department of Urology, University Hospital Southampton NHS Trust, Southampton, United Kingdom
| |