1
Mavrych V, Yaqinuddin A, Bolgova O. Claude, ChatGPT, Copilot, and Gemini performance versus students in different topics of neuroscience. Advances in Physiology Education 2025; 49:430-437. PMID: 39824512; DOI: 10.1152/advan.00093.2024.
Abstract
Despite extensive studies of large language models and their capability to respond to questions from various licensing exams, there has been limited focus on employing chatbots for specific subjects within the medical curriculum, particularly medical neuroscience. This research compared the performance of Claude 3.5 Sonnet (Anthropic), GPT-3.5 and GPT-4-1106 (OpenAI), the free version of Copilot (Microsoft), and Gemini 1.5 Flash (Google) with that of students on multiple-choice questions (MCQs) from the medical neuroscience course database to evaluate chatbot reliability. Five successive attempts by each chatbot to answer 200 United States Medical Licensing Examination (USMLE)-style questions were evaluated based on accuracy, relevance, and comprehensiveness. MCQs were categorized into 12 categories/topics. The results indicated that, at the current level of development, the selected AI-driven chatbots can, on average, accurately answer 67.2% of MCQs from the medical neuroscience course, which is 7.4% below the students' average. However, Claude and GPT-4 outperformed the other chatbots, with 83% and 81.7% correct answers, respectively, which is better than the average student result. They were followed by Copilot (59.5%), GPT-3.5 (58.3%), and Gemini (53.6%). Across categories, Neurocytology, Embryology, and Diencephalon were the three best topics, with average results of 78.1-86.7%, and the lowest results were for Brain stem, Special senses, and Cerebellum, with 54.4-57.7% correct answers. Our study suggests that Claude and GPT-4 are currently two of the most evolved chatbots. They exhibit proficiency in answering MCQs related to neuroscience that surpasses that of the average medical student. This indicates a significant milestone in how AI can supplement and enhance educational tools and techniques. NEW & NOTEWORTHY This research evaluates the effectiveness of different AI-driven large language models (Claude, ChatGPT, Copilot, and Gemini) compared with medical students in answering neuroscience questions. The study offers insights into the specific areas of neuroscience in which these chatbots may excel or have limitations, providing a comprehensive analysis of chatbots' current capabilities in processing and interacting with certain topics of the basic medical sciences curriculum.
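Editor's note: to make the aggregation behind results like these concrete, the sketch below tallies per-model and per-topic accuracy from repeated attempts. It is a minimal illustration, not the authors' analysis; the long-format table and its column names (model, topic, attempt, correct) are hypothetical.

```python
# Minimal sketch: aggregating repeated chatbot attempts into per-model and
# per-topic accuracy. The rows below are illustrative placeholders.
import pandas as pd

results = pd.DataFrame({
    "model":   ["Claude", "Claude", "GPT-4", "GPT-4", "Gemini", "Gemini"],
    "topic":   ["Neurocytology", "Cerebellum"] * 3,
    "attempt": [1, 2, 1, 2, 1, 2],
    "correct": [1, 1, 1, 0, 0, 1],   # 1 = correct answer, 0 = incorrect
})

# Mean accuracy per model, pooled over attempts and questions.
per_model = results.groupby("model")["correct"].mean().mul(100).round(1)

# Mean accuracy per topic, mirroring the best/worst category ranking
# reported in the abstract.
per_topic = results.groupby("topic")["correct"].mean().mul(100).round(1)

print(per_model)
print(per_topic)
```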
Affiliation(s)
- Volodymyr Mavrych
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
- Ahmed Yaqinuddin
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
- Olena Bolgova
- College of Medicine, Alfaisal University, Riyadh, Kingdom of Saudi Arabia
2
Diamond CJ, Thate J, Withall JB, Lee RY, Cato K, Rossetti SC. Generative AI Demonstrated Difficulty Reasoning on Nursing Flowsheet Data. AMIA Annual Symposium Proceedings 2025; 2024:349-358. PMID: 40417556; PMCID: PMC12099445.
Abstract
Excessive documentation burden is linked to clinician burnout, motivating efforts to reduce it. Generative artificial intelligence (AI) offers opportunities for burden reduction but requires rigorous assessment. We evaluated the ability of a large language model (LLM), OpenAI's GPT-4, to interpret various intervention-response relationships presented on nursing flowsheets, assessing performance using MUC-5 evaluation metrics, and compared its assessments with those of nurse expert evaluators. ChatGPT correctly assessed 3 of 14 clinical scenarios and partially correctly assessed 6 of 14, frequently omitting data from its reasoning. Nurse expert evaluators correctly assessed all relationships and provided additional language reflective of standard nursing practice beyond the intervention-response relationships evidenced in nursing flowsheets. Future work should ensure that the training data used for electronic health record (EHR)-integrated LLMs include all types of narrative nursing documentation that reflect nurses' clinical reasoning, and that verification of LLM-based information summarization does not burden end users.
Affiliation(s)
- Jennifer Thate
- Columbia University Department of Biomedical Informatics, New York, NY
- Siena College, Loudonville, NY
- Rachel Y Lee
- Columbia University School of Nursing, New York, NY
- Kenrick Cato
- University of Pennsylvania School of Nursing, Philadelphia, PA
- Sarah C Rossetti
- Columbia University Department of Biomedical Informatics, New York, NY
- Columbia University School of Nursing, New York, NY
3
Masison J, Lehmann HP, Wan J. Utilization of Computable Phenotypes in Electronic Health Record Research: A Review and Case Study in Atopic Dermatitis. J Invest Dermatol 2025; 145:1008-1016. PMID: 39488781; PMCID: PMC12018156; DOI: 10.1016/j.jid.2024.08.025.
Abstract
Querying electronic health record databases to accurately identify specific cohorts of patients has countless observational and interventional research applications. Computable phenotypes are computationally executable, explicit sets of selection criteria composed of data elements, logical expressions, and a combination of natural language processing and machine learning techniques that enable expedited patient cohort identification. Phenotyping encompasses a range of implementations, each with advantages and use cases. In this paper, the dermatologic computable phenotype literature is reviewed. We identify and evaluate approaches and community supports for computable phenotyping that have been used both generally and within dermatology and, as a case study, focus on studied phenotypes for atopic dermatitis.
Affiliation(s)
- Joseph Masison
- University of Connecticut School of Medicine, Farmington, Connecticut, USA
- Harold P Lehmann
- Division of General Internal Medicine, Department of Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA
- Joy Wan
- Department of Dermatology, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA.
4
Ahmed M, Lam J, Chow A, Chow CM. A Primer on Large Language Models (LLMs) and ChatGPT for Cardiovascular Healthcare Professionals. CJC Open 2025; 7:660-666. PMID: 40433202; PMCID: PMC12105510; DOI: 10.1016/j.cjco.2025.02.012.
Abstract
Generative artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT, is transforming healthcare by offering novel ways to synthesize and communicate medical knowledge. This development is especially relevant in cardiology, where patient education, clinical decision-making, and administrative workflows play pivotal roles. ChatGPT, originally built on GPT-3 and refined into GPT-4, can simplify complex cardiology literature, translate technical explanations into plain language, and address questions across different linguistic backgrounds. Studies show that although ChatGPT demonstrates considerable promise in performing text-based tasks, ranging from passing portions of the European Exam in Core Cardiology to creating patient-friendly educational materials, its inability to interpret images remains a major limitation. Meanwhile, concerns around false information, data bias, and ethical issues highlight the need for careful oversight. Future directions include integrating LLMs with computer-vision modules for image-based diagnostics and combining unstructured patient data to improve risk prediction and phenotyping. Social-media research suggests that chatbots sometimes provide more empathetic responses than physicians, underscoring both their potential advantages and their complexities. LLM-based tools can also generate letters for insurance prior authorizations or appeals, helping reduce administrative burden. New multimodal approaches, such as ChatGPT Vision, have the potential to enable direct image processing, although clinical validation of this function has yet to be established. The judicious integration of ChatGPT and other LLMs into cardiology requires ongoing validation, robust regulatory frameworks, and strong ethical guidelines to ensure patient privacy, avoid misinformation, and promote equitable healthcare delivery. This review aims to provide a primer on LLMs for cardiovascular professionals, summarizing key applications, current limitations, and prospects in this rapidly evolving field of digital health.
Affiliation(s)
- Muneeb Ahmed
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
- Jeffrey Lam
- Division of Cardiology, Department of Medicine, Queen's University, Kingston, Ontario, Canada
- Chi-Ming Chow
- Department of Medicine, University of Toronto, Toronto, Ontario, Canada
5
Holt NM, Byrne MF. The Role of Artificial Intelligence and Big Data for Gastrointestinal Disease. Gastrointest Endosc Clin N Am 2025; 35:291-308. PMID: 40021230; DOI: 10.1016/j.giec.2024.09.004.
Abstract
Artificial intelligence (AI) is a rapidly evolving presence in all fields and industries, with the ability to both improve quality and reduce the burden of human effort. Gastroenterology is a field with a focus on diagnostic techniques and procedures, and AI and big data have established and growing roles to play. Alongside these opportunities are challenges, which will evolve in parallel.
Affiliation(s)
- Nicholas Mathew Holt
- Gastroenterology and Hepatology Unit, The Canberra Hospital, Yamba Drive, Garran, ACT 2605, Australia.
- Michael Francis Byrne
- Division of Gastroenterology, Vancouver General Hospital, University of British Columbia, UBC Division of Gastroenterology, 5153 - 2775 Laurel Street, Vancouver, British Columbia V5Z 1M9, Canada
6
Hasan SS, Fury MS, Woo JJ, Kunze KN, Ramkumar PN. Ethical Application of Generative Artificial Intelligence in Medicine. Arthroscopy 2025; 41:874-885. PMID: 39689842; DOI: 10.1016/j.arthro.2024.12.011.
Abstract
Generative artificial intelligence (AI) may revolutionize health care, providing solutions that range from enhancing diagnostic accuracy to personalizing treatment plans. However, its rapid and largely unregulated integration into medicine raises ethical concerns related to data integrity, patient safety, and appropriate oversight. One of the primary ethical challenges lies in generative AI's potential to produce misleading or fabricated information, posing risks of misdiagnosis or inappropriate treatment recommendations, which underscore the necessity for robust physician oversight. Transparency also remains a critical concern, as the closed-source nature of many large language models prevents both patients and health care providers from understanding the reasoning behind AI-generated outputs, potentially eroding trust. The lack of regulatory approval for AI as a medical device, combined with concerns around the security of patient-derived data and AI-generated synthetic data, further complicates its safe integration into clinical workflows. Furthermore, synthetic datasets generated by AI, although valuable for augmenting research in areas with scarce data, complicate questions of data ownership, patient consent, and scientific validity. In addition, generative AI's ability to streamline administrative tasks risks depersonalizing care, further distancing providers from patients. These challenges compound the deeper issues plaguing the health care system, including the emphasis on volume and speed over value and expertise. The use of generative AI in medicine brings about mass scaling of synthetic information, thereby necessitating careful adoption to protect patient care and medical advancement. Given these considerations, generative AI applications warrant regulatory and critical scrutiny. Key starting points include establishing strict standards for data security and transparency, implementing oversight akin to institutional review boards to govern data usage, and developing interdisciplinary guidelines that involve developers, clinicians, and ethicists. By addressing these concerns, we can better align generative AI adoption with the core foundations of humanistic health care, preserving patient safety, autonomy, and trust while harnessing AI's transformative potential. LEVEL OF EVIDENCE: Level V, expert opinion.
Affiliation(s)
- Matthew S Fury
- Baton Rouge Orthopaedic Clinic, Baton Rouge, Louisiana, U.S.A
- Joshua J Woo
- Brown University/The Warren Alpert School of Brown University, Providence, Rhode Island, U.S.A
- Kyle N Kunze
- Hospital for Special Surgery, New York, New York, U.S.A
7
Li J, Yang Y, Chen R, Zheng D, Pang PCI, Lam CK, Wong D, Wang Y. Identifying healthcare needs with patient experience reviews using ChatGPT. PLoS One 2025; 20:e0313442. PMID: 40100826; PMCID: PMC11918364; DOI: 10.1371/journal.pone.0313442.
Abstract
BACKGROUND Valuable findings can be obtained through data mining of patients' online reviews. Identifying healthcare needs from the patient's perspective can also more precisely guide improvements in the quality of care and the visit experience, thereby avoiding unnecessary waste of healthcare resources. Large language models (LLMs) are a promising tool for this purpose, as research has demonstrated their strong performance and potential in areas such as data mining and healthcare management. OBJECTIVE We aim to propose a methodology for this problem; specifically, recent breakthroughs in LLMs can be leveraged to effectively understand healthcare needs from patient experience reviews. METHODS We used 504,198 reviews collected from a large online medical platform, haodf.com. We used the reviews to create aspect-based sentiment analysis (ABSA) templates, which categorized patient reviews into three categories reflecting patients' areas of concern. With the introduction of chains of thought, we embedded the ABSA templates into the prompts for ChatGPT, which was then used to identify patient needs. RESULTS Our method achieved a weighted total precision of 0.944, outperforming direct narrative prompting of ChatGPT-4o, which had a weighted total precision of 0.890. Weighted total recall and F1 scores reached 0.884 and 0.912, respectively, surpassing the 0.802 and 0.843 scores for direct narratives in ChatGPT. Finally, the accuracy of the three sampling methods was 91.8%, 91.7%, and 91.2%, with an average accuracy of over 91.5%. CONCLUSIONS Combining ChatGPT with ABSA templates can achieve satisfactory results in analyzing patient reviews. Because our approach also applies to other LLMs, it sheds light on understanding the demands of patients and health consumers with novel models, which can contribute to enhancing patient experience and allocating healthcare resources more effectively.
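Editor's note: for readers unfamiliar with the weighted metrics quoted above, the snippet below shows how weighted precision, recall, and F1 can be computed for a three-category labelling task with scikit-learn. It is a minimal sketch; the category names and labels are hypothetical placeholders, not the study's data or prompts.

```python
# Minimal sketch: weighted precision, recall, and F1 for a three-category
# labelling task. Gold and predicted labels are illustrative placeholders.
from sklearn.metrics import precision_recall_fscore_support

gold      = ["service", "outcome", "cost", "service", "outcome", "outcome"]
predicted = ["service", "outcome", "cost", "outcome", "outcome", "service"]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, predicted, average="weighted", zero_division=0
)
print(f"weighted precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```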
Affiliation(s)
- Jiaxuan Li
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- Yunchu Yang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- Rong Chen
- Department of Rehabilitation Medicine, The First Affiliated Hospital, Sun Yat-Sen University, Guangzhou, China
- Dashun Zheng
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- Chi Kin Lam
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- Dennis Wong
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
- State University of New York, Songdo, Korea
- Yapeng Wang
- Faculty of Applied Sciences, Macao Polytechnic University, Macao, China
8
Grosser J, Düvel J, Hasemann L, Schneider E, Greiner W. Studying the Potential Effects of Artificial Intelligence on Physician Autonomy: Scoping Review. JMIR AI 2025; 4:e59295. PMID: 40080059; PMCID: PMC11950692; DOI: 10.2196/59295.
Abstract
BACKGROUND Physician autonomy has been found to play a role in physician acceptance and adoption of artificial intelligence (AI) in medicine. However, there is still no consensus in the literature on how to define and assess physician autonomy. Furthermore, there is a lack of research focusing specifically on the potential effects of AI on physician autonomy. OBJECTIVE This scoping review addresses the following research questions: (1) How do qualitative studies conceptualize and assess physician autonomy? (2) Which aspects of physician autonomy are addressed by these studies? (3) What are the potential benefits and harms of AI for physician autonomy identified by these studies? METHODS We performed a scoping review of qualitative studies on AI and physician autonomy published before November 6, 2023, by searching MEDLINE and Web of Science. To answer research question 1, we determined whether the included studies explicitly include physician autonomy as a research focus and whether their interview, survey, and focus group questions explicitly name or implicitly include aspects of physician autonomy. To answer research question 2, we extracted the qualitative results of the studies, categorizing them into the 7 components of physician autonomy introduced by Schulz and Harrison. We then inductively formed subcomponents based on the results of the included studies in each component. To answer research question 3, we summarized the potentially harmful and beneficial effects of AI on physician autonomy in each of the inductively formed subcomponents. RESULTS The search yielded 369 studies after duplicates were removed. Of these, 27 studies remained after titles and abstracts were screened. After full texts were screened, we included a total of 7 qualitative studies. Most studies did not explicitly name physician autonomy as a research focus or explicitly address physician autonomy in their interview, survey, and focus group questions. No studies addressed a complete set of components of physician autonomy; while 3 components were addressed by all included studies, 2 components were addressed by none. We identified a total of 11 subcomponents for the 5 components of physician autonomy that were addressed by at least 1 study. For most of these subcomponents, studies reported both potential harms and potential benefits of AI for physician autonomy. CONCLUSIONS Little research to date has explicitly addressed the potential effects of AI on physician autonomy and existing results on these potential effects are mixed. Further qualitative and quantitative research is needed that focuses explicitly on physician autonomy and addresses all relevant components of physician autonomy.
Affiliation(s)
- John Grosser
- Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Juliane Düvel
- Centre for Electronic Public Health Research (CePHR), School of Public Health, Bielefeld University, Bielefeld, Germany
- Lena Hasemann
- Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Emilia Schneider
- Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
- Wolfgang Greiner
- Department of Health Economics and Health Care Management, School of Public Health, Bielefeld University, Bielefeld, Germany
9
Fernández-Pichel M, Pichel JC, Losada DE. Evaluating search engines and large language models for answering health questions. NPJ Digit Med 2025; 8:153. PMID: 40065094; PMCID: PMC11894092; DOI: 10.1038/s41746-025-01546-w.
Abstract
Search engines (SEs) have traditionally been primary tools for information seeking, but new large language models (LLMs) are emerging as powerful alternatives, particularly for question-answering tasks. This study compares the performance of four popular SEs, seven LLMs, and retrieval-augmented generation (RAG) variants in answering 150 health-related questions from the TREC Health Misinformation (HM) Track. Results reveal that SEs correctly answer 50-70% of questions, often hindered by retrieval results that do not address the health question. LLMs deliver higher accuracy, correctly answering about 80% of questions, though their performance is sensitive to input prompts. RAG methods significantly enhance the effectiveness of smaller LLMs, improving accuracy by up to 30% by integrating retrieval evidence.
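Editor's note: the retrieval-augmented pattern evaluated here boils down to placing retrieved evidence in the prompt ahead of the question. The sketch below illustrates that pattern only; the toy lexical retriever, corpus, and prompt wording are hypothetical and are not the paper's pipeline.

```python
# Minimal sketch of a retrieval-augmented prompt: rank passages by overlap with
# the question, then prepend the top passages as evidence. Placeholder data.
def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    # Toy lexical retrieval: score passages by word overlap with the question.
    q_terms = set(question.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q_terms & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_rag_prompt(question: str, passages: list[str]) -> str:
    evidence = "\n".join(f"- {p}" for p in passages)
    return (f"Use only the evidence below to answer yes or no.\n"
            f"Evidence:\n{evidence}\n\nQuestion: {question}\nAnswer:")

corpus = ["Vitamin C does not cure the common cold.",
          "Regular handwashing reduces transmission of respiratory infections."]
question = "Does vitamin C cure the common cold?"
prompt = build_rag_prompt(question, retrieve(question, corpus))
print(prompt)  # This assembled prompt would then be sent to the LLM under test.
```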
Affiliation(s)
- Marcos Fernández-Pichel
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain.
- Juan C Pichel
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
- David E Losada
- Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Galicia, Spain
10
Khabaz K, Newman-Hung NJ, Kallini JR, Kendal J, Christ AB, Bernthal NM, Wessel LE. Assessment of Artificial Intelligence Chatbot Responses to Common Patient Questions on Bone Sarcoma. J Surg Oncol 2025; 131:719-724. PMID: 39470681; PMCID: PMC12065442; DOI: 10.1002/jso.27966.
Abstract
BACKGROUND AND OBJECTIVES The potential impacts of artificial intelligence (AI) chatbots on care for patients with bone sarcoma are poorly understood. Elucidating potential risks and benefits would allow surgeons to define appropriate roles for these tools in clinical care. METHODS Eleven questions on bone sarcoma diagnosis, treatment, and recovery were posed to three AI chatbots. Answers were assessed on a 5-point Likert scale for five clinical accuracy metrics: relevance to the question, balance and lack of bias, basis on established data, factual accuracy, and completeness in scope. Responses were quantitatively assessed for empathy and readability. The Patient Education Materials Assessment Tool (PEMAT) was used to assess understandability and actionability. RESULTS Chatbots scored highly on relevance (4.24) and balance/lack of bias (4.09) but lower on basing responses on established data (3.77), completeness (3.68), and factual accuracy (3.66). Responses generally scored well on understandability (84.30%), while actionability scores were low for questions on treatment (64.58%) and recovery (60.64%). GPT-4 exhibited the highest empathy (4.12). Readability scores averaged between 10.28 for diagnosis questions and 11.65 for recovery questions. CONCLUSIONS While AI chatbots are promising tools, current limitations in factual accuracy and completeness, as well as concerns about inaccessibility to populations with lower health literacy, may significantly limit their clinical utility.
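Editor's note: readability grades in the 10-12 range, as reported above, are typically Flesch-Kincaid-style scores. The sketch below estimates a Flesch-Kincaid grade level with a crude vowel-group syllable heuristic; it illustrates the standard formula and is not the instrument the authors used.

```python
# Minimal sketch: Flesch-Kincaid grade level with a rough syllable heuristic.
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count runs of vowels; every word gets at least one.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade-level formula.
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = ("Osteosarcoma is a malignant bone tumour. "
          "Treatment usually combines chemotherapy and surgical resection.")
print(round(flesch_kincaid_grade(sample), 1))
```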
Affiliation(s)
- Kameel Khabaz
- David Geffen School of Medicine at UCLA, Los Angeles, California, USA
- Jennifer R. Kallini
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Joseph Kendal
- Department of Surgery, University of Calgary, Calgary, Alberta, Canada
- Alexander B. Christ
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Nicholas M. Bernthal
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
- Lauren E. Wessel
- Department of Orthopaedic Surgery, University of California, Los Angeles, California, USA
11
Li R, Wu T. Application of Artificial Intelligence Generated Content in Medical Examinations. Advances in Medical Education and Practice 2025; 16:331-339. PMID: 40026780; PMCID: PMC11871906; DOI: 10.2147/amep.s492895.
Abstract
With the rapid development of large language models, artificial intelligence-generated content (AIGC) presents novel opportunities for constructing medical examination questions. However, it remains unclear how to effectively utilize AIGC for designing medical questions. AIGC is characterized by rapid response and high efficiency, as well as good performance in mimicking clinical realities. In this study, we revealed the limitations inherent in paper-based examinations and provided streamlined instructions for generating questions using AIGC, with a particular focus on multiple-choice questions, case study questions, and video questions. Manual review remains necessary to ensure the accuracy and quality of the generated content. Future development will benefit from technologies such as retrieval-augmented generation, multi-agent systems, and video generation. As AIGC continues to evolve, it is anticipated to bring transformative changes to medical examinations, enhancing the quality of examination preparation and contributing to the effective training of medical students.
Affiliation(s)
- Rui Li
- Emergency Department, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Tong Wu
- National Clinical Research Center for Obstetrical and Gynecological Diseases, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Key Laboratory of Cancer Invasion and Metastasis, Ministry of Education, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
- Department of Obstetrics and Gynecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, People’s Republic of China
12
Yang Z, Yao Z, Tasmin M, Vashisht P, Jang WS, Ouyang F, Wang B, McManus D, Berlowitz D, Yu H. Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study. J Med Internet Res 2025; 27:e65146. PMID: 39919278; PMCID: PMC11845889; DOI: 10.2196/65146.
Abstract
BACKGROUND Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored. OBJECTIVE This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V's newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings. METHODS This cross-sectional study tested GPT-4V, GPT-4, and GPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 Clinical Knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE) (n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V's accuracy was compared with that of 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. The quality of the explanations was evaluated by asking human raters to choose between an explanation by GPT-4V (without a hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians. RESULTS For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in USMLE Step 1, Step 2 Clinical Knowledge, Step 3, and the DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V's accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately. CONCLUSIONS GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.
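Editor's note: the error categories above overlap (an incorrect explanation may fall into more than one category), so percentages do not sum to 100. The sketch below shows one way such overlapping categories can be tallied; the records are illustrative placeholders, not the study's annotations.

```python
# Minimal sketch: tallying overlapping error categories among incorrectly
# answered items. Each record flags which error types apply; placeholder data.
errors = [
    {"inaccurate_text": False, "inference_error": True,  "image_misread": True},
    {"inaccurate_text": True,  "inference_error": False, "image_misread": True},
    {"inaccurate_text": False, "inference_error": True,  "image_misread": False},
]

n = len(errors)
for category in ("inaccurate_text", "inference_error", "image_misread"):
    count = sum(e[category] for e in errors)
    print(f"{category}: {count}/{n} ({100 * count / n:.1f}%)")
```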
Affiliation(s)
- Zhichao Yang
- College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, United States
- Zonghai Yao
- College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, United States
- Mahbuba Tasmin
- College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, United States
- Parth Vashisht
- College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, United States
- Won Seok Jang
- Miner School of Computer & Information Sciences, University of Massachusetts Lowell, Lowell, MA, United States
- Feiyun Ouyang
- Miner School of Computer & Information Sciences, University of Massachusetts Lowell, Lowell, MA, United States
- Beining Wang
- Shanghai Medical College, Fudan University, Shanghai, China
- David McManus
- Department of Medicine, University of Massachusetts Chan Medical School, Worcester, MA, United States
- Dan Berlowitz
- Department of Public Health, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Biomedical and Health Research in Data Sciences, University of Massachusetts Lowell, Lowell, MA, United States
- Hong Yu
- College of Information and Computer Science, University of Massachusetts Amherst, Amherst, MA, United States
- Miner School of Computer & Information Sciences, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Biomedical and Health Research in Data Sciences, University of Massachusetts Lowell, Lowell, MA, United States
- Center for Healthcare Organization and Implementation Research, VA Bedford Health Care System, Bedford, MA, United States
13
Hassan M, Ayad M, Nembhard C, Hayes-Dixon A, Lin A, Janjua M, Franko J, Tee M. Artificial Intelligence Compared to Manual Selection of Prospective Surgical Residents. Journal of Surgical Education 2025; 82:103308. PMID: 39509905; DOI: 10.1016/j.jsurg.2024.103308.
Abstract
BACKGROUND Artificial intelligence (AI) in the selection of residency program applicants is a new tool that is gaining traction, with the aim of screening high numbers of applicants while introducing objectivity and mitigating bias in a traditionally subjective process. This study aims to compare applicants screened by AI software with those screened by a single Program Director (PD) for interview selection. METHODS A single PD at an ACGME-accredited, academic general surgery program screened applicants. A parallel screen by AI software, programmed by the same PD, was conducted on the same pool of applicants. Weighted preferences were assigned in the following order: personal statement, research, medical school rankings, letters of recommendation, personal qualities, board scores, graduate degree, geographic preference, past experiences, program signal, honor society membership, and multilingualism. Statistical analyses were conducted by chi-square, ANOVA, and independent two-sided t-tests. RESULTS Out of 1235 applications, 144 were PD-selected and 150 AI-selected (294 top applications). Twenty applications (7.3%) were both PD- and AI-selected, for a total analysis cohort of 274 prospective residents. We performed two analyses: 1) PD-selected vs. AI-selected vs. both, and 2) PD-selected vs. AI-selected with the overlapping applicants censored. For the first analysis, AI selected significantly more White/Hispanic applicants (p < 0.001), fewer signals (p < 0.001), more AOA honor society members (p = 0.016), and more publications (p < 0.001). When censoring overlapping PD and AI selections, AI selected significantly more White/Hispanic applicants (p < 0.001), fewer signals (p < 0.001), more US medical graduates (p = 0.027), fewer applicants needing visa sponsorship (p = 0.01), younger applicants (p = 0.024), higher USMLE Step 2 CK scores (p < 0.001), and more publications (p < 0.001). CONCLUSIONS There was only a 7% overlap between PD-selected and AI-selected applicants for interview screening in the same applicant pool. Despite the same PD training the AI software, the 2 application pools differed significantly. In its present state, AI may be utilized as a tool in resident application selection but should not completely replace human review. We recommend careful analysis of the performance of each AI model in the respective environment of each institution applying it, as it may alter the group of interviewees.
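Editor's note: the group comparisons reported above rely on standard tests such as chi-square for categorical applicant attributes. The sketch below shows one such comparison between PD-selected and AI-selected applicants; the contingency counts are illustrative placeholders, not study data.

```python
# Minimal sketch: chi-square test of independence for a categorical applicant
# attribute across the two selection groups. Counts are placeholders.
import numpy as np
from scipy.stats import chi2_contingency

#                       attribute present, attribute absent
contingency = np.array([[60, 84],    # PD-selected  (n = 144)
                        [90, 60]])   # AI-selected  (n = 150)

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}")
```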
Affiliation(s)
- Monalisa Hassan
- Department of Surgery, Howard University Hospital, Washington, District of Columbia; Department of Surgery, University of California, Davis, California
- Marco Ayad
- Department of Surgery, Howard University Hospital, Washington, District of Columbia
- Christine Nembhard
- Department of Surgery, Howard University Hospital, Washington, District of Columbia
- Andrea Hayes-Dixon
- Department of Surgery, Howard University Hospital, Washington, District of Columbia
- Anna Lin
- Department of Surgery, Howard University Hospital, Washington, District of Columbia
- Mahin Janjua
- Department of Surgery, Howard University Hospital, Washington, District of Columbia
- Jan Franko
- Department of Surgery, MercyOne Medical Center, Des Moines, Iowa
- May Tee
- Department of Surgery, Howard University Hospital, Washington, District of Columbia.
14
Qin H, Tong Y. Opportunities and Challenges for Large Language Models in Primary Health Care. J Prim Care Community Health 2025; 16:21501319241312571. PMID: 40162893; PMCID: PMC11960148; DOI: 10.1177/21501319241312571.
Abstract
Primary health care (PHC) is the cornerstone of the global health care system and the primary objective for achieving universal health coverage. China's PHC system faces several challenges, including uneven distribution of medical resources, a lack of qualified primary healthcare personnel, ineffective implementation of the hierarchical medical treatment system, and serious difficulties in the prevention and control of chronic diseases. With the rapid advancement of artificial intelligence (AI) technology, large language models (LLMs) demonstrate significant potential in the medical field through their powerful natural language processing and reasoning capabilities, especially in PHC. This review focuses on the various potential applications of LLMs in China's PHC, including health promotion and disease prevention, medical consultation and health management, diagnosis and triage, chronic disease management, and mental health support. Pragmatic obstacles are also analyzed, such as transparency, misrepresentation of outcomes, privacy concerns, and social biases. Future development should emphasize interdisciplinary collaboration and resource sharing, ongoing improvements in health equity, and innovative advancements in medical large models. A safe, effective, equitable, and flexible ethical and legal framework, along with a robust accountability mechanism, needs to be established to support the achievement of universal health coverage.
Affiliation(s)
- Hongyang Qin
- The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
- Beigan Street Community Health Service Center, Xiaoshan District, Hangzhou, China
- Yuling Tong
- The Second Affiliated Hospital of Zhejiang University School of Medicine, Hangzhou, China
15
Arvidsson R, Gunnarsson R, Entezarjou A, Sundemo D, Wikberg C. ChatGPT (GPT-4) versus doctors on complex cases of the Swedish family medicine specialist examination: an observational comparative study. BMJ Open 2024; 14:e086148. PMID: 39730155; PMCID: PMC11683950; DOI: 10.1136/bmjopen-2024-086148.
Abstract
BACKGROUND Recent breakthroughs in artificial intelligence research include the development of generative pretrained transformers (GPT). ChatGPT has been shown to perform well when answering several sets of medical multiple-choice questions. However, it has not been tested for writing free-text assessments of complex cases in primary care. OBJECTIVES To compare the performance of ChatGPT, version GPT-4, with that of real doctors. DESIGN AND SETTING A blinded observational comparative study conducted in the Swedish primary care setting. Responses from GPT-4 and real doctors to cases from the Swedish family medicine specialist examination were scored by blinded reviewers, and the scores were compared. PARTICIPANTS Anonymous responses from the Swedish family medicine specialist examination 2017-2022 were used. OUTCOME MEASURES Primary: the mean difference in scores between GPT-4's responses and randomly selected responses by human doctors, as well as between GPT-4's responses and top-tier responses by human doctors. Secondary: the correlation between differences in response length and response score; the intraclass correlation coefficient between reviewers; and the percentage of maximum score achieved by each group in different subject categories. RESULTS The mean scores were 6.0, 7.2 and 4.5 for randomly selected doctor responses, top-tier doctor responses and GPT-4 responses, respectively, on a 10-point scale. The scores for the random doctor responses were, on average, 1.6 points higher than those of GPT-4 (p<0.001, 95% CI 0.9 to 2.2) and the top-tier doctor scores were, on average, 2.7 points higher than those of GPT-4 (p<0.001, 95 % CI 2.2 to 3.3). Following the release of GPT-4o, the experiment was repeated, although this time with only a single reviewer scoring the answers. In this follow-up, random doctor responses were scored 0.7 points higher than those of GPT-4o (p=0.044). CONCLUSION In complex primary care cases, GPT-4 performs worse than human doctors taking the family medicine specialist examination. Future GPT-based chatbots may perform better, but comprehensive evaluations are needed before implementing chatbots for medical decision support in primary care.
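Editor's note: the primary outcome above is a mean score difference with a 95% confidence interval. The sketch below computes that kind of estimate for paired per-case scores; the scores, the pairing assumption, and the test choice are illustrative and may differ from the study's exact analysis.

```python
# Minimal sketch: mean difference between doctor and GPT-4 scores with a 95%
# confidence interval, treating per-case scores as paired. Placeholder data.
import numpy as np
from scipy import stats

doctor = np.array([6.0, 7.0, 5.5, 6.5, 6.0, 7.5])
gpt4   = np.array([4.5, 5.0, 4.0, 5.5, 4.0, 5.0])

diff = doctor - gpt4
mean_diff = diff.mean()
sem = stats.sem(diff)                          # standard error of the mean difference
t_crit = stats.t.ppf(0.975, df=len(diff) - 1)  # two-sided 95% critical value
ci = (mean_diff - t_crit * sem, mean_diff + t_crit * sem)
t_stat, p_value = stats.ttest_rel(doctor, gpt4)

print(f"mean difference={mean_diff:.2f}, 95% CI=({ci[0]:.2f}, {ci[1]:.2f}), p={p_value:.4f}")
```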
Affiliation(s)
- Rasmus Arvidsson
- General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden
- Hälsocentralen Sankt Hans, Praktikertjänst AB, Lund, Sweden
- Ronny Gunnarsson
- General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden
- Närhälsan, Vårdcentralen Hemlösa, Region Vastra Gotaland, Gothenburg, Sweden
- Artin Entezarjou
- General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden
- David Sundemo
- General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden
- Lerum Primary Healthcare Center, Närhälsan, Lerum, Sweden
- Carl Wikberg
- General Practice / Family Medicine, School of Public Health and Community Medicine, Sahlgrenska Academy, University of Gothenburg Institute of Medicine, Gothenburg, Sweden
- Research, Education, Development & Innovation, Primary Health Care, Region Vastra Gotaland, Gothenburg, Sweden
16
Duggan R, Tsuruda KM. ChatGPT performance on radiation technologist and therapist entry to practice exams. J Med Imaging Radiat Sci 2024; 55:101426. PMID: 38797622; DOI: 10.1016/j.jmir.2024.04.019.
Abstract
BACKGROUND The aim of this study was to describe the proficiency of ChatGPT (GPT-4) on certification-style exams from the Canadian Association of Medical Radiation Technologists (CAMRT) and to describe its performance across multiple exam attempts. METHODS ChatGPT was prompted with questions from CAMRT practice exams in the disciplines of radiological technology, magnetic resonance imaging (MRI), nuclear medicine, and radiation therapy (87-98 questions each). ChatGPT attempted each exam five times. Exam performance was evaluated using descriptive statistics, stratified by discipline and question type (knowledge, application, critical thinking). Light's kappa was used to assess agreement in answers across attempts. RESULTS Using a passing grade of 65%, ChatGPT passed the radiological technology exam only once (20%), MRI all five times (100%), nuclear medicine three times (60%), and radiation therapy all five times (100%). ChatGPT's performance was best on knowledge questions across all disciplines except radiation therapy. It performed worst on critical thinking questions. Agreement in ChatGPT's responses across attempts was substantial within the disciplines of radiological technology, MRI, and nuclear medicine, and almost perfect for radiation therapy. CONCLUSION ChatGPT (GPT-4) was able to pass certification-style exams for radiation technologists and therapists, but its performance varied between disciplines. The algorithm demonstrated substantial to almost perfect agreement in the responses it provided across multiple exam attempts. Future research evaluating ChatGPT's performance on standardized tests should consider using repeated measures.
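Editor's note: Light's kappa, used above to quantify agreement across repeated exam attempts, is the mean of all pairwise Cohen's kappa values. A minimal sketch with hypothetical answer strings, not the study's responses:

```python
# Minimal sketch: Light's kappa = mean of pairwise Cohen's kappas across
# repeated attempts at the same questions. Answers are placeholders.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

attempts = [
    ["A", "B", "C", "D", "A", "B"],   # attempt 1
    ["A", "B", "C", "D", "A", "C"],   # attempt 2
    ["A", "B", "D", "D", "A", "B"],   # attempt 3
]

pairwise = [cohen_kappa_score(a, b) for a, b in combinations(attempts, 2)]
lights_kappa = sum(pairwise) / len(pairwise)
print(f"Light's kappa across attempts: {lights_kappa:.3f}")
```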
Affiliation(s)
- Ryan Duggan
- School of Health Sciences, Dalhousie University, Halifax, Nova Scotia, Canada; Miramichi Regional Hospital, Horizon Health Network, New Brunswick, Canada.
17
Ozden I, Gokyar M, Ozden ME, Sazak Ovecoglu H. Assessment of artificial intelligence applications in responding to dental trauma. Dent Traumatol 2024; 40:722-729. PMID: 38742754; DOI: 10.1111/edt.12965.
Abstract
BACKGROUND This study assessed the consistency and accuracy of responses provided by two artificial intelligence (AI) applications, ChatGPT and Google Bard (Gemini), to questions related to dental trauma. MATERIALS AND METHODS Based on the International Association of Dental Traumatology guidelines, 25 dichotomous (yes/no) questions were posed to ChatGPT and Google Bard over 10 days. The responses were recorded and compared with the correct answers. Statistical analyses, including Fleiss kappa, were conducted to determine the agreement and consistency of the responses. RESULTS Analysis of 4500 responses revealed that both applications provided correct answers to 57.5% of the questions. Google Bard demonstrated a moderate level of agreement, with varying rates of incorrect answers and referrals to physicians. CONCLUSIONS Although ChatGPT and Google Bard are potential knowledge resources, their consistency and accuracy in responding to dental trauma queries remain limited. Further research involving specially trained AI models in endodontics is warranted to assess their suitability for clinical use.
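Editor's note: Fleiss' kappa, named above, measures agreement when the same items are answered repeatedly. A minimal sketch using statsmodels, with hypothetical yes/no answers rather than the study's responses:

```python
# Minimal sketch: Fleiss' kappa for repeated dichotomous answers.
# Rows are questions, columns are repeated runs; placeholder data.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

answers = np.array([
    ["yes", "yes", "yes", "yes"],
    ["no",  "no",  "yes", "no"],
    ["yes", "yes", "yes", "no"],
    ["no",  "no",  "no",  "no"],
    ["yes", "no",  "yes", "yes"],
])

table, _categories = aggregate_raters(answers)  # per-question category counts
print(f"Fleiss' kappa: {fleiss_kappa(table):.3f}")
```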
Affiliation(s)
- Idil Ozden
- Department of Endodontics, Marmara University Faculty of Dentistry, Istanbul, Turkey
- Merve Gokyar
- Department of Endodontics, Marmara University Faculty of Dentistry, Istanbul, Turkey
- Mustafa Enes Ozden
- Department of Public Health, Hacettepe University Faculty of Medicine, Ankara, Turkey
- Hesna Sazak Ovecoglu
- Department of Endodontics, Marmara University Faculty of Dentistry, Istanbul, Turkey
18
Suárez A, Jiménez J, Llorente de Pedro M, Andreu-Vázquez C, Díaz-Flores García V, Gómez Sánchez M, Freire Y. Beyond the Scalpel: Assessing ChatGPT's potential as an auxiliary intelligent virtual assistant in oral surgery. Comput Struct Biotechnol J 2024; 24:46-52. PMID: 38162955; PMCID: PMC10755495; DOI: 10.1016/j.csbj.2023.11.058.
Abstract
AI has revolutionized the way we interact with technology. Noteworthy advances in AI algorithms and large language models (LLMs) have led to the development of natural generative language (NGL) systems such as ChatGPT. Although these LLMs can simulate human conversations and generate content in real time, they face challenges related to the topicality and accuracy of the information they generate. This study aimed to assess whether ChatGPT-4 could provide accurate and reliable answers to general dentists in the field of oral surgery, and thus to explore its potential as an intelligent virtual assistant in clinical decision making in oral surgery. Thirty questions related to oral surgery were posed to ChatGPT-4, with each question repeated 30 times, yielding a total of 900 responses. Two surgeons graded the answers according to the guidelines of the Spanish Society of Oral Surgery, using a three-point Likert scale (correct, partially correct/incomplete, and incorrect). Disagreements were arbitrated by an experienced oral surgeon, who provided the final grade. Accuracy was found to be 71.7%, and the consistency of the experts' grading across iterations ranged from moderate to almost perfect. ChatGPT-4, with its potential capabilities, will inevitably be integrated into dental disciplines, including oral surgery. In the future, it could be considered an auxiliary intelligent virtual assistant, though it would never replace oral surgery experts. Proper training and information verified by experts will remain vital to the implementation of the technology. More comprehensive research is needed to ensure the safe and successful application of AI in oral surgery.
Affiliation(s)
- Ana Suárez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Jaime Jiménez
- Department of Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- María Llorente de Pedro
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Cristina Andreu-Vázquez
- Department of Veterinary Medicine, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Víctor Díaz-Flores García
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Margarita Gómez Sánchez
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
- Yolanda Freire
- Department of Pre-Clinic Dentistry, Faculty of Biomedical and Health Sciences, Universidad Europea de Madrid, Calle Tajo s/n, Villaviciosa de Odón, 28670 Madrid, Spain
19
Abhari S, Afshari Y, Fatehi F, Salmani H, Garavand A, Chumachenko D, Zakerabasali S, Morita PP. Exploring ChatGPT in clinical inquiry: a scoping review of characteristics, applications, challenges, and evaluation. Ann Med Surg (Lond) 2024; 86:7094-7104. PMID: 39649918; PMCID: PMC11623824; DOI: 10.1097/ms9.0000000000002716.
Abstract
Introduction Recent advancements in generative AI, exemplified by ChatGPT, hold promise for healthcare applications such as decision-making support, education, and patient engagement. However, rigorous evaluation is crucial to ensure reliability and safety in clinical contexts. This scoping review explores ChatGPT's role in clinical inquiry, focusing on its characteristics, applications, challenges, and evaluation. Methods This review, conducted in 2023, followed PRISMA-ScR guidelines (Supplemental Digital Content 1, http://links.lww.com/MS9/A636). Searches were performed across PubMed, Scopus, IEEE, Web of Science, Cochrane, and Google Scholar using relevant keywords. The review explored ChatGPT's effectiveness in various medical domains, evaluation methods, target users, and comparisons with other AI models. Data synthesis and analysis incorporated both quantitative and qualitative approaches. Results Analysis of 41 academic studies highlights ChatGPT's potential in medical education, patient care, and decision support, though performance varies by medical specialty and linguistic context. GPT-3.5, frequently referenced in 26 studies, demonstrated adaptability across diverse scenarios. Challenges include limited access to official answer keys and inconsistent performance, underscoring the need for ongoing refinement. Evaluation methods, including expert comparisons and statistical analyses, provided significant insights into ChatGPT's efficacy. The identification of target users, such as medical educators and nonexpert clinicians, illustrates its broad applicability. Conclusion ChatGPT shows significant potential in enhancing clinical practice and medical education. Nevertheless, continuous refinement is essential for its successful integration into healthcare, aiming to improve patient care outcomes, and address the evolving needs of the medical community.
Affiliation(s)
- Shahabeddin Abhari
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
- Yasna Afshari
- Department of Radiology and Nuclear Medicine, Erasmus MC University Medical Center Rotterdam, Rotterdam
- Department of Epidemiology, Erasmus MC University Medical Center Rotterdam, Rotterdam, The Netherlands
- Farhad Fatehi
- Business School, The University of Queensland, Brisbane, Australia
- Hosna Salmani
- Department of Health Information Management, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
- Ali Garavand
- Department of Health Information Technology, School of Allied Medical Sciences, Lorestan University of Medical Sciences, Khorramabad, Iran
- Dmytro Chumachenko
- Department of Mathematical Modeling and Artificial Intelligence, National Aerospace University ‘Kharkiv Aviation Institute’, Kharkiv, Ukraine
- Somayyeh Zakerabasali
- Department of Health Information Management, Clinical Education Research Center, Health Human Resources Research Center, School of Health Management and Information Sciences, Shiraz University of Medical Sciences, Shiraz, Iran
- Plinio P. Morita
- School of Public Health Sciences, University of Waterloo, Waterloo, Ontario, Canada
- Department of Systems Design Engineering, University of Waterloo
- Research Institute for Aging, University of Waterloo, Waterloo, Ontario, Canada
- Centre for Digital Therapeutics, Techna Institute, University Health Network, Toronto
- Dalla Lana School of Public Health, Institute of Health Policy, Management, and Evaluation, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
20
|
Pavone M, Palmieri L, Bizzarri N, Rosati A, Campolo F, Innocenzi C, Taliento C, Restaino S, Catena U, Vizzielli G, Akladios C, Ianieri MM, Marescaux J, Campo R, Fanfani F, Scambia G. Artificial Intelligence, the ChatGPT Large Language Model: Assessing the Accuracy of Responses to the Gynaecological Endoscopic Surgical Education and Assessment (GESEA) Level 1-2 knowledge tests. Facts Views Vis Obgyn 2024; 16:449-456. [PMID: 39718328 DOI: 10.52054/fvvo.16.4.052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/25/2024] Open
Abstract
Background In 2022, OpenAI launched ChatGPT 3.5, which is now widely used in medical education, training, and research. Despite its valuable use for the generation of information, concerns persist about its authenticity and accuracy. Its undisclosed information source and outdated dataset pose risks of misinformation. Although it is widely used, AI-generated text inaccuracies raise doubts about its reliability. The ethical use of such technologies is crucial to uphold scientific accuracy in research. Objective This study aimed to assess the accuracy of ChatGPT in completing GESEA tests 1 and 2. Materials and Methods The 100 multiple-choice theoretical questions from GESEA certifications 1 and 2 were presented to ChatGPT, requesting the selection of the correct answer along with an explanation. Expert gynaecologists evaluated and graded the explanations for accuracy. Main outcome measures ChatGPT showed a 59% accuracy in responses, with 64% providing comprehensive explanations. It performed better in GESEA Level 1 (64% accuracy) than in GESEA Level 2 (54% accuracy) questions. Conclusions ChatGPT is a versatile tool in medicine and research, offering knowledge and information, and promoting evidence-based practice. Despite its widespread use, its accuracy has not been validated yet. This study found a 59% correct response rate, highlighting the need for accuracy validation and ethical use considerations. Future research should investigate ChatGPT's truthfulness in subspecialty fields such as gynaecologic oncology and compare different versions of the chatbot for continuous improvement. What is new? Artificial intelligence (AI) has great potential in scientific research. However, the validity of outputs remains unverified. This study aims to evaluate the accuracy of responses generated by ChatGPT to enhance the critical use of this tool.
Collapse
|
21
|
Lee J, Park S, Shin J, Cho B. Analyzing evaluation methods for large language models in the medical field: a scoping review. BMC Med Inform Decis Mak 2024; 24:366. [PMID: 39614219 PMCID: PMC11606129 DOI: 10.1186/s12911-024-02709-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Accepted: 10/03/2024] [Indexed: 12/01/2024] Open
Abstract
BACKGROUND Owing to the rapid growth in the popularity of Large Language Models (LLMs), various performance evaluation studies have been conducted to confirm their applicability in the medical field. However, there is still no clear framework for evaluating LLMs. OBJECTIVE This study reviews studies on LLM evaluations in the medical field and analyzes the research methods used in these studies. It aims to provide a reference for future researchers designing LLM studies. METHODS & MATERIALS We conducted a scoping review of three databases (PubMed, Embase, and MEDLINE) to identify LLM-related articles published between January 1, 2023, and September 30, 2023. We analyzed the types of methods, number of questions (queries), evaluators, repeat measurements, additional analysis methods, use of prompt engineering, and metrics other than accuracy. RESULTS A total of 142 articles met the inclusion criteria. LLM evaluation was primarily categorized as either providing test examinations (n = 53, 37.3%) or being evaluated by a medical professional (n = 80, 56.3%), with some hybrid cases (n = 5, 3.5%) or a combination of the two (n = 4, 2.8%). Most studies had 100 or fewer questions (n = 18, 29.0%), 15 (24.2%) performed repeated measurements, 18 (29.0%) performed additional analyses, and 8 (12.9%) used prompt engineering. For medical assessment, most studies used 50 or fewer queries (n = 54, 64.3%), had two evaluators (n = 43, 48.3%), and 14 (14.7%) used prompt engineering. CONCLUSIONS More research is required regarding the application of LLMs in healthcare. Although previous studies have evaluated performance, future studies will likely focus on improving performance. A well-structured methodology is required for these studies to be conducted systematically.
Collapse
Affiliation(s)
- Junbok Lee
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
| | - Sungkyung Park
- Department of Bigdata AI Management Information, Seoul National University of Science and Technology, Seoul, Republic of Korea
| | - Jaeyong Shin
- Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, 50-1, Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea.
- Institute of Health Services Research, Yonsei University College of Medicine, Seoul, Korea.
| | - Belong Cho
- Department of Human Systems Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University Hospital, Seoul, Republic of Korea.
- Department of Family Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul, 03080, Republic of Korea.
| |
Collapse
|
22
|
Ho CN, Tian T, Ayers AT, Aaron RE, Phillips V, Wolf RM, Mathioudakis N, Dai T, Klonoff DC. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: a narrative review. BMC Med Inform Decis Mak 2024; 24:357. [PMID: 39593074 PMCID: PMC11590327 DOI: 10.1186/s12911-024-02757-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/06/2024] [Accepted: 11/08/2024] [Indexed: 11/28/2024] Open
Abstract
BACKGROUND The large language models (LLMs), most notably ChatGPT, released since November 30, 2022, have prompted shifting attention to their use in medicine, particularly for supporting clinical decision-making. However, there is little consensus in the medical community on how LLM performance in clinical contexts should be evaluated. METHODS We performed a literature review of PubMed to identify publications between December 1, 2022, and April 1, 2024, that discussed assessments of LLM-generated diagnoses or treatment plans. RESULTS We selected 108 relevant articles from PubMed for analysis. The most frequently used LLMs were GPT-3.5, GPT-4, Bard, LLaMa/Alpaca-based models, and Bing Chat. The five most frequently used criteria for scoring LLM outputs were "accuracy", "completeness", "appropriateness", "insight", and "consistency". CONCLUSIONS The most frequently used criteria for defining high-quality LLMs have been consistently selected by researchers over the past 1.5 years. We identified a high degree of variation in how studies reported their findings and assessed LLM performance. Standardized reporting of qualitative evaluation metrics that assess the quality of LLM outputs can be developed to facilitate research studies on LLMs in healthcare.
Collapse
Affiliation(s)
- Cindy N Ho
- Diabetes Technology Society, Burlingame, CA, USA
| | - Tiffany Tian
- Diabetes Technology Society, Burlingame, CA, USA
| | | | | | - Vidith Phillips
- School of Medicine, Johns Hopkins University, Baltimore, MD, USA
| | - Risa M Wolf
- Division of Pediatric Endocrinology, The Johns Hopkins Hospital, Baltimore, MD, USA
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
| | | | - Tinglong Dai
- Hopkins Business of Health Initiative, Johns Hopkins University, Washington, DC, USA
- Carey Business School, Johns Hopkins University, Baltimore, MD, USA
- School of Nursing, Johns Hopkins University, Baltimore, MD, USA
| | - David C Klonoff
- Diabetes Research Institute, Mills-Peninsula Medical Center, 100 South San Mateo Drive, Room 1165, San Mateo, CA, 94401, USA.
| |
Collapse
|
23
|
Ros-Arlanzón P, Perez-Sempere A. Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain. JMIR MEDICAL EDUCATION 2024; 10:e56762. [PMID: 39622707 PMCID: PMC11611784 DOI: 10.2196/56762] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/26/2024] [Revised: 07/29/2024] [Accepted: 10/07/2024] [Indexed: 12/06/2024]
Abstract
Background With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. Objective This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI's capabilities and limitations in medical knowledge. Methods We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom's Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. Results Human participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5's coefficient of 0.69. Conclusions This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4's performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment.
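The κ coefficient reported above as a measure of response consistency can be reproduced for any pair of paired answer sets with standard tooling. The following is a minimal sketch with made-up answer letters rather than the study's data, assuming scikit-learn is available.

```python
# Minimal sketch: Cohen's kappa for agreement between two sets of answers
# to the same multiple-choice questions (hypothetical data).
from sklearn.metrics import cohen_kappa_score

attempt_1 = ["A", "C", "B", "D", "A", "B", "C", "D"]  # e.g., first run
attempt_2 = ["A", "C", "D", "D", "A", "B", "C", "A"]  # e.g., second run

kappa = cohen_kappa_score(attempt_1, attempt_2)
print(f"kappa = {kappa:.2f}")  # values near 1 indicate high agreement
```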
Collapse
Affiliation(s)
- Pablo Ros-Arlanzón
- Department of Neurology, Dr. Balmis General University Hospital, C/ Pintor Baeza, Nº 11, Alicante, 03010, Spain, 34 965933000
- Department of Neuroscience, Instituto de Investigación Sanitaria y Biomédica de Alicante, Alicante, Spain
| | - Angel Perez-Sempere
- Department of Neurology, Dr. Balmis General University Hospital, C/ Pintor Baeza, Nº 11, Alicante, 03010, Spain, 34 965933000
- Department of Neuroscience, Instituto de Investigación Sanitaria y Biomédica de Alicante, Alicante, Spain
- Department of Clinical Medicine, Miguel Hernández University, Alicante, Spain
| |
Collapse
|
24
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton E, Malin B, Yin Z. Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review. J Med Internet Res 2024; 26:e22769. [PMID: 39509695 PMCID: PMC11582494 DOI: 10.2196/22769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/23/2024] [Revised: 09/19/2024] [Accepted: 10/03/2024] [Indexed: 11/15/2024] Open
Abstract
BACKGROUND The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. OBJECTIVE This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. METHODS We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. RESULTS Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% of papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there were 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. CONCLUSIONS Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Biomedical Engineering, ShanghaiTech University, Shanghai, China
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
| | - Ellen Clayton
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN, United States
- School of Law, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Bradley Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, United States
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
| |
Collapse
|
25
|
Liu F, Chang X, Zhu Q, Huang Y, Li Y, Wang H. Assessing clinical medicine students' acceptance of large language model: based on technology acceptance model. BMC MEDICAL EDUCATION 2024; 24:1251. [PMID: 39490999 PMCID: PMC11533422 DOI: 10.1186/s12909-024-06232-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/02/2024] [Accepted: 10/21/2024] [Indexed: 11/05/2024]
Abstract
While large language models (LLMs) have demonstrated significant potential in medical education, there is limited understanding of medical students' acceptance of LLMs and the factors influencing their use. This study explores medical students' acceptance of LLMs in learning and examines the factors influencing this acceptance through the lens of the Technology Acceptance Model (TAM). A questionnaire survey conducted among Chinese medical students revealed a high willingness to use LLMs in their studies. The findings suggest that attitudes play a crucial role in predicting medical students' behavioral intentions to use LLMs, mediating the effects of perceived usefulness, perceived ease of use, and perceived risk. Additionally, perceived risk and social influence directly impact behavioral intentions. This study provides compelling evidence supporting the applicability of the TAM to the acceptance of LLMs in medical education, highlighting the necessity for medical students to utilize LLMs as an auxiliary tool in their learning process.
Collapse
Affiliation(s)
- Fuze Liu
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China
| | - Xiao Chang
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China
| | - Qi Zhu
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China
| | - Yue Huang
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China
| | - Yifei Li
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China
| | - Hai Wang
- Department of Orthopaedic Surgery, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, No. 1 Shuaifuyuan, Beijing, 100730, People's Republic of China.
| |
Collapse
|
26
|
Pereyra L, Schlottmann F, Steinberg L, Lasa J. Colorectal Cancer Prevention: Is Chat Generative Pretrained Transformer (Chat GPT) ready to Assist Physicians in Determining Appropriate Screening and Surveillance Recommendations? J Clin Gastroenterol 2024; 58:1022-1027. [PMID: 38319619 DOI: 10.1097/mcg.0000000000001979] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/11/2023] [Accepted: 01/12/2024] [Indexed: 02/07/2024]
Abstract
OBJECTIVE To determine whether a publicly available advanced language model could help determine appropriate colorectal cancer (CRC) screening and surveillance recommendations. BACKGROUND Poor physician knowledge or inability to accurately recall recommendations might affect adherence to CRC screening guidelines. Adoption of newer technologies can help improve the delivery of such preventive care services. METHODS An assessment with 10 multiple choice questions, including 5 CRC screening and 5 CRC surveillance clinical vignettes, was inputted into chat generative pretrained transformer (ChatGPT) 3.5 in 4 separate sessions. Responses were recorded and screened for accuracy to determine the reliability of this tool. The mean number of correct answers was then compared against a control group of gastroenterologists and colorectal surgeons answering the same questions with and without the help of a previously validated CRC screening mobile app. RESULTS The average overall performance of ChatGPT was 45%. The mean number of correct answers was 2.75 (95% CI: 2.26-3.24), 1.75 (95% CI: 1.26-2.24), and 4.5 (95% CI: 3.93-5.07) for screening, surveillance, and total questions, respectively. ChatGPT showed inconsistency and gave a different answer in 4 questions among the different sessions. A total of 238 physicians also responded to the assessment; 123 (51.7%) without and 115 (48.3%) with the mobile app. The mean number of total correct answers of ChatGPT was significantly lower than those of physicians without [5.62 (95% CI: 5.32-5.92)] and with the mobile app [7.71 (95% CI: 7.39-8.03); P < 0.001]. CONCLUSIONS Large language models developed with artificial intelligence require further refinements to serve as reliable assistants in clinical practice.
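The per-domain summaries above (mean number of correct answers with a 95% CI across repeated sessions) can be reproduced with a simple t-based interval. The sketch below is illustrative only, with hypothetical session counts in place of the study's data, and assumes NumPy and SciPy are installed.

```python
# Hedged sketch: mean correct answers across repeated chatbot sessions
# with a 95% confidence interval (hypothetical counts, not the study's).
import numpy as np
from scipy import stats

correct_per_session = np.array([3, 2, 3, 3])  # e.g., 5 screening questions, 4 sessions

mean = correct_per_session.mean()
sem = stats.sem(correct_per_session)  # standard error of the mean
ci_low, ci_high = stats.t.interval(0.95, df=len(correct_per_session) - 1,
                                   loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```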
Collapse
Affiliation(s)
- Lisandro Pereyra
- Department of Gastroenterology
- Endoscopy Unit, Department of Surgery
| | - Francisco Schlottmann
- Endoscopy Unit, Department of Surgery
- Department of Surgery, Hospital Alemán of Buenos Aires
| | - Leandro Steinberg
- Department of Gastroenterology, Fundacion Favaloro, Buenos Aires, Argentina
| | - Juan Lasa
- Department of Gastroenterology, CEMIC, Buenos Aires, Argentina
| |
Collapse
|
27
|
Attanasio M, Mazza M, Le Donne I, Masedu F, Greco MP, Valenti M. Does ChatGPT have a typical or atypical theory of mind? Front Psychol 2024; 15:1488172. [PMID: 39534470 PMCID: PMC11554496 DOI: 10.3389/fpsyg.2024.1488172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2024] [Accepted: 10/18/2024] [Indexed: 11/16/2024] Open
Abstract
In recent years, the capabilities of Large Language Models (LLMs), such as ChatGPT, to imitate human behavioral patterns have been attracting growing interest from experimental psychology. Although ChatGPT can successfully generate accurate theoretical and inferential information in several fields, its ability to exhibit a Theory of Mind (ToM) is a topic of debate and interest in the literature. Impairments in ToM are considered responsible for social difficulties in many clinical conditions, such as Autism Spectrum Disorder (ASD). Some studies showed that ChatGPT can successfully pass classical ToM tasks; however, the response style LLMs use to solve advanced ToM tasks, and how their abilities compare with those of typically developing (TD) individuals and clinical populations, have not been explored. In this preliminary study, we administered the Advanced ToM Test and the Emotion Attribution Task to ChatGPT-3.5 and ChatGPT-4 and compared their responses with those of an ASD group and a TD group. Our results showed that the two LLMs had higher accuracy in understanding mental states, although ChatGPT-3.5 failed with more complex mental states. In understanding emotional states, ChatGPT-3.5 performed significantly worse than TDs but did not differ from ASDs, showing difficulty with negative emotions. ChatGPT-4 achieved higher accuracy, but difficulties with recognizing sadness and anger persisted. The style adopted by both LLMs appeared verbose and repetitive, tending to violate Grice's maxims. This conversational style seems similar to that adopted by high-functioning ASDs. Clinical implications and potential applications are discussed.
Collapse
Affiliation(s)
- Margherita Attanasio
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
| | - Monica Mazza
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
- Reference Regional Centre for Autism, Abruzzo Region, Local Health Unit, L’Aquila, Italy
| | - Ilenia Le Donne
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
| | - Francesco Masedu
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
| | - Maria Paola Greco
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
| | - Marco Valenti
- Department of Biotechnological and Applied Clinical Sciences, University of L’Aquila, L’Aquila, Italy
- Reference Regional Centre for Autism, Abruzzo Region, Local Health Unit, L’Aquila, Italy
| |
Collapse
|
28
|
Goodings AJ, Kajitani S, Chhor A, Albakri A, Pastrak M, Kodancha M, Ives R, Lee YB, Kajitani K. Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study. JMIR MEDICAL EDUCATION 2024; 10:e56128. [PMID: 39378442 PMCID: PMC11479358 DOI: 10.2196/56128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/11/2024] [Revised: 05/12/2024] [Accepted: 08/15/2024] [Indexed: 10/10/2024]
Abstract
Background This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results In our study, ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI.
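The McNemar test mentioned above compares paired correct/incorrect outcomes for the same questions answered by the two versions. The sketch below shows how such a comparison might be run with statsmodels; the 2x2 counts are assumptions for illustration, not the study's results.

```python
# Illustrative sketch: McNemar's test on paired outcomes for two model
# configurations answering the same question set (assumed counts).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#                       regular correct   regular wrong
# custom robot correct        250               16
# custom robot wrong           12               22
table = np.array([[250, 16],
                  [12, 22]])

result = mcnemar(table, exact=True)  # exact binomial test on discordant pairs
print(f"statistic = {result.statistic}, p = {result.pvalue:.3f}")
```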
Collapse
Affiliation(s)
| | - Sten Kajitani
- School of Medicine, University College Cork, Cork, Ireland
| | - Allison Chhor
- Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
| | - Ahmad Albakri
- School of Medicine, University College Cork, Cork, Ireland
| | - Mila Pastrak
- School of Medicine, University College Cork, Cork, Ireland
| | - Megha Kodancha
- School of Medicine, University College Cork, Cork, Ireland
| | - Rowan Ives
- Saint John's College, University of Oxford, St Giles', Oxford, OX1 3JP, United Kingdom, 44 0 75 4383 4
| | - Yoo Bin Lee
- Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
| | - Kari Kajitani
- Department of Emergency Medicine, University of California, San Diego, San Diego, CA, United States
| |
Collapse
|
29
|
Casey JC, Dworkin M, Winschel J, Molino J, Daher M, Katarincic JA, Gil JA, Akelman E. ChatGPT: A concise Google alternative for people seeking accurate and comprehensive carpal tunnel syndrome information. HAND SURGERY & REHABILITATION 2024; 43:101757. [PMID: 39103051 DOI: 10.1016/j.hansur.2024.101757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/03/2024] [Revised: 07/29/2024] [Accepted: 07/30/2024] [Indexed: 08/07/2024]
Abstract
Popular artificial intelligence systems, like ChatGPT, may be used by anyone to generate humanlike answers to questions. This study assessed whether ChatGPT version 3.5 (ChatGPTv3.5) or the first five results from a Google search provide more accurate, complete, and concise answers to the most common questions patients have about carpal tunnel syndrome. Three orthopedic hand surgeons blindly graded the answers using Likert scales to assess accuracy, completeness, and conciseness. ChatGPTv3.5 and the first five Google results provide answers to carpal tunnel syndrome questions that are similar in accuracy and completeness, but ChatGPTv3.5 answers are more concise. ChatGPTv3.5, being freely accessible to the public, is therefore a good resource for patients seeking concise, Google-equivalent answers to specific medical questions regarding carpal tunnel syndrome. ChatGPTv3.5, given its lack of updated sourcing and risk of presenting false information, should not replace frequently updated academic websites as the primary online medical resource for patients.
Collapse
Affiliation(s)
- Jack C Casey
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA.
| | - Myles Dworkin
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Julia Winschel
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Janine Molino
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Mohammad Daher
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Julia A Katarincic
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Joseph A Gil
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| | - Edward Akelman
- Department of Orthopaedic Surgery, Warren Alpert Medical School of Brown University, Providence, RI, USA
| |
Collapse
|
30
|
Hager P, Jungmann F, Holland R, Bhagat K, Hubrecht I, Knauer M, Vielhauer J, Makowski M, Braren R, Kaissis G, Rueckert D. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 2024; 30:2613-2622. [PMID: 38965432 PMCID: PMC11405275 DOI: 10.1038/s41591-024-03097-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2024] [Accepted: 05/29/2024] [Indexed: 07/06/2024]
Abstract
Clinical decision-making is one of the most impactful parts of a physician's responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.
Collapse
Affiliation(s)
- Paul Hager
- Institute for AI and Informatics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany.
- Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany.
| | - Friederike Jungmann
- Institute for AI and Informatics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
- Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
| | | | - Kunal Bhagat
- Department of Medicine, ChristianaCare Health System, Wilmington, DE, USA
| | - Inga Hubrecht
- Department of Medicine III, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
| | - Manuel Knauer
- Department of Medicine III, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
| | - Jakob Vielhauer
- Department of Medicine II, University Hospital of the Ludwig Maximilian University of Munich, Munich, Germany
| | - Marcus Makowski
- Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
| | - Rickmer Braren
- Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
| | - Georgios Kaissis
- Institute for AI and Informatics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
- Institute for Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
- Department of Computing, Imperial College, London, UK
- Reliable AI Group, Institute for Machine Learning in Biomedical Imaging, Helmholtz Munich, Munich, Germany
| | - Daniel Rueckert
- Institute for AI and Informatics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
- Department of Computing, Imperial College, London, UK
| |
Collapse
|
31
|
Lucas MM, Yang J, Pomeroy JK, Yang CC. Reasoning with large language models for medical question answering. J Am Med Inform Assoc 2024; 31:1964-1975. [PMID: 38960731 PMCID: PMC11339506 DOI: 10.1093/jamia/ocae131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 05/13/2024] [Accepted: 05/20/2024] [Indexed: 07/05/2024] Open
Abstract
OBJECTIVES To investigate approaches to reasoning with large language models (LLMs) and to propose a new prompting approach, ensemble reasoning, to improve medical question answering performance with refined reasoning and reduced inconsistency. MATERIALS AND METHODS We used multiple choice questions from the USMLE Sample Exam question files on 2 closed-source commercial and 1 open-source clinical LLM to evaluate our proposed approach, ensemble reasoning. RESULTS Our proposed ensemble reasoning approach outperformed zero-shot chain-of-thought with self-consistency on Step 1, 2, and 3 questions by +3.44%, +4.00%, and +2.54% on GPT-3.5 turbo and by +2.3%, +5.00%, and +4.15% on Med42-70B, respectively. With GPT-4 turbo, there were mixed results, with ensemble reasoning again outperforming zero-shot chain-of-thought with self-consistency on Step 1 questions (+1.15%). In all cases, the results demonstrated improved consistency of responses with our approach. A qualitative analysis of the reasoning from the model demonstrated that the ensemble reasoning approach produces correct and helpful reasoning. CONCLUSION The proposed iterative ensemble reasoning has the potential to improve the performance of LLMs in medical question answering tasks, particularly with the less powerful LLMs like GPT-3.5 turbo and Med42-70B, which may suggest that this is a promising approach for LLMs with lower capabilities. Additionally, the findings show that our approach helps to refine the reasoning generated by the LLM and thereby improve consistency even with the more powerful GPT-4 turbo. We also identify the potential and need for human-artificial intelligence teaming to improve the reasoning beyond the limits of the model.
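The baseline referenced above, zero-shot chain-of-thought with self-consistency, amounts to sampling several independently reasoned answers and taking a majority vote over the final choices. The sketch below illustrates that baseline only, not the authors' ensemble reasoning method; ask_model is a hypothetical placeholder for whatever LLM API call is used.

```python
# Minimal sketch of a self-consistency baseline: sample several
# chain-of-thought answers and return the majority-vote choice.
from collections import Counter

def ask_model(question: str) -> str:
    """Hypothetical placeholder: return one sampled final answer letter
    from a chain-of-thought completion (LLM API call goes here)."""
    raise NotImplementedError

def self_consistent_answer(question: str, n_samples: int = 5) -> str:
    answers = [ask_model(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]  # majority vote
```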
Collapse
Affiliation(s)
- Mary M Lucas
- College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
| | - Justin Yang
- Department of Computer Science, University of Maryland, College Park, MD 20742, United States
| | - Jon K Pomeroy
- College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
- Penn Medicine, Philadelphia, PA 19104, United States
| | - Christopher C Yang
- College of Computing and Informatics, Drexel University, Philadelphia, PA 19104, United States
| |
Collapse
|
32
|
Peled T, Sela HY, Weiss A, Grisaru-Granovsky S, Agrawal S, Rottenstreich M. Evaluating the validity of ChatGPT responses on common obstetric issues: Potential clinical applications and implications. Int J Gynaecol Obstet 2024; 166:1127-1133. [PMID: 38523565 DOI: 10.1002/ijgo.15501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 02/29/2024] [Accepted: 03/10/2024] [Indexed: 03/26/2024]
Abstract
OBJECTIVE To evaluate the quality of ChatGPT responses to common issues in obstetrics and assess its ability to provide reliable responses to pregnant individuals. The study aimed to examine the responses based on expert opinions using predetermined criteria, including "accuracy," "completeness," and "safety." METHODS We curated 15 common and potentially clinically significant questions that pregnant women frequently ask. Two native English-speaking women were asked to reframe the questions in their own words, and we employed the ChatGPT language model to generate responses to the questions. To evaluate the accuracy, completeness, and safety of ChatGPT's generated responses, we developed a questionnaire with a 1-to-5 rating scale and invited obstetrics and gynecology experts from different countries to rate the responses accordingly. The ratings were analyzed to evaluate the average level of agreement and the percentage of positive ratings (≥4) for each criterion. RESULTS Of the 42 experts invited, 20 responded to the questionnaire. The combined score for all responses yielded a mean rating of 4, with 75% of responses receiving a positive rating (≥4). When examining specific criteria, the ChatGPT responses performed best on the accuracy criterion, with a mean rating of 4.2, and 80% of the questions received a positive rating. The responses scored lower on the completeness criterion, with a mean rating of 3.8, and 46.7% of questions received a positive rating. For safety, the mean rating was 3.9 and 53.3% of questions received a positive rating. No response received an average rating below 3. CONCLUSION This study demonstrates promising results regarding the potential use of ChatGPT in providing accurate responses to obstetric clinical questions posed by pregnant women. However, it is crucial to exercise caution when addressing inquiries concerning the safety of the fetus or the mother.
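The summary statistics used here, a mean Likert rating and the percentage of positive ratings (≥4), are straightforward to compute from a list of expert scores. The sketch below uses invented ratings rather than the study's data.

```python
# Hedged sketch: summarizing expert Likert ratings (1-5) for one response.
ratings = [5, 4, 4, 3, 5, 4, 2, 4, 5, 4]  # hypothetical scores from 10 experts

mean_rating = sum(ratings) / len(ratings)
pct_positive = 100 * sum(r >= 4 for r in ratings) / len(ratings)
print(f"mean = {mean_rating:.1f}, positive (>=4) = {pct_positive:.1f}%")
```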
Collapse
Affiliation(s)
- Tzuria Peled
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Hen Y Sela
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Ari Weiss
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Sorina Grisaru-Granovsky
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
| | - Swati Agrawal
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
| | - Misgav Rottenstreich
- Department of Obstetrics and Gynecology, Shaare Zedek Medical Center, Affiliated with the Hebrew University School of Medicine, Jerusalem, Israel
- Division of Maternal-Fetal Medicine, Department of Obstetrics and Gynecology, Hamilton Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Department of Nursing, Jerusalem College of Technology, Jerusalem, Israel
| |
Collapse
|
33
|
Pan G, Ni J. A cross sectional investigation of ChatGPT-like large language models application among medical students in China. BMC MEDICAL EDUCATION 2024; 24:908. [PMID: 39180023 PMCID: PMC11342543 DOI: 10.1186/s12909-024-05871-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/22/2024] [Accepted: 08/07/2024] [Indexed: 08/26/2024]
Abstract
OBJECTIVE To investigate the level of understanding and trust of medical students towards ChatGPT-like large language models, as well as their utilization and attitudes towards these models. METHODS Data collection was concentrated from December 2023 to mid-January 2024, utilizing a self-designed questionnaire to assess the use of large language models among undergraduate medical students at Anhui Medical University. The normality of the data was assessed with Shapiro-Wilk tests. We used Chi-square tests for comparisons of categorical variables, Mann-Whitney U tests for comparisons of ordinal variables and non-normal continuous variables between two groups, Kruskal-Wallis H tests for comparisons of ordinal variables between multiple groups, and Bonferroni tests for post hoc comparisons. RESULTS A total of 1774 questionnaires were distributed and 1718 valid questionnaires were collected, with an effective rate of 96.84%. Among these students, 34.5% had heard of and used large language models. There were statistically significant differences in the understanding of large language models between genders (p < 0.001), grade levels (junior-level versus senior-level students) (p = 0.03), and majors (p < 0.001). Male students, junior-level students, and public health management majors had a higher level of understanding of these models. Gender and major had statistically significant effects on the degree of trust in large language models (p = 0.004; p = 0.02). Male students and nursing students exhibited a higher degree of trust in large language models. As for usage, male and junior-level students showed a significantly higher proportion of using these models for assisted learning (p < 0.001). Neutral sentiments were held by over two-thirds of the students (66.7%) regarding large language models, with only 51 (3.0%) expressing pessimism. There were significant gender-based disparities in attitudes towards large language models, with male students exhibiting a more optimistic attitude towards these models (p < 0.001). Notably, among students with different levels of knowledge and trust in large language models, statistically significant differences were observed in their perceptions of the shortcomings and benefits of these models. CONCLUSION Our study identified gender, grade level, and major as influential factors in students' understanding and utilization of large language models. This also suggested the feasibility of integrating large language models with traditional medical education to further enhance teaching effectiveness in the future.
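The statistical workflow described in the Methods (chi-square for categorical variables, Mann-Whitney U for two groups, Kruskal-Wallis for several groups) maps directly onto SciPy functions. The sketch below uses synthetic numbers in place of the survey data and is illustrative only.

```python
# Illustrative sketch of the test types described (synthetic data).
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu, kruskal

# Chi-square: gender x "has used LLMs" contingency table (hypothetical counts).
table = np.array([[220, 480],
                  [140, 560]])
chi2, p_chi, dof, expected = chi2_contingency(table)

# Mann-Whitney U: ordinal understanding scores in two groups.
group_a = [3, 4, 2, 5, 4, 3, 4]
group_b = [2, 3, 2, 4, 3, 2, 3]
u_stat, p_u = mannwhitneyu(group_a, group_b)

# Kruskal-Wallis: understanding scores across three majors.
p_kw = kruskal([3, 4, 5, 4], [2, 3, 3, 2], [4, 4, 5, 3]).pvalue

print(p_chi, p_u, p_kw)
```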
Collapse
Affiliation(s)
- Guixia Pan
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, Meishan Road 81, Hefei, 230032, Anhui, China.
| | - Jing Ni
- Department of Epidemiology and Biostatistics, School of Public Health, Anhui Medical University, Meishan Road 81, Hefei, 230032, Anhui, China
| |
Collapse
|
34
|
Cherrez-Ojeda I, Gallardo-Bastidas JC, Robles-Velasco K, Osorio MF, Velez Leon EM, Leon Velastegui M, Pauletto P, Aguilar-Díaz FC, Squassi A, González Eras SP, Cordero Carrasco E, Chavez Gonzalez KL, Calderon JC, Bousquet J, Bedbrook A, Faytong-Haro M. Understanding Health Care Students' Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study. JMIR MEDICAL EDUCATION 2024; 10:e51757. [PMID: 39137029 DOI: 10.2196/51757] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/10/2023] [Revised: 09/26/2023] [Accepted: 04/30/2024] [Indexed: 08/15/2024]
Abstract
BACKGROUND ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. OBJECTIVE The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants' attitudes toward the use of ChatGPT. METHODS A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. RESULTS Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was "minimal" (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) "somewhat agreed" that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). CONCLUSIONS Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs.
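The ordinal logistic regression models mentioned above regress an ordinal attitude outcome on predictor scores and report odds ratios. The sketch below shows one way such a model might be fit with statsmodels; the column names and values are invented for illustration and are not the study's dataset.

```python
# Hedged sketch: ordinal (proportional-odds) logistic regression of an
# ordinal attitude score on perceived knowledge and ethics scores.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

df = pd.DataFrame({
    "attitude": pd.Categorical([1, 2, 3, 3, 4, 5, 4, 2, 3, 5, 4, 3, 5, 2],
                               ordered=True),
    "knowledge": [1, 1, 2, 3, 3, 4, 4, 2, 2, 5, 3, 2, 4, 1],
    "ethics":    [2, 2, 3, 3, 4, 5, 4, 3, 3, 5, 4, 3, 5, 2],
})

model = OrderedModel(df["attitude"], df[["knowledge", "ethics"]], distr="logit")
res = model.fit(method="bfgs", disp=False)
odds_ratios = np.exp(res.params[["knowledge", "ethics"]])  # per 1-point increase
print(odds_ratios)
```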
Collapse
Affiliation(s)
- Ivan Cherrez-Ojeda
- Universidad Espiritu Santo, Samborondon, Ecuador
- Respiralab Research Group, Guayaquil, Ecuador
| | | | - Karla Robles-Velasco
- Universidad Espiritu Santo, Samborondon, Ecuador
- Respiralab Research Group, Guayaquil, Ecuador
| | - María F Osorio
- Universidad Espiritu Santo, Samborondon, Ecuador
- Respiralab Research Group, Guayaquil, Ecuador
| | | | | | | | - F C Aguilar-Díaz
- Departamento Salud Pública, Escuela Nacional de Estudios Superiores, Universidad Nacional Autónoma de México, Guanajuato, Mexico
| | - Aldo Squassi
- Universidad de Buenos Aires, Facultad de Odontologìa, Cátedra de Odontología Preventiva y Comunitaria, Buenos Aires, Argentina
| | | | - Erita Cordero Carrasco
- Departamento de cirugía y traumatología bucal y maxilofacial, Universidad de Chile, Santiago, Chile
| | | | - Juan C Calderon
- Universidad Espiritu Santo, Samborondon, Ecuador
- Respiralab Research Group, Guayaquil, Ecuador
| | - Jean Bousquet
- Institute of Allergology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Fraunhofer Institute for Translational Medicine and Pharmacology ITMP, Allergology and Immunology, Berlin, Germany
- MASK-air, Montpellier, France
| | | | - Marco Faytong-Haro
- Respiralab Research Group, Guayaquil, Ecuador
- Universidad Estatal de Milagro, Cdla Universitaria "Dr. Rómulo Minchala Murillo", Milagro, Ecuador
- Ecuadorian Development Research Lab, Daule, Ecuador
| |
Collapse
|
35
|
Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024; 103:e39250. [PMID: 39121303 PMCID: PMC11315549 DOI: 10.1097/md.0000000000039250] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 07/19/2024] [Indexed: 08/11/2024] Open
Abstract
BACKGROUND ChatGPT, a powerful AI language model, has gained increasing prominence in medicine, offering potential applications in healthcare, clinical decision support, patient communication, and medical research. This systematic review aims to comprehensively assess the applications of ChatGPT in healthcare education, research, writing, patient communication, and practice while also delineating potential limitations and areas for improvement. METHODS Our comprehensive database search retrieved relevant papers from PubMed, Medline, and Scopus. After the screening process, 83 studies met the inclusion criteria. This review includes original studies comprising case reports, analytical studies, and editorials with original findings. RESULTS ChatGPT is useful for scientific research and academic writing, and assists with grammar, clarity, and coherence. This helps non-English speakers and improves accessibility by breaking down linguistic barriers. However, its limitations include probable inaccuracy and ethical issues, such as bias and plagiarism. ChatGPT streamlines workflows and offers diagnostic and educational potential in healthcare but exhibits biases and lacks emotional sensitivity. It is useful in patient communication, but requires up-to-date data and faces concerns about the accuracy of information and hallucinatory responses. CONCLUSION Given the potential for ChatGPT to transform healthcare education, research, and practice, it is essential to approach its adoption in these areas with caution due to its inherent limitations.
Collapse
Affiliation(s)
- Afia Fatima
- Department of Medicine, Jinnah Sindh Medical University, Karachi, Pakistan
| | | | - Khadija Alam
- Department of Medicine, Liaquat National Medical College, Karachi, Pakistan
| | | | | |
Collapse
|
36
|
Chow JC, Cheng TY, Chien TW, Chou W. Assessing ChatGPT's Capability for Multiple Choice Questions Using RaschOnline: Observational Study. JMIR Form Res 2024; 8:e46800. [PMID: 39115919 PMCID: PMC11346125 DOI: 10.2196/46800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2023] [Revised: 07/03/2023] [Accepted: 07/31/2023] [Indexed: 08/10/2024] Open
Abstract
BACKGROUND ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, there is a scarcity of studies that assess ChatGPT's competence in addressing multiple-choice questions (MCQs) using the KIDMAP of Rasch analysis, a web-based tool used to evaluate ChatGPT's performance in MCQ answering. OBJECTIVE This study aims to (1) showcase the utility of the website (Rasch analysis, specifically RaschOnline), and (2) determine the grade achieved by ChatGPT when compared to a normal sample. METHODS The capability of ChatGPT was evaluated using 10 items from the English tests conducted for the Taiwan college entrance examinations in 2023. Under a Rasch model, 300 students with normally distributed abilities were simulated to compete with ChatGPT's responses. RaschOnline was used to generate 5 visual presentations, including item difficulties, differential item functioning, item characteristic curves, the Wright map, and the KIDMAP, to address the research objectives. RESULTS The findings revealed the following: (1) the difficulty of the 10 items increased monotonically from easier to harder, represented by logits (-2.43, -1.78, -1.48, -0.64, -0.1, 0.33, 0.59, 1.34, 1.7, and 2.47); (2) evidence of differential item functioning was observed between gender groups for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT's capability was graded as A, surpassing grades B to E. CONCLUSIONS By using RaschOnline, this study provides evidence that ChatGPT can achieve a grade of A when compared to a normal sample. It exhibits excellent proficiency in answering MCQs from the English tests conducted in 2023 for the Taiwan college entrance examinations.
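Under the Rasch model behind these outputs, the probability of a correct answer is a logistic function of the gap between a person's ability and an item's difficulty, both expressed in logits. The short sketch below applies that formula to the ten item-difficulty logits quoted in the results; the ability value is chosen purely for illustration.

```python
# Minimal sketch: Rasch model probability of a correct response,
# P(correct) = exp(theta - b) / (1 + exp(theta - b)).
import math

def rasch_probability(theta: float, b: float) -> float:
    """theta: person ability (logits); b: item difficulty (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Item difficulties reported in the abstract, easiest to hardest.
item_difficulties = [-2.43, -1.78, -1.48, -0.64, -0.1,
                     0.33, 0.59, 1.34, 1.7, 2.47]

theta = 2.0  # illustrative high-ability examinee (e.g., a grade A performer)
print([round(rasch_probability(theta, b), 2) for b in item_difficulties])
```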
Collapse
Affiliation(s)
- Julie Chi Chow
- Department of Pediatrics, Chi Mei Medical Center, Tainan, Taiwan
- Department of Pediatrics, School of Medicine, College of Medicine, Chung Shan Medical University, Taichung, Taiwan
| | - Teng Yun Cheng
- Department of Emergency Medicine, Chi Mei Medical Center, Tainan, Taiwan
| | - Tsair-Wei Chien
- Department of Statistics, Coding Data Analytics, Tainan, Taiwan
| | - Willy Chou
- Department of Physical Medicine and Rehabilitation, Chi Mei Medical Center, Tainan, Taiwan
- Department of Leisure and Sports Management, Far East University, Tainan, Taiwan
| |
Collapse
|
37
|
Laymouna M, Ma Y, Lessard D, Schuster T, Engler K, Lebouché B. Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review. J Med Internet Res 2024; 26:e56930. [PMID: 39042446 PMCID: PMC11303905 DOI: 10.2196/56930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2024] [Revised: 04/07/2024] [Accepted: 04/12/2024] [Indexed: 07/24/2024] Open
Abstract
BACKGROUND Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots' roles, users, benefits, and limitations is available to inform future research and application in the field. OBJECTIVE This review aims to describe health care chatbots' characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. METHODS A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis. RESULTS The review categorized chatbot roles into 2 themes: delivery of remote health services, including patient support, care management, education, skills building, and health behavior promotion, and provision of administrative assistance to health care providers. User groups spanned patients with chronic conditions and patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency, and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. CONCLUSIONS Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use.
Collapse
Affiliation(s)
- Moustafa Laymouna
- Department of Family Medicine, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Infectious Diseases and Immunity in Global Health Program, Research Institute of McGill University Health Centre, Montreal, QC, Canada
| | - Yuanchao Ma
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Infectious Diseases and Immunity in Global Health Program, Research Institute of McGill University Health Centre, Montreal, QC, Canada
- Chronic and Viral Illness Service, Division of Infectious Disease, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
- Department of Biomedical Engineering, Polytechnique Montréal, Montreal, QC, Canada
| | - David Lessard
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Infectious Diseases and Immunity in Global Health Program, Research Institute of McGill University Health Centre, Montreal, QC, Canada
- Chronic and Viral Illness Service, Division of Infectious Disease, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
| | - Tibor Schuster
- Department of Family Medicine, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
| | - Kim Engler
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Infectious Diseases and Immunity in Global Health Program, Research Institute of McGill University Health Centre, Montreal, QC, Canada
- Chronic and Viral Illness Service, Division of Infectious Disease, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
| | - Bertrand Lebouché
- Department of Family Medicine, Faculty of Medicine and Health Sciences, McGill University, Montreal, QC, Canada
- Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, Montreal, QC, Canada
- Infectious Diseases and Immunity in Global Health Program, Research Institute of McGill University Health Centre, Montreal, QC, Canada
- Chronic and Viral Illness Service, Division of Infectious Disease, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
| |
Collapse
|
38
|
Şan H, Bayrakcı Ö, Çağdaş B, Serdengeçti M, Alagöz E. Reliability and readability analysis of ChatGPT-4 and Google Bard as a patient information source for the most commonly applied radionuclide treatments in cancer patients. Rev Esp Med Nucl Imagen Mol 2024; 43:500021. [PMID: 38821410 DOI: 10.1016/j.remnie.2024.500021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2024] [Accepted: 05/21/2024] [Indexed: 06/02/2024]
Abstract
PURPOSE Searching for online health information is a popular approach employed by patients to enhance their knowledge of their diseases. Recently developed AI chatbots are probably the easiest way to do so. The purpose of this study is to analyze the reliability and readability of AI chatbot responses regarding the most commonly applied radionuclide treatments in cancer patients. METHODS Basic patient questions (thirty about RAI, PRRT, and TARE treatments and twenty-nine about PSMA-TRT) were asked one by one to GPT-4 and Bard in January 2024. The reliability and readability of the responses were assessed using the DISCERN scale, the Flesch Reading Ease (FRE), and the Flesch-Kincaid Reading Grade Level (FKRGL). RESULTS The mean (SD) FKRGL scores for the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments were 14.57 (1.19), 14.65 (1.38), 14.25 (1.10), 14.38 (1.2) and 11.49 (1.59), 12.42 (1.71), 11.35 (1.80), 13.01 (1.97), respectively. In terms of readability, the FKRGL scores of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments were above the general public reading grade level. The mean (SD) DISCERN scores assessed by a nuclear medicine physician for the responses of GPT-4 and Bard about RAI, PSMA-TRT, PRRT, and TARE treatments were 47.86 (5.09), 48.48 (4.22), 46.76 (4.09), 48.33 (5.15) and 51.50 (5.64), 53.44 (5.42), 53 (6.36), 49.43 (5.32), respectively. Based on mean DISCERN scores, the reliability of the responses of GPT-4 and Google Bard about RAI, PSMA-TRT, PRRT, and TARE treatments ranged from fair to good. The inter-rater reliability correlation coefficients of DISCERN scores assessed by GPT-4, Bard, and the nuclear medicine physician for the responses of GPT-4 about RAI, PSMA-TRT, PRRT, and TARE treatments were 0.512 (95% CI 0.296 to 0.704), 0.695 (95% CI 0.518 to 0.829), 0.687 (95% CI 0.511 to 0.823), and 0.649 (95% CI 0.462 to 0.798), respectively (p < 0.01). The inter-rater reliability correlation coefficients of DISCERN scores assessed by GPT-4, Bard, and the nuclear medicine physician for the responses of Bard about RAI, PSMA-TRT, PRRT, and TARE treatments were 0.753 (95% CI 0.602 to 0.863), 0.812 (95% CI 0.686 to 0.899), 0.804 (95% CI 0.677 to 0.894), and 0.671 (95% CI 0.489 to 0.812), respectively (p < 0.01). The inter-rater reliability for the responses of Bard and GPT-4 about RAI, PSMA-TRT, PRRT, and TARE treatments was moderate to good. Further, consulting a nuclear medicine physician was rarely emphasized by either GPT-4 or Google Bard, and references were included in some responses of Google Bard but not in those of GPT-4. CONCLUSION Although the information provided by AI chatbots may be acceptable in medical terms, it may not be easy for the general public to read, which can limit how well it is understood. Effective prompts using 'prompt engineering' may refine the responses in a more comprehensible manner. Since radionuclide treatments are specific to nuclear medicine expertise, nuclear medicine physicians need to be stated as consultants in responses in order to guide patients and caregivers to obtain accurate medical advice. Referencing is important for the confidence and satisfaction of patients and caregivers seeking information.
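For readers unfamiliar with the readability metrics used above, the following minimal Python sketch computes Flesch Reading Ease and the Flesch-Kincaid grade level from their standard formulas. The naive syllable counter and the sample sentence are illustrative assumptions; the study itself presumably used dedicated readability software.

```python
import re

def count_syllables(word):
    """Rough syllable count: number of vowel groups (illustrative heuristic only)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid grade level) for a text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)

    # Standard Flesch formulas
    fre = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    return fre, fkgl

sample = ("Radioactive iodine therapy uses a capsule or liquid that targets "
          "thyroid tissue. Most patients tolerate the treatment well.")
fre, fkgl = readability(sample)
print(f"Flesch Reading Ease: {fre:.1f}, Flesch-Kincaid grade level: {fkgl:.1f}")
```

A grade level around 14, as reported for the GPT-4 responses, corresponds roughly to college-level text, well above the commonly recommended sixth-to-eighth-grade level for patient materials.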
Collapse
Affiliation(s)
- Hüseyin Şan
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey.
| | - Özkan Bayrakcı
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
| | - Berkay Çağdaş
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
| | - Mustafa Serdengeçti
- Ankara Bilkent City Hospital, Department of Nuclear Medicine, Ankara, Turkey
| | - Engin Alagöz
- Gulhane Training and Research Hospital, Department of Nuclear Medicine, Ankara, Turkey
| |
Collapse
|
39
|
Heinke A, Radgoudarzi N, Huang BB, Baxter SL. A review of ophthalmology education in the era of generative artificial intelligence. Asia Pac J Ophthalmol (Phila) 2024; 13:100089. [PMID: 39134176 PMCID: PMC11934932 DOI: 10.1016/j.apjo.2024.100089] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2024] [Revised: 07/31/2024] [Accepted: 08/02/2024] [Indexed: 08/18/2024] Open
Abstract
PURPOSE To explore the integration of generative AI, specifically large language models (LLMs), in ophthalmology education and practice, addressing their applications, benefits, challenges, and future directions. DESIGN A literature review and analysis of current AI applications and educational programs in ophthalmology. METHODS Analysis of published studies, reviews, articles, websites, and institutional reports on AI use in ophthalmology. Examination of educational programs incorporating AI, including curriculum frameworks, training methodologies, and evaluations of AI performance on medical examinations and clinical case studies. RESULTS Generative AI, particularly LLMs, shows potential to improve diagnostic accuracy and patient care in ophthalmology. Applications include aiding in patient, physician, and medical students' education. However, challenges such as AI hallucinations, biases, lack of interpretability, and outdated training data limit clinical deployment. Studies revealed varying levels of accuracy of LLMs on ophthalmology board exam questions, underscoring the need for more reliable AI integration. Several educational programs nationwide provide AI and data science training relevant to clinical medicine and ophthalmology. CONCLUSIONS Generative AI and LLMs offer promising advancements in ophthalmology education and practice. Addressing challenges through comprehensive curricula that include fundamental AI principles, ethical guidelines, and updated, unbiased training data is crucial. Future directions include developing clinically relevant evaluation metrics, implementing hybrid models with human oversight, leveraging image-rich data, and benchmarking AI performance against ophthalmologists. Robust policies on data privacy, security, and transparency are essential for fostering a safe and ethical environment for AI applications in ophthalmology.
Collapse
Affiliation(s)
- Anna Heinke
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Jacobs Retina Center, 9415 Campus Point Drive, La Jolla, CA 92037, USA
| | - Niloofar Radgoudarzi
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA
| | - Bonnie B Huang
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA; Northwestern University Feinberg School of Medicine, Chicago, IL, USA
| | - Sally L Baxter
- Division of Ophthalmology Informatics and Data Science, The Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, 9415 Campus Point Drive, La Jolla, CA 92037, USA; Division of Biomedical Informatics, Department of Medicine, University of California San Diego Health System, University of California San Diego, La Jolla, CA, USA.
| |
Collapse
|
40
|
Tessler I, Wolfovitz A, Alon EE, Gecel NA, Livneh N, Zimlichman E, Klang E. ChatGPT's adherence to otolaryngology clinical practice guidelines. Eur Arch Otorhinolaryngol 2024; 281:3829-3834. [PMID: 38647684 DOI: 10.1007/s00405-024-08634-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Accepted: 03/22/2024] [Indexed: 04/25/2024]
Abstract
OBJECTIVES Large language models, including ChatGPT, have the potential to transform the way we approach medical knowledge, yet accuracy in clinical topics is critical. Here we assessed ChatGPT's performance in adhering to the American Academy of Otolaryngology-Head and Neck Surgery guidelines. METHODS We presented ChatGPT with 24 clinical otolaryngology questions based on the guidelines of the American Academy of Otolaryngology. This was done three times (N = 72) to test the model's consistency. Two otolaryngologists evaluated the responses for accuracy and relevance to the guidelines. Cohen's kappa was used to measure evaluator agreement, and Cronbach's alpha assessed the consistency of ChatGPT's responses. RESULTS The study revealed mixed results; 59.7% (43/72) of ChatGPT's responses were highly accurate, while only 2.8% (2/72) directly contradicted the guidelines. The model showed 100% accuracy in Head and Neck, but lower accuracy in Rhinology and Otology/Neurotology (66%), Laryngology (50%), and Pediatrics (8%). The model's responses were consistent for 17/24 questions (70.8%), with a Cronbach's alpha value of 0.87, indicating reasonable consistency across tests. CONCLUSIONS Using a guideline-based set of structured questions, ChatGPT demonstrates consistency but variable accuracy in otolaryngology. Its lower performance in some areas, especially Pediatrics, suggests that further rigorous evaluation is needed before considering real-world clinical use.
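As a quick illustration of the two agreement statistics used above, the sketch below computes Cohen's kappa for two raters and Cronbach's alpha across repeated administrations. The toy rating vectors are assumptions for demonstration, not the study's data.

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    observed = np.mean(r1 == r2)
    expected = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in categories)
    return (observed - expected) / (1 - expected)

def cronbach_alpha(items):
    """Cronbach's alpha; rows = observations, columns = items/administrations."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Toy data: accuracy labels (1 = accurate) from two otolaryngologist raters
rater1 = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1]
rater2 = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
print(f"Cohen's kappa: {cohens_kappa(rater1, rater2):.2f}")

# Toy data: scores for 8 questions (rows) across 3 repeated ChatGPT runs (columns)
runs = [[2, 2, 2], [1, 1, 2], [2, 2, 2], [0, 1, 0],
        [2, 2, 1], [1, 1, 1], [2, 2, 2], [1, 0, 1]]
print(f"Cronbach's alpha: {cronbach_alpha(runs):.2f}")
```

Kappa corrects raw agreement for chance agreement between the two evaluators, while alpha treats the repeated ChatGPT runs as parallel measurements of the same underlying answer quality.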
Collapse
Affiliation(s)
- Idit Tessler
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel.
- School of Medicine, Tel Aviv University, Tel Aviv, Israel.
- ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel.
| | - Amit Wolfovitz
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eran E Alon
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Nir A Gecel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Nir Livneh
- Department of Otolaryngology and Head and Neck Surgery, Sheba Medical Center, Ramat Gan, Israel
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
| | - Eyal Zimlichman
- School of Medicine, Tel Aviv University, Tel Aviv, Israel
- ARC Innovation Center, Sheba Medical Center, Ramat Gan, Israel
- The Sheba Talpiot Medical Leadership Program, Ramat Gan, Israel
- Hospital Management, Sheba Medical Center, Ramat Gan, Israel
| | - Eyal Klang
- The Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, USA
| |
Collapse
|
41
|
Suwała S, Szulc P, Guzowski C, Kamińska B, Dorobiała J, Wojciechowska K, Berska M, Kubicka O, Kosturkiewicz O, Kosztulska B, Rajewska A, Junik R. ChatGPT-3.5 passes Poland's medical final examination-Is it possible for ChatGPT to become a doctor in Poland? SAGE Open Med 2024; 12:20503121241257777. [PMID: 38895543 PMCID: PMC11185017 DOI: 10.1177/20503121241257777] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/21/2024] [Accepted: 05/08/2024] [Indexed: 06/21/2024] Open
Abstract
Objectives ChatGPT is an advanced chatbot based on a large language model that has the ability to answer questions. Undoubtedly, ChatGPT is capable of transforming communication, education, and customer support; however, can it play the role of a doctor? In Poland, prior to obtaining a medical diploma, candidates must successfully pass the Medical Final Examination. Methods The purpose of this research was to determine how well ChatGPT performed on the Polish Medical Final Examination, which must be passed to become a doctor in Poland (an exam is considered passed if at least 56% of the tasks are answered correctly). A total of 2138 categorized Medical Final Examination questions (from 11 examination sessions held between 2013-2015 and 2021-2023) were presented to ChatGPT-3.5 from 19 to 26 May 2023. For further analysis, the questions were divided into quintiles based on difficulty and duration, as well as question types (simple A-type or complex K-type). The answers provided by ChatGPT were compared to the official answer key, reviewed for any changes resulting from the advancement of medical knowledge. Results ChatGPT correctly answered 53.4%-64.9% of questions. In 8 out of 11 exam sessions, ChatGPT achieved the scores required to successfully pass the examination (60%). The correlation between the efficacy of artificial intelligence and the level of complexity, difficulty, and length of a question was found to be negative. AI outperformed humans in one category: psychiatry (77.18% vs. 70.25%, p = 0.081). Conclusions The performance of artificial intelligence is deemed satisfactory; however, it is observed to be markedly inferior to that of human graduates in the majority of instances. Despite its potential utility in many medical areas, ChatGPT is constrained by inherent limitations that prevent it from entirely supplanting human expertise and knowledge.
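The quintile analysis described above can be sketched as follows: questions are binned into difficulty quintiles and per-quintile accuracy is computed, showing how a negative difficulty-accuracy relationship would appear, alongside a check against the 56% pass threshold. The randomly generated toy data and the assumed relationship between human difficulty and ChatGPT accuracy are illustrative assumptions, not the study's question bank.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: per-question human correct-rate (lower = harder) and whether ChatGPT
# answered correctly; the positive dependence is an illustrative assumption
n = 500
human_correct_rate = rng.uniform(0.2, 0.95, size=n)
p_correct = 0.35 + 0.55 * human_correct_rate
chatgpt_correct = rng.random(n) < p_correct

# Bin questions into difficulty quintiles and compute ChatGPT accuracy per quintile
edges = np.quantile(human_correct_rate, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(human_correct_rate, edges)
for q in range(5):
    accuracy = chatgpt_correct[quintile == q].mean()
    print(f"Quintile {q + 1} (1 = hardest): accuracy {accuracy:.1%}")

overall = chatgpt_correct.mean()
print(f"Overall: {overall:.1%} -> {'pass' if overall >= 0.56 else 'fail'} "
      f"at the 56% threshold")
```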
Collapse
Affiliation(s)
- Szymon Suwała
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Paulina Szulc
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Cezary Guzowski
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Barbara Kamińska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Jakub Dorobiała
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Karolina Wojciechowska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Maria Berska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Olga Kubicka
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Oliwia Kosturkiewicz
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Bernadetta Kosztulska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Alicja Rajewska
- Evidence-Based Medicine Students Scientific Club of Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| | - Roman Junik
- Department of Endocrinology and Diabetology, Nicolaus Copernicus University, Collegium Medicum, Bydgoszcz, Poland
| |
Collapse
|
42
|
Bonnechère B. Unlocking the Black Box? A Comprehensive Exploration of Large Language Models in Rehabilitation. Am J Phys Med Rehabil 2024; 103:532-537. [PMID: 38261757 DOI: 10.1097/phm.0000000000002440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2024]
Abstract
ABSTRACT Rehabilitation is a vital component of health care, aiming to restore function and improve the well-being of individuals with disabilities or injuries. Nevertheless, the rehabilitation process is often likened to a "black box," with complexities that pose challenges for comprehensive analysis and optimization. The emergence of large language models offers promising solutions to better understand this "black box." Large language models excel at comprehending and generating human-like text, making them valuable in the healthcare sector. In rehabilitation, healthcare professionals must integrate a wide range of data to create effective treatment plans, akin to selecting the best ingredients for the "black box." Large language models enhance data integration, communication, assessment, and prediction. This article delves into the ground-breaking use of large language models as a tool to further understand the rehabilitation process. Large language models address current rehabilitation issues, including data bias, contextual comprehension, and ethical concerns. Collaboration with healthcare experts and rigorous validation is crucial when deploying large language models. Integrating large language models into rehabilitation yields insights into this intricate process, enhancing data-driven decision making, refining clinical practices, and predicting rehabilitation outcomes. Although challenges persist, large language models represent a significant stride in rehabilitation, underscoring the importance of ethical use and collaboration.
Collapse
Affiliation(s)
- Bruno Bonnechère
- From the REVAL Rehabilitation Research Center, Faculty of Rehabilitation Sciences, Hasselt University, Diepenbeek, Belgium; Technology-Supported and Data-Driven Rehabilitation, Data Sciences Institute, Hasselt University, Diepenbeek, Belgium; and Department of PXL-Healthcare, PXL University of Applied Sciences and Arts, Hasselt, Belgium
| |
Collapse
|
43
|
Zhang Y, Xu L, Ji H. Author's reply: AI in medicine, bridging the chasm between potential and capability. Dig Liver Dis 2024; 56:1116. [PMID: 38521671 DOI: 10.1016/j.dld.2024.02.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/25/2024] [Revised: 02/28/2024] [Accepted: 02/29/2024] [Indexed: 03/25/2024]
Affiliation(s)
- Yiwen Zhang
- Department of Endocrinology and Metabolic Hepatology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Lili Xu
- Department of Endocrinology and Metabolic Hepatology, The Affiliated Hospital of Qingdao University, Qingdao, China
| | - Hongwei Ji
- Tsinghua Medicine, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing, China.
| |
Collapse
|
44
|
Mousavi M, Shafiee S, Harley JM, Cheung JCK, Abbasgholizadeh Rahimi S. Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada. Fam Med Community Health 2024; 12:e002626. [PMID: 38806403 PMCID: PMC11138270 DOI: 10.1136/fmch-2023-002626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/30/2024] Open
Abstract
INTRODUCTION The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and their performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). METHOD Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviewers' score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. RESULT According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. The reviewers' scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of those by GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC score percentage for GPT-4 was 2.31 times that of GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p<0.001). Similarly, the Reviewers' score percentages for responses provided by GPT-4 over 5 rounds were 2.23 times more likely to exceed those of GPT-3.5 (OR: 2.23; 95% CI: 1.22 to 4.06; p=0.009). Running the GPTs after a one-week interval, regenerating the prompt, or using versus not using the prompt did not significantly change the CFPC score percentage. CONCLUSION In our study, we used GPT-3.5 and GPT-4 to answer complex, open-ended sample questions of the CFPC exam and showed that more than 70% of the answers were accurate, and GPT-4 outperformed GPT-3.5 in responding to the questions. Large language models such as GPTs seem promising for assisting candidates for the CFPC exam by providing potential answers. However, their use for family medicine education and exam preparation needs further studies.
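The repeated-measures comparison above relies on a generalized estimating equations (GEE) model. As a simplified illustration, the sketch below fits a binary-logistic GEE (answer judged correct versus not, clustered by question across rounds) with statsmodels and reports the odds ratio for GPT-4 versus GPT-3.5. The simulated data, the binary outcome, and the exchangeable working correlation are assumptions; the study's actual model was an ordinal logistic GEE on scored response lines.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Toy data: 77 questions x 5 rounds x 2 models; outcome = answer judged correct.
# Baseline accuracies are illustrative assumptions loosely echoing the abstract.
rows = []
for q in range(77):
    q_effect = rng.normal(0.0, 0.8)          # question-level clustering effect
    for rnd in range(5):
        for name, base in (("GPT-3.5", 0.74), ("GPT-4", 0.81)):
            logit = np.log(base / (1 - base)) + q_effect
            rows.append({"question": q, "round": rnd,
                         "gpt4": int(name == "GPT-4"),
                         "correct": int(rng.random() < 1 / (1 + np.exp(-logit)))})
df = pd.DataFrame(rows)

# Binary-logistic GEE with an exchangeable working correlation, clustered by question
fit = smf.gee("correct ~ gpt4", "question", data=df,
              family=sm.families.Binomial(),
              cov_struct=sm.cov_struct.Exchangeable()).fit()

print(fit.summary())
print(f"Odds ratio, GPT-4 vs GPT-3.5: {np.exp(fit.params['gpt4']):.2f}")
```

The GEE accounts for the fact that the five repeated attempts on the same question are correlated, so the odds ratio is not inflated by treating every attempt as independent.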
Collapse
Affiliation(s)
- Mehdi Mousavi
- Department of Family Medicine, Faculty of Medicine, University of Saskatchewan, Nipawin, Saskatchewan, Canada
| | - Shabnam Shafiee
- Department of Family Medicine, Saskatchewan Health Authority, Riverside Health Complex, Turtleford, Saskatchewan, Canada
| | - Jason M Harley
- Department of Surgery, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
- Research Institute of the McGill University Health Centre, Montreal, Quebec, Canada
- Institute for Health Sciences Education, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Jackie Chi Kit Cheung
- McGill University School of Computer Science, Montreal, Quebec, Canada
- CIFAR AI Chair, Mila-Quebec AI Institute, Montreal, Quebec, Canada
| | - Samira Abbasgholizadeh Rahimi
- Department of Family Medicine, McGill University, Montreal, Quebec, Canada
- Mila Quebec AI-Institute, Montreal, Quebec, Canada
- Faculty of Dentistry Medicine and Oral Health Sciences, McGill University, Montreal, Quebec, Canada
- Lady Davis Institute for Medical Research, Jewish General Hospital, Montreal, Quebec, Canada
| |
Collapse
|
45
|
Goglia M, Pace M, Yusef M, Gallo G, Pavone M, Petrucciani N, Aurello P. Artificial Intelligence and ChatGPT in Abdominopelvic Surgery: A Systematic Review of Applications and Impact. In Vivo 2024; 38:1009-1015. [PMID: 38688653 PMCID: PMC11059919 DOI: 10.21873/invivo.13534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Revised: 02/27/2024] [Accepted: 03/11/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND/AIM The integration of AI and natural language processing technologies, such as ChatGPT, into surgical practice has shown promising potential in enhancing various aspects of abdominopelvic surgical procedures. This systematic review aims to comprehensively evaluate the current state of research on the applications and impact of artificial intelligence (AI) and ChatGPT in abdominopelvic surgery, summarizing the existing literature to provide a comprehensive overview of the diverse applications, effectiveness, challenges, and future directions of these innovative technologies. MATERIALS AND METHODS A systematic search of major electronic databases, including PubMed, Google Scholar, the Cochrane Library, and Web of Science, was conducted from October to November 2023 to identify relevant studies. Inclusion criteria encompassed studies that investigated the utilization of AI and ChatGPT in abdominopelvic surgical settings, including, but not limited to, preoperative planning, intraoperative decision-making, postoperative care, and patient communication. RESULTS Fourteen studies met the inclusion criteria and were included in this review. The majority of the studies analysed ChatGPT's data output and decision making, while two studies reported patient and general surgery resident perceptions of the tool applied to clinical practice. Most studies reported high accuracy of ChatGPT in data output and decision-making processes, although with a non-negligible number of errors. CONCLUSION This systematic review contributes to the current understanding of the role of AI and ChatGPT in abdominopelvic surgery, providing insight into their applications and impact on clinical practice. The synthesis of available evidence will inform future research directions, clinical guidelines, and the development of these technologies to optimize their potential benefits in enhancing surgical care within the abdominopelvic domain.
Collapse
Affiliation(s)
- Marta Goglia
- Department of Medical and Surgical Sciences and Translational Medicine, School in Translational Medicine and Oncology, Faculty of Medicine and Psychology, Sapienza University of Rome, Rome, Italy
- IHU Strasbourg, Institute of Image-Guided Surgery, Strasbourg, France
- IRCAD, Research Institute Against Digestive Cancer, Strasbourg, France
| | - Marco Pace
- Department of Medical and Surgical Sciences and Translational Medicine, School in Translational Medicine and Oncology, Faculty of Medicine and Psychology, Sapienza University of Rome, Rome, Italy;
| | - Marco Yusef
- Department of Medical and Surgical Sciences and Translational Medicine, School in Translational Medicine and Oncology, Faculty of Medicine and Psychology, Sapienza University of Rome, Rome, Italy
| | - Gaetano Gallo
- Department of Surgery, Sapienza University of Rome, Rome, Italy
| | - Matteo Pavone
- Dipartimento di Scienze per la Salute della Donna e del Bambino e di Sanità Pubblica, Fondazione Policlinico Universitario A. Gemelli, IRCCS, UOC Ginecologia Oncologica, Rome, Italy
| | - Niccolò Petrucciani
- Department of Medical and Surgical Sciences and Translational Medicine, School in Translational Medicine and Oncology, Faculty of Medicine and Psychology, Sapienza University of Rome, Rome, Italy
| | - Paolo Aurello
- Department of Surgery, Sapienza University of Rome, Rome, Italy
| |
Collapse
|
46
|
Varghese C, Harrison EM, O'Grady G, Topol EJ. Artificial intelligence in surgery. Nat Med 2024; 30:1257-1268. [PMID: 38740998 DOI: 10.1038/s41591-024-02970-3] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Accepted: 04/03/2024] [Indexed: 05/16/2024]
Abstract
Artificial intelligence (AI) is rapidly emerging in healthcare, yet applications in surgery remain relatively nascent. Here we review the integration of AI in the field of surgery, centering our discussion on multifaceted improvements in surgical care in the preoperative, intraoperative and postoperative space. The emergence of foundation model architectures, wearable technologies and improving surgical data infrastructures is enabling rapid advances in AI interventions and utility. We discuss how maturing AI methods hold the potential to improve patient outcomes, facilitate surgical education and optimize surgical care. We review the current applications of deep learning approaches and outline a vision for future advances through multimodal foundation models.
Collapse
Affiliation(s)
- Chris Varghese
- Department of Surgery, University of Auckland, Auckland, New Zealand
| | - Ewen M Harrison
- Centre for Medical Informatics, Usher Institute, University of Edinburgh, Edinburgh, UK
| | - Greg O'Grady
- Department of Surgery, University of Auckland, Auckland, New Zealand
- Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand
| | - Eric J Topol
- Scripps Research Translational Institute, La Jolla, CA, USA.
| |
Collapse
|
47
|
Scott IA, Zuccon G. The new paradigm in machine learning - foundation models, large language models and beyond: a primer for physicians. Intern Med J 2024; 54:705-715. [PMID: 38715436 DOI: 10.1111/imj.16393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/18/2024]
Abstract
Foundation machine learning models are deep learning models capable of performing many different tasks using different data modalities such as text, audio, images and video. They represent a major shift from traditional task-specific machine learning prediction models. Large language models (LLM), brought to wide public prominence in the form of ChatGPT, are text-based foundation models that have the potential to transform medicine by enabling automation of a range of tasks, including writing discharge summaries, answering patients' questions and assisting in clinical decision-making. However, such models are not without risk and can potentially cause harm if their development, evaluation and use are devoid of proper scrutiny. This narrative review describes the different types of LLM, their emerging applications and potential limitations and bias, and likely future translation into clinical practice.
Collapse
Affiliation(s)
- Ian A Scott
- Centre for Health Services Research, University of Queensland, Woolloongabba, Australia
| | - Guido Zuccon
- School of Electrical Engineering and Computer Sciences, The University of Queensland, St Lucia, Queensland, Australia
| |
Collapse
|
48
|
Al-Sharif EM, Penteado RC, Dib El Jalbout N, Topilow NJ, Shoji MK, Kikkawa DO, Liu CY, Korn BS. Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence. Ophthalmic Plast Reconstr Surg 2024; 40:303-311. [PMID: 38215452 DOI: 10.1097/iop.0000000000002567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/14/2024]
Abstract
PURPOSE This study evaluates and compares the accuracy of responses from 2 artificial intelligence platforms to patients' oculoplastics-related questions. METHODS Questions directed toward oculoplastic surgeons were collected, rephrased, and input independently into the ChatGPT-3.5 and BARD chatbots, using the prompt: "As an oculoplastic surgeon, how can I respond to my patient's question?" Responses were independently evaluated by 4 experienced oculoplastic specialists as comprehensive, correct but inadequate, mixed correct and incorrect/outdated data, and completely incorrect. Additionally, the empathy level, length, and automated readability index of the responses were assessed. RESULTS A total of 112 patient questions underwent evaluation. The rates of comprehensive, correct but inadequate, mixed, and completely incorrect answers for ChatGPT were 71.4%, 12.9%, 10.5%, and 5.1%, respectively, compared with 53.1%, 18.3%, 18.1%, and 10.5%, respectively, for BARD. ChatGPT showed more empathy (48.9%) than BARD (13.2%). All graders found that ChatGPT outperformed BARD in the question categories of postoperative healing, medical eye conditions, and medications. Categorizing questions by anatomy, ChatGPT excelled in answering lacrimal questions (83.8%), while BARD performed best in the eyelid group (60.4%). ChatGPT's answers were longer and potentially more challenging to comprehend than BARD's. CONCLUSION This study emphasizes the promising role of artificial intelligence-powered chatbots in oculoplastic patient education and support. With continued development, these chatbots may assist physicians and offer patients accurate information, ultimately contributing to improved patient care while alleviating surgeon burnout. However, it is crucial to highlight that artificial intelligence may be good at answering questions, but physician oversight remains essential to ensure the highest standard of care and address complex medical cases.
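The automated readability index (ARI) used above has a simple character-based formula, shown in the minimal sketch below. The sample response text is an assumption for illustration; the study's actual responses and tooling are not reproduced here.

```python
import re

def automated_readability_index(text):
    """ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    characters = sum(len(w) for w in words)
    n_words = max(1, len(words))
    return 4.71 * (characters / n_words) + 0.5 * (n_words / sentences) - 21.43

sample = ("Swelling after eyelid surgery is common and usually improves within "
          "one to two weeks. Cold compresses and head elevation can help.")
print(f"ARI: {automated_readability_index(sample):.1f}")
```

Unlike the syllable-based Flesch indices, ARI relies only on character, word, and sentence counts, which makes it easy to compute consistently across chatbot outputs.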
Collapse
Affiliation(s)
- Eman M Al-Sharif
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Clinical Sciences Department, College of Medicine, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia
| | - Rafaella C Penteado
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
| | - Nahia Dib El Jalbout
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
| | - Nicole J Topilow
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
| | - Marissa K Shoji
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
| | - Don O Kikkawa
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A
| | - Catherine Y Liu
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
| | - Bobby S Korn
- Division of Oculofacial Plastic and Reconstructive Surgery, Viterbi Family Department of Ophthalmology, UC San Diego Shiley Eye Institute, La Jolla, California, U.S.A
- Division of Plastic and Reconstructive Surgery, Department of Surgery, UC San Diego School of Medicine, La Jolla, California, U.S.A
| |
Collapse
|
49
|
Su MC, Lin LE, Lin LH, Chen YC. Assessing question characteristic influences on ChatGPT's performance and response-explanation consistency: Insights from Taiwan's Nursing Licensing Exam. Int J Nurs Stud 2024; 153:104717. [PMID: 38401366 DOI: 10.1016/j.ijnurstu.2024.104717] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2023] [Revised: 01/25/2024] [Accepted: 02/02/2024] [Indexed: 02/26/2024]
Abstract
BACKGROUND This study investigates the integration of an artificial intelligence tool, specifically ChatGPT, in nursing education, addressing its effectiveness in exam preparation and self-assessment. OBJECTIVE This study aims to evaluate the performance of ChatGPT, one of the most promising artificial intelligence-driven language understanding tools, in answering question banks for nursing licensing examination preparation. It further analyzes question characteristics that might impact the accuracy of ChatGPT-generated answers and examines its reliability through human expert reviews. DESIGN Cross-sectional survey comparing ChatGPT-generated answers and their explanations. SETTING 400 questions from Taiwan's 2022 Nursing Licensing Exam. METHODS The study analyzed 400 questions from five distinct subjects of Taiwan's 2022 Nursing Licensing Exam using the ChatGPT model, which provided answers and in-depth explanations for each question. The impact of various question characteristics, such as type and cognitive level, on the accuracy of the ChatGPT-generated responses was assessed using logistic regression analysis. Additionally, human experts evaluated the explanations for each question, comparing them with the ChatGPT-generated answers to determine consistency. RESULTS ChatGPT exhibited an overall accuracy of 80.75% on Taiwan's National Nursing Exam, a score sufficient to pass the exam. The accuracy of ChatGPT-generated answers diverged significantly across test subjects, demonstrating a hierarchy ranging from General Medicine at 88.75%, Medical-Surgical Nursing at 80.0%, Psychology and Community Nursing at 70.0%, Obstetrics and Gynecology Nursing at 67.5%, down to Basic Nursing at 63.0%. ChatGPT had a higher probability of producing incorrect responses for questions with certain characteristics, notably those with clinical vignettes [odds ratio 2.19, 95% confidence interval 1.24-3.87, P = 0.007] and complex multiple-choice questions [odds ratio 2.37, 95% confidence interval 1.00-5.60, P = 0.049]. Furthermore, 14.25% of ChatGPT-generated answers were inconsistent with their explanations, leading to a reduction in the overall accuracy to 74%. CONCLUSIONS This study reveals ChatGPT's capabilities and limitations in nursing exam preparation, underscoring its potential as an auxiliary educational tool. It highlights the model's varied performance across different question types and notable inconsistencies between its answers and explanations. The study contributes significantly to the understanding of artificial intelligence in learning environments, guiding the future development of more effective and reliable artificial intelligence-based educational technologies. TWEETABLE ABSTRACT New study reveals ChatGPT's potential and challenges in nursing education: Achieves 80.75% accuracy in exam prep but faces hurdles with complex questions and logical consistency. #AIinNursing #AIinEducation #NursingExams #ChatGPT.
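The question-characteristic analysis above is a standard logistic regression reported as odds ratios. The sketch below shows that general pattern with statsmodels on simulated data; the predictor effects, sample size, and variable names are assumptions chosen only to echo the direction of the reported results, not the study's actual question bank.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)

# Toy data: 400 questions with two binary characteristics; outcome = incorrect answer.
# Effect sizes are illustrative assumptions echoing the direction reported above.
n = 400
vignette = rng.integers(0, 2, n)       # 1 = question contains a clinical vignette
complex_mcq = rng.integers(0, 2, n)    # 1 = complex multiple-choice format
logit_incorrect = -1.6 + 0.78 * vignette + 0.86 * complex_mcq
incorrect = rng.random(n) < 1 / (1 + np.exp(-logit_incorrect))
df = pd.DataFrame({"incorrect": incorrect.astype(int),
                   "vignette": vignette, "complex_mcq": complex_mcq})

# Logistic regression of incorrectness on question characteristics
fit = smf.logit("incorrect ~ vignette + complex_mcq", data=df).fit(disp=False)

# Odds ratios with 95% confidence intervals
summary = pd.concat([np.exp(fit.params).rename("OR"),
                     np.exp(fit.conf_int()).rename(columns={0: "2.5%", 1: "97.5%"})],
                    axis=1)
print(summary)
```

Exponentiating the fitted coefficients and their confidence bounds yields odds ratios directly comparable in form to those quoted in the abstract.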
Collapse
Affiliation(s)
- Mei-Chin Su
- Department of Nursing, Taipei Veterans General Hospital, Taipei, Taiwan
| | - Li-En Lin
- Big Data Center, Taipei Veterans General Hospital, Taipei, Taiwan
| | - Li-Hwa Lin
- Department of Nursing, Taipei Veterans General Hospital, Taipei, Taiwan; Institute of Community Health Care, College of Nursing, National Yang Ming Chiao Tung University, Taipei, Taiwan
| | - Yu-Chun Chen
- Big Data Center, Taipei Veterans General Hospital, Taipei, Taiwan; Institute of Hospital and Health Care Administration, National Yang-Ming Chiao-Tung University, Taipei, Taiwan; School of Medicine, National Yang-Ming Chiao-Tung University, Taipei, Taiwan; Department of Family Medicine, Taipei Veterans General Hospital, Taipei, Taiwan.
| |
Collapse
|
50
|
Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.04.26.24306390. [PMID: 38712148 PMCID: PMC11071576 DOI: 10.1101/2024.04.26.24306390] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/08/2024]
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare. Numerous studies have since been conducted regarding how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications and concerns of applying conversational LLMs in healthcare and provide an agenda for future research on LLMs in healthcare. Methods We utilized PubMed, ACM, and IEEE digital libraries as primary sources for this review. We followed the guidance of Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) to screen and select peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when we started paper collection and screening. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers according to targeted keywords, out of which 65 papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). These papers were classified into four categories in terms of their applications: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine (75%) of the research papers used LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs exhibit promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs like ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments that thoroughly examined how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs and to promote, improve, and regulate the application of LLMs in healthcare.
Collapse
Affiliation(s)
- Leyao Wang
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Zhiyu Wan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Congning Ni
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Qingyuan Song
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Yang Li
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
| | - Ellen Wright Clayton
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, Tennessee, USA, 37203
| | - Bradley A. Malin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
- Department of Biostatistics, Vanderbilt University Medical Center, TN, USA, 37203
| | - Zhijun Yin
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA, 37212
- Department of Biomedical Informatics, Vanderbilt University Medical Center, TN, USA, 37203
| |
Collapse
|