1
Preiksaitis C, Ashenburg N, Bunney G, Chu A, Kabeer R, Riley F, Ribeira R, Rose C. The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review. JMIR Med Inform 2024; 12:e53787. PMID: 38728687; DOI: 10.2196/53787.
Abstract
BACKGROUND Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. OBJECTIVE Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs' potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. METHODS Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs' use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. RESULTS A total of 43 papers were included. Studies were predominantly from 2022 to 2023 and conducted in the United States and China. We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs' outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs' capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. CONCLUSIONS LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians' AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied.
Affiliation(s)
- Carl Preiksaitis: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Nicholas Ashenburg: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Gabrielle Bunney: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Andrew Chu: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Rana Kabeer: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Fran Riley: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Ryan Ribeira: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
- Christian Rose: Department of Emergency Medicine, Stanford University School of Medicine, Palo Alto, CA, United States
2
D'Anna G, Van Cauter S, Thurnher M, Van Goethem J, Haller S. Can large language models pass official high-grade exams of the European Society of Neuroradiology courses? A direct comparison between OpenAI chatGPT 3.5, OpenAI GPT4 and Google Bard. Neuroradiology 2024 (published online ahead of print). PMID: 38705899; DOI: 10.1007/s00234-024-03371-6.
Abstract
We compared three LLMs, ChatGPT 3.5, GPT-4, and Google Bard, and tested whether their performance differs across subspecialty domains when taking the official written examinations of four courses of the European Society of Neuroradiology (ESNR): anatomy/embryology (30 questions), neuro-oncology (50 questions), head and neck (50 questions), and pediatrics (50 questions). All exams together, and each exam separately, were presented to the three LLMs. Statistical analyses included a group-wise Friedman test followed by pair-wise Wilcoxon tests with multiple-comparison correction. Overall, there was a significant difference between the 3 LLMs (p < 0.0001), with GPT-4 having the highest accuracy (70%), followed by ChatGPT 3.5 (54%) and Google Bard (36%). Pair-wise comparisons showed significant differences between ChatGPT 3.5 and GPT-4 (p < 0.0001), ChatGPT 3.5 and Bard (p < 0.0023), and GPT-4 and Bard (p < 0.0001). Analyses per subspecialty showed the largest gap between the best LLM (GPT-4, 70%) and the worst (Google Bard, 24%) in the head and neck exam, while the difference was least pronounced in neuro-oncology (GPT-4, 62% vs Google Bard, 48%). In sum, the three LLMs differed significantly in their performance on official ESNR exams; GPT-4 performed best and Google Bard worst, with the gap varying by subspecialty and most pronounced in head and neck.
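To make the reported workflow concrete, here is a minimal sketch of the same statistical pipeline (a group-wise Friedman test, then pairwise Wilcoxon tests with multiple-comparison correction) using SciPy; the per-question 0/1 scores are simulated placeholders, not the study's data.

```python
# Sketch of the analysis above: Friedman test across the three models,
# then pairwise Wilcoxon signed-rank tests with Bonferroni correction.
# Per-question correctness arrays are simulated, not the ESNR exam data.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_questions = 180  # 30 + 50 + 50 + 50 questions across the four exams
gpt4 = rng.binomial(1, 0.70, n_questions)
chatgpt35 = rng.binomial(1, 0.54, n_questions)
bard = rng.binomial(1, 0.36, n_questions)

stat, p = friedmanchisquare(gpt4, chatgpt35, bard)
print(f"Friedman: chi2={stat:.2f}, p={p:.2g}")

pairs = {"GPT-4 vs ChatGPT 3.5": (gpt4, chatgpt35),
         "ChatGPT 3.5 vs Bard": (chatgpt35, bard),
         "GPT-4 vs Bard": (gpt4, bard)}
for name, (a, b) in pairs.items():
    _, p_pair = wilcoxon(a, b)
    print(f"{name}: corrected p={min(p_pair * len(pairs), 1.0):.2g}")
```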
Affiliation(s)
- Gennaro D'Anna: Neuroimaging Unit, ASST Ovest Milanese, Legnano, Milan, Italy
- Sofie Van Cauter: Department of Medical Imaging, Ziekenhuis Oost-Limburg, Genk, Belgium; Department of Medicine and Life Sciences, Hasselt University, Hasselt, Belgium
- Majda Thurnher: Department for Biomedical Imaging and Image-Guided Therapy, Medical University of Vienna, Vienna, Austria
- Johan Van Goethem: Department of Medical and Molecular Imaging, VITAZ, Sint-Niklaas, Belgium; Department of Radiology, University Hospital Antwerp, Antwerp, Belgium
- Sven Haller: CIMC-Centre d'Imagerie Médicale de Cornavin, Geneva, Switzerland; Department of Surgical Sciences, Radiology, Uppsala University, Uppsala, Sweden; Faculty of Medicine, University of Geneva, Geneva, Switzerland; Department of Radiology, Beijing Tiantan Hospital, Capital Medical University, Beijing, People's Republic of China
3
Leas EC, Ayers JW, Desai N, Dredze M, Hogarth M, Smith DM. Using Large Language Models to Support Content Analysis: A Case Study of ChatGPT for Adverse Event Detection. J Med Internet Res 2024; 26:e52499. PMID: 38696245; DOI: 10.2196/52499.
Abstract
This study explores the potential of using large language models to assist content analysis by conducting a case study to identify adverse events (AEs) in social media posts. The case study compares ChatGPT's performance with human annotators' in detecting AEs associated with delta-8-tetrahydrocannabinol, a cannabis-derived product. Using the identical instructions given to human annotators, ChatGPT closely approximated human results, with a high degree of agreement noted: 94.4% (9436/10,000) for any AE detection (Fleiss κ=0.95) and 99.3% (9931/10,000) for serious AEs (κ=0.96). These findings suggest that ChatGPT has the potential to replicate human annotation accurately and efficiently. The study recognizes possible limitations, including concerns about the generalizability due to ChatGPT's training data, and prompts further research with different models, data sources, and content analysis tasks. The study highlights the promise of large language models for enhancing the efficiency of biomedical research.
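In outline, the agreement analysis can be scripted in a few lines; the sketch below uses statsmodels for Fleiss' kappa, with simulated binary labels standing in for the study's 10,000 annotated posts.

```python
# Percent agreement and Fleiss' kappa between two raters (ChatGPT vs human)
# on binary adverse-event labels. Labels here are simulated placeholders.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(1)
human = rng.binomial(1, 0.2, 10_000)   # hypothetical human AE labels
flip = rng.random(10_000) < 0.056      # ~5.6% disagreement (94.4% agreement)
chatgpt = np.where(flip, 1 - human, human)

print(f"Percent agreement: {np.mean(human == chatgpt):.1%}")

# Rows = posts, columns = raters; aggregate into per-post category counts.
table, _ = aggregate_raters(np.column_stack([human, chatgpt]))
print(f"Fleiss kappa: {fleiss_kappa(table):.2f}")
```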
Affiliation(s)
- Eric C Leas: Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, La Jolla, CA, United States; Qualcomm Institute, University of California San Diego, La Jolla, CA, United States
- John W Ayers: Qualcomm Institute, University of California San Diego, La Jolla, CA, United States; Division of Infectious Diseases and Global Public Health, Department of Medicine, University of California San Diego, La Jolla, CA, United States; Altman Clinical Translational Research Institute, University of California San Diego, La Jolla, CA, United States
- Nimit Desai: Qualcomm Institute, University of California San Diego, La Jolla, CA, United States
- Mark Dredze: Department of Computer Science, Johns Hopkins University, Baltimore, MD, United States
- Michael Hogarth: Altman Clinical Translational Research Institute, University of California San Diego, La Jolla, CA, United States; Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, United States
- Davey M Smith: Division of Infectious Diseases and Global Public Health, Department of Medicine, University of California San Diego, La Jolla, CA, United States; Altman Clinical Translational Research Institute, University of California San Diego, La Jolla, CA, United States
4
Sohail SS. A Promising Start and Not a Panacea: ChatGPT's Early Impact and Potential in Medical Science and Biomedical Engineering Research. Ann Biomed Eng 2024; 52:1131-1135. PMID: 37540292; DOI: 10.1007/s10439-023-03335-6.
Abstract
The advent of artificial intelligence (AI) has catalyzed a revolutionary transformation across various industries, including healthcare. Medical applications of ChatGPT, a powerful language model based on the generative pre-trained transformer (GPT) architecture, encompass the creation of conversational agents capable of accessing and generating medical information from multiple sources and formats. This study investigates the research trends of large language models such as ChatGPT, GPT 4, and Google Bard, comparing their publication trends with early COVID-19 research. The findings underscore the current prominence of AI research and its potential implications in biomedical engineering. A search of the Scopus database on July 23, 2023, yielded 1,096 articles related to ChatGPT, with approximately 26% being medical science-related. Keywords related to artificial intelligence, natural language processing (NLP), LLM, and generative AI dominate ChatGPT research, while a focused representation of medical science research emerges, with emphasis on biomedical research and engineering. This analysis serves as a call to action for researchers, healthcare professionals, and policymakers to recognize and harness AI's potential in healthcare, particularly in the realm of biomedical research.
Affiliation(s)
- Shahab Saquib Sohail: Department of Computer Science and Engineering, School of Engineering Sciences and Technology, Jamia Hamdard, New Delhi, 110062, India
5
Vaid A, Duong SQ, Lampert J, Kovatch P, Freeman R, Argulian E, Croft L, Lerakis S, Goldman M, Khera R, Nadkarni GN. Local large language models for privacy-preserving accelerated review of historic echocardiogram reports. J Am Med Inform Assoc 2024 (published online ahead of print). PMID: 38687616; DOI: 10.1093/jamia/ocae085.
Abstract
OBJECTIVES The study developed a framework that leverages an open-source Large Language Model (LLM) to enable clinicians to ask plain-language questions about a patient's entire echocardiogram report history. This approach is intended to streamline the extraction of clinical insights from multiple echocardiogram reports, particularly in patients with complex cardiac diseases, thereby enhancing both patient care and research efficiency. MATERIALS AND METHODS Data from over 10 years were collected, comprising echocardiogram reports from patients with more than 10 echocardiograms on file at the Mount Sinai Health System. These reports were concatenated into a single document per patient, broken down into snippets, and the snippets most relevant to a question were retrieved using text similarity measures. The LLaMA-2 70B model analyzed the retrieved text using a specially crafted prompt, and its performance was evaluated against ground-truth answers created by faculty cardiologists. RESULTS The study analyzed 432 reports from 37 patients for a total of 100 question-answer pairs. The LLM correctly answered 90% of questions, with accuracies of 83% for temporality, 93% for severity assessment, 84% for intervention identification, and 100% for diagnosis retrieval. Errors mainly stemmed from the LLM's inherent limitations, such as misinterpreting numbers or hallucinations. CONCLUSION The study demonstrates the feasibility and effectiveness of using a local, open-source LLM for querying and interpreting echocardiogram report data. This approach improves on traditional keyword-based searches by enabling more contextually relevant and semantically accurate responses, showing promise for enhancing clinical decision-making and research by facilitating more efficient access to complex patient data.
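As a rough illustration of this retrieve-then-read design, the sketch below ranks report snippets by TF-IDF cosine similarity and assembles a prompt. The snippets are invented, and `generate_with_local_llm` is a hypothetical stand-in for a local LLaMA-2 70B call, not the authors' code.

```python
# Retrieve the snippets most similar to the question, then build a prompt
# for a local LLM. TF-IDF here stands in for the paper's similarity measure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_snippets(snippets, question, k=2):
    vec = TfidfVectorizer().fit(snippets + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(snippets))[0]
    return [snippets[i] for i in sims.argsort()[::-1][:k]]

def build_prompt(question, context_snippets):
    context = "\n---\n".join(context_snippets)
    return ("You are reviewing a patient's echocardiogram report history.\n"
            f"Context:\n{context}\n\nQuestion: {question}\n"
            "Answer concisely using only the context.")

snippets = ["EF 55% on 2015 study.",                      # invented snippets
            "Severe mitral regurgitation noted in 2019.",
            "Status post mitral valve repair, 2020."]
question = "Has the patient ever had severe valve disease?"
prompt = build_prompt(question, top_snippets(snippets, question))
# answer = generate_with_local_llm(prompt)  # hypothetical LLaMA-2 70B wrapper
print(prompt)
```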
Affiliation(s)
- Akhil Vaid: The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; The Division of Data Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Son Q Duong: The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Division of Pediatric Cardiology, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Joshua Lampert: The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Helmsley Electrophysiology Center, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Patricia Kovatch: Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Robert Freeman: Department of Population Health Science and Policy, Institute for Healthcare Delivery Science, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Edgar Argulian: The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Lori Croft: The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Stamatios Lerakis: The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Martin Goldman: The Zena and Michael A. Wiener Cardiovascular Institute, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; Mount Sinai Heart, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
- Rohan Khera: Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT 06510, United States
- Girish N Nadkarni: The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States; The Division of Data Driven and Digital Medicine (D3M), Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, United States
6
Pham C, Govender R, Tehami S, Chavez S, Adepoju OE, Liaw W. ChatGPT's Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study. J Med Internet Res 2024; 26:e55037. PMID: 38648098; DOI: 10.2196/55037.
Abstract
BACKGROUND ChatGPT is the most advanced large language model to date, with prior iterations having passed medical licensing examinations, provided clinical decision support, and improved diagnostics. Although limited, past studies of ChatGPT's performance found that artificial intelligence could pass the American Heart Association's advanced cardiovascular life support (ACLS) examinations with modifications. ChatGPT's accuracy has not been studied in more complex clinical scenarios. As heart disease and cardiac arrest remain leading causes of morbidity and mortality in the United States, finding technologies that help increase adherence to ACLS algorithms, which improves survival outcomes, is critical. OBJECTIVE This study aims to examine the accuracy of ChatGPT in following ACLS guidelines for bradycardia and cardiac arrest. METHODS We evaluated the accuracy of ChatGPT's responses to 2 simulations based on the 2020 American Heart Association ACLS guidelines with 3 primary outcomes of interest: the mean individual step accuracy, the accuracy score per simulation attempt, and the accuracy score for each algorithm. For each simulation step, ChatGPT was scored for correctness (1 point) or incorrectness (0 points). Each simulation was conducted 20 times. RESULTS ChatGPT's median accuracy for each step was 85% (IQR 40%-100%) for cardiac arrest and 30% (IQR 13%-81%) for bradycardia. ChatGPT's median accuracy over 20 simulation attempts was 69% (IQR 67%-74%) for cardiac arrest and 42% (IQR 33%-50%) for bradycardia. We found that ChatGPT's outputs varied despite consistent input, the same actions were persistently missed, repetitive overemphasis hindered guidance, and erroneous medication information was presented. CONCLUSIONS This study highlights the need for consistent and reliable guidance to prevent potential medical errors and optimize the application of ChatGPT to enhance its reliability and effectiveness in clinical practice.
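A minimal sketch of this scoring scheme, assuming 20 runs of a hypothetical 12-step algorithm with simulated 0/1 step scores:

```python
# Each simulation step scores 1 (correct) or 0 (incorrect); summarize
# per-attempt accuracy as median and IQR over repeated runs.
import numpy as np

rng = np.random.default_rng(2)
n_attempts, n_steps = 20, 12  # assumed dimensions, not the study's exact counts
scores = rng.binomial(1, 0.69, (n_attempts, n_steps))

per_attempt = scores.mean(axis=1)   # accuracy score per simulation attempt
per_step = scores.mean(axis=0)      # mean accuracy of each individual step

q1, med, q3 = np.percentile(per_attempt, [25, 50, 75])
print(f"Per-attempt accuracy: median {med:.0%} (IQR {q1:.0%}-{q3:.0%})")
print("Per-step accuracy:", np.round(per_step, 2))
```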
Affiliation(s)
- Cecilia Pham: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Romi Govender: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Salik Tehami: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States
- Summer Chavez: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States; Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States; Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
- Omolola E Adepoju: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States; Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States; Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
- Winston Liaw: Tilman J Fertitta Family College of Medicine, University of Houston, Houston, TX, United States; Humana Integrated Health Sciences Institute, University of Houston, Houston, TX, United States; Department of Health Systems and Population Health Sciences, Tilman J Fertitta Family College of Medicine, Houston, TX, United States
7
Ghosh M, Mukherjee S, Ganguly A, Basuchowdhuri P, Naskar SK, Ganguly D. AlpaPICO: Extraction of PICO frames from clinical trial documents using LLMs. Methods 2024; 226:78-88. PMID: 38643910; DOI: 10.1016/j.ymeth.2024.04.005.
Abstract
In recent years, there has been a surge in the publication of clinical trial reports, making it challenging to conduct systematic reviews. Automatically extracting Population, Intervention, Comparator, and Outcome (PICO) elements from clinical trial studies can alleviate the traditionally time-consuming process of manually scrutinizing systematic reviews. Existing approaches to PICO frame extraction are supervised and rely on manually annotated data points in the form of BIO label tagging. Recent approaches, such as in-context learning (ICL), which has been shown to be effective for a number of downstream NLP tasks, still require labeled examples. In this work, we adopt an ICL strategy that employs the pretrained knowledge of Large Language Models (LLMs), gathered during the pretraining phase, to automatically extract PICO-related terminologies from clinical trial documents in an unsupervised setup, bypassing the need for large numbers of annotated data instances. Additionally, to showcase the effectiveness of LLMs in the oracle scenario where many annotated samples are available, we adopt an instruction-tuning strategy, employing Low-Rank Adaptation (LoRA) to train a large model for the PICO frame extraction task in a low-resource environment. Both proposed frameworks use AlpaCare as the base LLM, employing few-shot in-context learning and instruction tuning, respectively, to extract PICO-related terms from clinical trial reports. We applied these approaches to the widely used coarse-grained datasets EBM-NLP and EBM-COMET and the fine-grained datasets EBM-NLPrev and EBM-NLPh. Our empirical results show that the proposed ICL-based framework produces comparable results on all versions of the EBM-NLP dataset, and the instruction-tuned version of our framework produces state-of-the-art results on all the EBM-NLP datasets. Our project is available at https://github.com/shrimonmuke0202/AlpaPICO.git.
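For readers unfamiliar with the instruction-tuning side, the sketch below shows what LoRA adaptation of a causal LM looks like with the Hugging Face peft library. The checkpoint name, target modules, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# LoRA instruction tuning of a causal LM for PICO extraction (sketch only).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "xz97/AlpaCare-llama2-7b"  # assumed checkpoint name for AlpaCare
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

instruction = ("Extract the Population, Intervention, Comparator, and Outcome "
               "spans from the following clinical trial abstract:\n{abstract}")
# Training would proceed with a standard supervised fine-tuning loop over
# (instruction, annotated-span) pairs from EBM-NLP; omitted for brevity.
```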
Affiliation(s)
- Madhusudan Ghosh: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
- Shrimon Mukherjee: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
- Asmit Ganguly: Computer Science and Engineering, Indian Institute of Technology, Patna, India
- Partha Basuchowdhuri: School of Mathematical and Computational Sciences, Indian Association for the Cultivation of Science, India
- Sudip Kumar Naskar: Department of Computer Science and Engineering, Jadavpur University, India
- Debasis Ganguly: School of Computing Science, University of Glasgow, United Kingdom of Great Britain and Northern Ireland
8
Herrmann-Werner A, Festl-Wietek T, Holderried F, Herschbach L, Griewatz J, Masters K, Zipfel S, Mahling M. Authors' Reply: "Evaluating GPT-4's Cognitive Functions Through the Bloom Taxonomy: Insights and Clarifications". J Med Internet Res 2024; 26:e57778. PMID: 38625723; PMCID: PMC11068087; DOI: 10.2196/57778.
Affiliation(s)
- Anne Herrmann-Werner: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany; Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
- Teresa Festl-Wietek: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Friederike Holderried: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany; University Department of Anesthesiology and Intensive Care Medicine, University Hospital Tübingen, Tübingen, Germany
- Lea Herschbach: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Jan Griewatz: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Ken Masters: Medical Education and Informatics Department, College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, Oman
- Stephan Zipfel: Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
- Moritz Mahling: Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany; Department of Diabetology, Endocrinology, Nephrology, Section of Nephrology and Hypertension, University Hospital Tübingen, Tübingen, Germany
9
Hadar-Shoval D, Asraf K, Mizrachi Y, Haber Y, Elyoseph Z. Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz's Theory of Basic Values. JMIR Ment Health 2024; 11:e55988. PMID: 38593424; DOI: 10.2196/55988.
Abstract
BACKGROUND Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making has ethical importance. Schwartz's theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics. OBJECTIVE This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other. METHODS In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire-Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs' value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests. RESULTS The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs' value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs' distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs' responses when presented with mental health dilemmas requiring choosing between opposing values. This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making. CONCLUSIONS This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values.
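A simplified sketch of the questionnaire-administration procedure might look like the following; `ask_llm` is a hypothetical chat-completion wrapper, and the two items are paraphrased placeholders rather than the actual PVQ-RR wording.

```python
# Present survey items to an anthropomorphized LLM over repeated trials
# and summarize Likert ratings per value (sketch with placeholder items).
from statistics import mean, stdev

ITEMS = {
    "universalism": "It is important to this person that everyone is treated equally.",
    "power": "It is important to this person to be in charge and tell others what to do.",
}
SYSTEM = ("Answer as a person. Rate how much the statement is like you, from "
          "1 (not like me at all) to 6 (very much like me). Reply with the number only.")

def value_profile(ask_llm, n_trials=10):
    scores = {v: [] for v in ITEMS}
    for _ in range(n_trials):                 # repeated trials for reliability
        for value, item in ITEMS.items():
            reply = ask_llm(system=SYSTEM, user=item)
            scores[value].append(int(reply.strip()[0]))
    return {v: (mean(s), stdev(s)) for v, s in scores.items()}
```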
Affiliation(s)
- Dorit Hadar-Shoval: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Kfir Asraf: The Psychology Department, Max Stern Yezreel Valley College, Tel Adashim, Israel
- Yonathan Mizrachi: The Jane Goodall Institute, Max Stern Yezreel Valley College, Tel Adashim, Israel; The Laboratory for AI, Machine Learning, Business & Data Analytics, Tel-Aviv University, Tel Aviv, Israel
- Yuval Haber: The PhD Program of Hermeneutics and Cultural Studies, Interdisciplinary Studies Unit, Bar-Ilan University, Ramat Gan, Israel
- Zohar Elyoseph: The Psychology Department, Center for Psychobiological Research, Max Stern Yezreel Valley College, Tel Adashim, Israel; Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
10
Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T. Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration. JMIR Med Inform 2024; 12:e55627. PMID: 38592758; DOI: 10.2196/55627.
Abstract
BACKGROUND In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. OBJECTIVE This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. METHODS We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. RESULTS The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). Additionally, ChatGPT-4's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. CONCLUSIONS Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
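The top-diagnosis comparison can be checked against the reported counts with a chi-square test on a 2x2 contingency table; a minimal sketch with SciPy:

```python
# Chi-square test comparing top-diagnosis hit rates of the two model variants,
# using the counts reported in the abstract (161/363 vs 203/363).
from scipy.stats import chi2_contingency

n = 363
table = [[161, n - 161],   # ChatGPT-4V (text + image): 44.4% correct top diagnosis
         [203, n - 203]]   # ChatGPT-4 (text only): 55.9% correct top diagnosis
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.3f}")  # expect p close to .002
```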
Affiliation(s)
- Takanobu Hirosawa: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Yukinori Harada: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
- Kazuki Tokumasu: Department of General Medicine, Okayama University Graduate School of Medicine, Dentistry and Pharmaceutical Sciences, Okayama, Japan
- Tomoharu Suzuki: Department of Hospital Medicine, Urasoe General Hospital, Okinawa, Japan
- Taro Shimizu: Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, Shimotsuga, Japan
11
Sivarajkumar S, Kelley M, Samolyk-Mazzanti A, Visweswaran S, Wang Y. An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study. JMIR Med Inform 2024; 12:e55318. PMID: 38587879; PMCID: PMC11036183; DOI: 10.2196/55318.
Abstract
BACKGROUND Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. OBJECTIVE The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models. METHODS This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches. RESULTS The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types. CONCLUSIONS This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.
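To illustrate the prompt taxonomy, the sketch below builds invented examples of several prompt styles for a clinical sense-disambiguation task and combines their outputs by majority vote; `ask_llm` is a hypothetical wrapper around any chat-completion API, and the wordings are not the paper's exact prompts.

```python
# Invented examples of zero-shot prompt styles plus a majority-vote ensemble.
from collections import Counter

note = "Pt reports CP on exertion, relieved by rest."
prompts = {
    "simple_prefix": f"What does 'CP' mean in this clinical note?\n{note}",
    "simple_cloze": f"{note}\nIn this note, 'CP' stands for ___.",
    "chain_of_thought": f"{note}\nLet's reason step by step about what 'CP' means here.",
    "heuristic": f"As a clinician reading cardiology notes, expand 'CP' in:\n{note}",
}

def ensemble_answer(ask_llm):
    answers = [ask_llm(p).strip().lower() for p in prompts.values()]
    return Counter(answers).most_common(1)[0][0]  # most frequent answer wins
```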
Affiliation(s)
- Sonish Sivarajkumar: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States
- Mark Kelley: Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States
- Alyssa Samolyk-Mazzanti: Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States
- Shyam Visweswaran: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States
- Yanshan Wang: Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States; Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States
12
Waisberg E, Ong J, Masalkhi M, Lee AG. OpenAI's Sora in medicine: revolutionary advances in generative artificial intelligence for healthcare. Ir J Med Sci 2024 (published online ahead of print). PMID: 38570404; DOI: 10.1007/s11845-024-03680-y.
Affiliation(s)
- Ethan Waisberg: Department of Ophthalmology, University of Cambridge, Cambridge, UK
- Joshua Ong: Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Mouayad Masalkhi: Center for Space Medicine, Baylor College of Medicine, Houston, TX, USA
- Andrew G Lee: Department of Ophthalmology, Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA; The Houston Methodist Research Institute, Houston Methodist Hospital, Houston, TX, USA; Departments of Ophthalmology, Neurology, and Neurosurgery, Weill Cornell Medicine, New York, NY, USA; Department of Ophthalmology, University of Texas Medical Branch, Galveston, TX, USA; University of Texas MD Anderson Cancer Center, Houston, TX, USA; Texas A&M College of Medicine, Bryan, TX, USA; Department of Ophthalmology, The University of Iowa Hospitals and Clinics, Iowa City, IA, USA
13
Waisberg E, Ong J, Masalkhi M, Lee AG. Concerns with OpenAI's Sora in Medicine. Ann Biomed Eng 2024 (published online ahead of print). PMID: 38558354; DOI: 10.1007/s10439-024-03505-0.
Abstract
OpenAI's Sora represents a ground-breaking innovation in AI that can generate lifelike and imaginative visual scenes from text prompts. However, Sora has also raised new concerns surrounding artificial video generation in medicine. While Sora shows great promise for enhancing patient education, facilitating remote consultations, and simulating surgical procedures, AI-generated videos also bring technical, legal, and ethical challenges. In this paper, we explore the clinical and ethical implications of Sora's AI-generated videos in the field of medicine.
Affiliation(s)
- Ethan Waisberg: Department of Ophthalmology, University of Cambridge, Cambridge, UK
- Joshua Ong: Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
- Mouayad Masalkhi: Center for Space Medicine, Baylor College of Medicine, Houston, TX, USA
- Andrew G Lee: Department of Ophthalmology, Blanton Eye Institute, Houston Methodist Hospital, Houston, TX, USA; The Houston Methodist Research Institute, Houston Methodist Hospital, Houston, TX, USA; Departments of Ophthalmology, Neurology, and Neurosurgery, Weill Cornell Medicine, New York, NY, USA; Department of Ophthalmology, University of Texas Medical Branch, Galveston, TX, USA; University of Texas MD Anderson Cancer Center, Houston, TX, USA; Texas A&M College of Medicine, Bryan, TX, USA; Department of Ophthalmology, The University of Iowa Hospitals and Clinics, Iowa City, IA, USA
14
Valdez D, Bunnell A, Lim SY, Sadowski P, Shepherd JA. Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists. J Clin Densitom 2024; 27:101480. PMID: 38401238; DOI: 10.1016/j.jocd.2024.101480.
Abstract
BACKGROUND Artificial intelligence (AI) large language models (LLMs) such as ChatGPT have demonstrated the ability to pass standardized exams. These models are not trained for a specific task; instead, they are trained to predict sequences of text from large corpora of documents sourced from the internet. It has been shown that even models trained on this general task can pass exams in a variety of domain-specific fields, including the United States Medical Licensing Examination. We asked whether large language models would perform as well on much narrower subdomain tests designed for medical specialists. Furthermore, we wanted to better understand how progressive generations of GPT (generative pre-trained transformer) models may be evolving in the completeness and sophistication of their responses even while generational training remains general. In this study, we evaluated two versions of GPT (GPT-3 and GPT-4) on their ability to pass the certification exam given to physicians who work as osteoporosis specialists and become certified clinical densitometrists. The CCD exam is scored on a range of 150 to 400, with a score of 300 required to pass. METHODS A 100-question multiple-choice practice exam was obtained from a third-party exam preparation website that mimics the accredited certification tests given by the ISCD (International Society for Clinical Densitometry). The exam was administered to two versions of GPT, the free version (GPT Playground) and ChatGPT+, which are based on GPT-3 and GPT-4, respectively (OpenAI, San Francisco, CA). The systems were prompted with the exam questions verbatim. If a response was purely textual and did not specify which of the multiple-choice answers to select, the authors matched the text to the closest answer. Each exam was graded, and an estimated ISCD score was provided by the exam website. In addition, each response was evaluated by a rheumatologist CCD and rated for accuracy on a 5-level scale. The two GPT versions were compared in terms of response accuracy and length. RESULTS The average response length was 11.6 ± 19 words for GPT-3 and 50.0 ± 43.6 words for GPT-4. GPT-3 answered 62 questions correctly, resulting in a failing ISCD score of 289, whereas GPT-4 answered 82 questions correctly, with a passing score of 342. GPT-3 scored highest on the "Overview of Low Bone Mass and Osteoporosis" category (72% correct), while GPT-4 scored above 80% accuracy on all categories except "Imaging Technology in Bone Health" (65% correct). Regarding subjective accuracy, GPT-3 answered 23 questions with nonsensical or totally wrong responses, while GPT-4 had no responses in that category. CONCLUSION If this had been an actual certification exam, GPT-4 would now have a CCD suffix to its name, even after being trained only on general internet knowledge. Clearly, more goes into physician training than can be captured by this exam. Nevertheless, GPT algorithms may prove to be valuable physician aids in the diagnosis and monitoring of osteoporosis and other diseases.
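The step of matching a purely textual response to the closest multiple-choice option can be done with simple string similarity; a sketch using Python's difflib, with an invented question and options:

```python
# Map a free-text model reply to the closest multiple-choice option.
import difflib

options = {  # invented example, not an actual ISCD exam item
    "A": "Dual-energy X-ray absorptiometry",
    "B": "Quantitative computed tomography",
    "C": "Peripheral ultrasound densitometry",
}

def match_choice(reply: str, options: dict) -> str:
    best = difflib.get_close_matches(reply, options.values(), n=1, cutoff=0.0)[0]
    return next(letter for letter, text in options.items() if text == best)

print(match_choice("dual energy x-ray absorptiometry (DXA)", options))  # -> A
```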
Affiliation(s)
- Dustin Valdez: University of Hawaii at Manoa, Honolulu, HI, USA; University of Hawaii Cancer Center, Honolulu, HI, USA
- Arianna Bunnell: University of Hawaii at Manoa, Honolulu, HI, USA; University of Hawaii Cancer Center, Honolulu, HI, USA
- Sian Y Lim: Hawai'i Pacific Health Medical Group, Hawai'i Pacific Health, Honolulu, HI, USA
15
Lin KC, Chen TA, Lin MH, Chen YC, Chen TJ. Integration and Assessment of ChatGPT in Medical Case Reporting: A Multifaceted Approach. Eur J Investig Health Psychol Educ 2024; 14:888-901. PMID: 38667812; PMCID: PMC11049282; DOI: 10.3390/ejihpe14040057.
Abstract
ChatGPT, a large language model, has gained significance in medical writing, particularly in case reports that document the course of an illness. This article explores the integration of ChatGPT and how ChatGPT shapes the process, product, and politics of medical writing in the real world. We conducted a bibliometric analysis on case reports utilizing ChatGPT and indexed in PubMed, encompassing publication information. Furthermore, an in-depth analysis was conducted to categorize the applications and limitations of ChatGPT and the publication trend of application categories. A total of 66 case reports utilizing ChatGPT were identified, with a predominant preference for the online version and English input by the authors. The prevalent application categories were information retrieval and content generation. Notably, this trend remained consistent across different months. Within the subset of 32 articles addressing ChatGPT limitations in case report writing, concerns related to inaccuracies and a lack of clinical context were prominently emphasized. This pointed out the important role of clinical thinking and professional expertise, representing the foundational tenets of medical education, while also accentuating the distinction between physicians and generative artificial intelligence.
Affiliation(s)
- Kuan-Chen Lin: Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan
- Tsung-An Chen: Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan
- Ming-Hwai Lin: Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan
- Yu-Chun Chen: Department of Family Medicine, Taipei Veterans General Hospital, No. 201, Sec. 2, Shih-Pai Road, Taipei 11217, Taiwan; School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan; Institute of Hospital and Health Care Administration, School of Medicine, National Yang Ming Chiao Tung University, Taipei 30010, Taiwan; Big Data Center, Taipei Veterans General Hospital, Taipei 11217, Taiwan
- Tzeng-Ji Chen: Department of Family Medicine, Taipei Veterans General Hospital Hsinchu Branch, No. 81, Sec. 1, Zhongfeng Road, Zhudong Township, Hsinchu 310403, Taiwan; Department of Post-Baccalaureate Medicine, National Chung Hsing University, No. 145, Xingda Road, South District, Taichung 402202, Taiwan
16
Noda M, Ueno T, Koshu R, Takaso Y, Shimada MD, Saito C, Sugimoto H, Fushiki H, Ito M, Nomura A, Yoshizaki T. Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study. JMIR Med Educ 2024; 10:e57054. PMID: 38546736; PMCID: PMC11009855; DOI: 10.2196/57054.
Abstract
BACKGROUND Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. OBJECTIVE This study aims to assess the answering capability and accuracy of ChatGPT-4 Vision (GPT-4V) on a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. METHODS Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the effects of the presence of images, the clinical area of the questions, and variations in answer content were examined. RESULTS The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate, and for all content types, the addition of translation and prompts increased the accuracy rate. For image-based questions specifically, the average correct-answer rate was 30.4% with text-only input and 41.3% with text-plus-image input (P=.02). CONCLUSIONS Examining artificial intelligence's answering capability on the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although accuracy improved with the addition of translation and prompts, the accuracy rate for image-based questions remained lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, supplying the image alongside the text yielded a higher correct-answer rate on image-based questions. Our findings suggest the usefulness and potential of GPT-4V in medicine; however, methods for its safe use require further consideration.
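Text-plus-image prompting of the kind evaluated here can be issued through the OpenAI chat completions API; a hedged sketch, in which the model name, question text, and image URL are placeholders rather than the study's exact setup:

```python
# Send a board-style question with its figure to a vision-capable model.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
resp = client.chat.completions.create(
    model="gpt-4-vision-preview",  # placeholder vision-capable model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Answer this otolaryngology board question (choose A-E): ..."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/question_figure.png"}},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)
```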
Affiliation(s)
- Masao Noda: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan; Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Takayoshi Ueno: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Ryota Koshu: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Yuji Takaso: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan; Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Mari Dias Shimada: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Chizu Saito: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Hisashi Sugimoto: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
- Hiroaki Fushiki: Department of Otolaryngology, Mejiro University Ear Institute Clinic, Saitama, Japan
- Makoto Ito: Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Shimotsuke, Japan
- Akihiro Nomura: College of Transdisciplinary Sciences for Innovation, Kanazawa University, Kanazawa, Japan
- Tomokazu Yoshizaki: Department of Otolaryngology and Head and Neck Surgery, Kanazawa University, Kanazawa, Japan
17
Raja H, Munawar A, Mylonas N, Delsoz M, Madadi Y, Elahi M, Hassan A, Abu Serhan H, Inam O, Hernandez L, Chen H, Tran S, Munir W, Abd-Alrazaq A, Yousefi S. Automated Category and Trend Analysis of Scientific Articles on Ophthalmology Using Large Language Models: Development and Usability Study. JMIR Form Res 2024; 8:e52462. PMID: 38517457; PMCID: PMC10998173; DOI: 10.2196/52462.
Abstract
BACKGROUND In this paper, we present an automated method for article classification, leveraging the power of large language models (LLMs). OBJECTIVE The aim of this study is to evaluate the applicability of various LLMs based on the textual content of scientific ophthalmology papers. METHODS We developed a model based on natural language processing techniques, including advanced LLMs, to process and analyze the textual content of scientific papers. Specifically, we used zero-shot learning LLMs and compared Bidirectional and Auto-Regressive Transformers (BART) and its variants with Bidirectional Encoder Representations from Transformers (BERT) and its variants, such as distilBERT, SciBERT, PubmedBERT, and BioBERT. To evaluate the LLMs, we compiled a data set (retinal diseases [RenD]) of 1000 ocular disease-related articles, which were expertly annotated by a panel of 6 specialists into 19 distinct categories. In addition to classifying the articles, we analyzed the classified groups to identify patterns and trends in the field. RESULTS The classification results demonstrate the effectiveness of LLMs in categorizing a large number of ophthalmology papers without human intervention. The model achieved a mean accuracy of 0.86 and a mean F1-score of 0.85 on the RenD data set. CONCLUSIONS The proposed framework achieves notable improvements in both accuracy and efficiency. Its application in the domain of ophthalmology showcases its potential for knowledge organization and retrieval. We performed a trend analysis that enables researchers and clinicians to easily categorize and retrieve relevant papers, saving time and effort in literature review and information gathering, as well as in the identification of emerging scientific trends within different disciplines. Moreover, the extendibility of the model to other scientific fields broadens its impact in facilitating research and trend analysis across diverse disciplines.
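For readers unfamiliar with the zero-shot classification approach compared here, the sketch below applies a standard BART checkpoint to a toy abstract. The candidate labels are illustrative stand-ins, not the study's 19 RenD categories.

```python
# A minimal sketch of zero-shot article classification with BART, in the
# spirit of the models the study compares against BERT variants.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli",  # a standard zero-shot BART checkpoint
)

abstract = (
    "We evaluated intravitreal anti-VEGF injections for neovascular "
    "age-related macular degeneration over 24 months."
)
labels = ["retinal diseases", "glaucoma", "cornea", "pediatric ophthalmology"]

result = classifier(abstract, candidate_labels=labels)
# The top-ranked label and its score; labels are sorted by descending score.
print(result["labels"][0], round(result["scores"][0], 3))
```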
Affiliation(s)
- Hina Raja
  - Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, United States
- Asim Munawar
  - Watson Research Center, IBM Research, New York, NY, United States
- Nikolaos Mylonas
  - School of Informatics, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammad Delsoz
  - Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, United States
- Yeganeh Madadi
  - Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, United States
- Muhammad Elahi
  - Quillen College of Medicine, East Tennessee State University, Johnson City, TN, United States
- Amr Hassan
  - Gavin Herbert Eye Institute, School of Medicine, University of California, Irvine, CA, United States
- Onur Inam
  - Edward S. Harkness Eye Institute, Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, NY, United States
  - Department of Biophysics, Faculty of Medicine, Gazi University, Ankara, Turkey
- Luis Hernandez
  - Association to Prevent Blindness in Mexico, Ciudad de México, Mexico
- Hao Chen
  - Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, United States
  - Department of Pharmacology, Addiction Science and Toxicology, University of Tennessee Health Science Center, Memphis, TN, United States
- Sang Tran
  - Department of Ophthalmology and Visual Sciences, School of Medicine, University of Maryland, Baltimore, MD, United States
- Wuqaas Munir
  - Department of Ophthalmology and Visual Sciences, School of Medicine, University of Maryland, Baltimore, MD, United States
- Alaa Abd-Alrazaq
  - AI Center for Precision Health, Weill Cornell Medicine-Qatar, Doha, Qatar
- Siamak Yousefi
  - Department of Ophthalmology, University of Tennessee Health Science Center, Memphis, TN, United States

18
Maloney LT, Dal Martello MF, Fei V, Ma V. A comparison of human and GPT-4 use of probabilistic phrases in a coordination game. Sci Rep 2024; 14:6835. [PMID: 38514688 PMCID: PMC10958015 DOI: 10.1038/s41598-024-56740-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Accepted: 03/11/2024] [Indexed: 03/23/2024] Open
Abstract
English speakers use probabilistic phrases such as likely to communicate information about the probability or likelihood of events. Communication is successful to the extent that the listener grasps what the speaker means to convey, and, if communication is successful, individuals can potentially coordinate their actions based on shared knowledge about uncertainty. We first assessed human ability to estimate the probability and the ambiguity (imprecision) of twenty-three probabilistic phrases in a coordination game in two different contexts, investment advice and medical advice. We then had GPT-4 (OpenAI), a large language model, complete the same tasks as the human participants. We found that GPT-4's estimates of probability in both the investment and medical contexts were as close or closer to those of the human participants as the human participants' estimates were to one another. However, further analyses of residuals disclosed small but significant differences between human and GPT-4 performance. Human probability estimates were compressed relative to those of GPT-4. Estimates of probability for both the human participants and GPT-4 were little affected by context. We propose that evaluation methods based on coordination games provide a systematic way to assess what GPT-4 and similar programs can and cannot do.
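A rough sketch of how one might elicit numeric probability estimates for such phrases from GPT-4 is shown below; the prompt wording, model name, and phrase list are assumptions for illustration, not the authors' materials.

```python
# Hedged sketch: ask GPT-4 what probability a probabilistic phrase conveys,
# loosely mirroring the study's investment-advice context.
from openai import OpenAI

client = OpenAI()
phrases = ["likely", "almost certain", "doubtful"]  # illustrative subset

for phrase in phrases:
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # favor stable estimates across runs
        messages=[{
            "role": "user",
            "content": (
                f"An investment advisor says an outcome is '{phrase}'. "
                "On a scale from 0 to 100, what probability do you think "
                "the advisor intends? Reply with one number."
            ),
        }],
    )
    print(phrase, reply.choices[0].message.content.strip())
```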
Affiliation(s)
- Laurence T Maloney
  - Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA
  - Center for Neural Science, New York University, 6 Washington Place, New York, NY, 10012, USA
- Maria F Dal Martello
  - Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA
  - Dipartimento di Psicologia Generale, Università di Padova, Via Venezia 8, Padua, Italy
- Vivian Fei
  - Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA
- Valerie Ma
  - Department of Psychology, New York University, 6 Washington Place, Room 574, New York, NY, 10012, USA

19
Elyoseph Z, Levkovich I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment Health 2024; 11:e53043. [PMID: 38533615 PMCID: PMC11004608 DOI: 10.2196/53043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Revised: 01/24/2024] [Accepted: 02/11/2024] [Indexed: 03/28/2024] Open
Abstract
Background The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by therapist trust in patient recovery potential and the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms, and whether recovery is possible remains a matter of debate. As artificial intelligence (AI) becomes integrated into the health care field, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia. Objective This study aimed to evaluate the ability of large language models (LLMs), in comparison to mental health professionals, to assess the prognosis of schizophrenia with and without professional treatment as well as its long-term positive and negative outcomes. Methods Vignettes were input into the LLM interfaces and assessed 10 times each by 4 AI platforms: ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude. A total of 80 evaluations were collected and benchmarked against existing norms reflecting what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and about the positive and negative long-term outcomes of schizophrenia interventions. Results For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs predicted that untreated schizophrenia would remain static or worsen without professional treatment. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4. Conclusions The finding that 3 out of the 4 LLMs aligned closely with the predictions of mental health professionals under the "with treatment" condition demonstrates the potential of this technology for providing professional clinical prognosis. The pessimistic assessment by ChatGPT-3.5 is a disturbing finding, since it may reduce the motivation of patients to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise in augmenting health care, their application necessitates rigorous validation and a harmonious blend with human expertise.
Affiliation(s)
- Zohar Elyoseph
  - Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom
  - The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich
  - Faculty of Graduate Studies, Oranim Academic College, Kiryat Tiv'on, Israel

20
Raman R, Venugopalan M, Kamal A. Evaluating human resources management literacy: A performance analysis of ChatGPT and bard. Heliyon 2024; 10:e27026. [PMID: 38486738 PMCID: PMC10937570 DOI: 10.1016/j.heliyon.2024.e27026] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Revised: 02/16/2024] [Accepted: 02/22/2024] [Indexed: 03/17/2024] Open
Abstract
This study presents a comprehensive analysis comparing the literacy levels of two Generative Artificial Intelligence (GAI) tools, ChatGPT and Bard, using a dataset of 134 questions from the Human Resources (HR) domain. The generated responses are evaluated for accuracy, relevance, and clarity. We find that ChatGPT outperforms Bard in overall accuracy (84.3% vs. 82.8%). This difference in performance suggests that ChatGPT could serve as a robotic advisor in transactional HR roles. In contrast, Bard may possess additional safeguards against misuse in the HR function, making it less capable of generating responses to certain types of questions. Statistical tests reveal that although the two systems differ in their mean accuracy, relevance, and clarity of responses, the observed differences are not always statistically significant, implying that the two tools may be more complementary than competitive. The Pearson correlation coefficients further support this by showing weak to non-existent relationships in performance metrics between the two tools. Confirmation queries did not improve ChatGPT's or Bard's response accuracy. The study thus contributes to emerging research on the utility of GAI tools in Human Resources Management and suggests that involving certified HR professionals in the design phase could enhance underlying language model performance.
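The reported comparison rests on standard accuracy and correlation statistics; the sketch below reproduces that style of analysis on fabricated toy scores, not the study's data.

```python
# Toy reproduction of the comparison style: per-question accuracy plus a
# Pearson correlation between the two tools' scores. All values fabricated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 134  # number of HR questions in the study
chatgpt_scores = rng.binomial(1, 0.843, size=n)  # 1 = correct answer
bard_scores = rng.binomial(1, 0.828, size=n)

print("ChatGPT accuracy:", chatgpt_scores.mean())
print("Bard accuracy:", bard_scores.mean())

r, p = stats.pearsonr(chatgpt_scores, bard_scores)
print(f"Pearson r={r:.3f}, p={p:.3f}")  # weak r suggests complementary tools
```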
Affiliation(s)
- Raghu Raman
  - Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Murale Venugopalan
  - Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Anju Kamal
  - Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India

21
Cirone K, Akrout M, Abid L, Oakley A. Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones. JMIR Dermatol 2024; 7:e55508. [PMID: 38477960 DOI: 10.2196/55508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Revised: 02/16/2024] [Accepted: 03/01/2024] [Indexed: 03/14/2024] Open
Abstract
The large language models GPT-4 Vision and Large Language and Vision Assistant are capable of understanding and accurately differentiating between benign lesions and melanoma, indicating potential incorporation into dermatologic care, medical research, and education.
Affiliation(s)
- Katrina Cirone
  - Schulich School of Medicine and Dentistry, Western University, London, ON, Canada
  - AIPLabs, Budapest, Hungary
- Mohamed Akrout
  - AIPLabs, Budapest, Hungary
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Amanda Oakley
  - Department of Dermatology, Health New Zealand Te Whatu Ora Waikato, Hamilton, New Zealand
  - Department of Medicine, Faculty of Medical and Health Sciences, The University of Auckland, Auckland, New Zealand

22
Nakao T, Miki S, Nakamura Y, Kikuchi T, Nomura Y, Hanaoka S, Yoshikawa T, Abe O. Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study. JMIR Med Educ 2024; 10:e54393. [PMID: 38470459 DOI: 10.2196/54393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 12/26/2023] [Accepted: 02/16/2024] [Indexed: 03/13/2024]
Abstract
BACKGROUND Previous research applying large language models (LLMs) to medicine focused on text-based information. Recently, multimodal variants of LLMs have acquired the capability of recognizing images. OBJECTIVE We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance in answering questions from the 117th Japanese National Medical Licensing Examination. METHODS We focused on 108 questions that had 1 or more images as part of the question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. RESULTS Among the 108 questions with images, GPT-4V's accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. CONCLUSIONS The additional information from the images did not significantly improve the performance of GPT-4V on the Japanese National Medical Licensing Examination.
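The exact McNemar test used here compares paired correct/incorrect outcomes on the same questions under the two conditions. The sketch below shows the test's mechanics on toy counts; the 2x2 cell values are illustrative, not the study's discordant pairs.

```python
# Minimal sketch of the exact McNemar test on paired accuracy data.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Rows: correct/incorrect with images; columns: correct/incorrect without.
table = np.array([
    [65, 8],   # correct with images: also correct / incorrect without
    [13, 22],  # incorrect with images: correct / incorrect without
])

result = mcnemar(table, exact=True)  # exact binomial version of the test
print(f"statistic={result.statistic}, p-value={result.pvalue:.3f}")
```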
Affiliation(s)
- Takahiro Nakao
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Soichiro Miki
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Yuta Nakamura
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Tomohiro Kikuchi
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
  - Department of Radiology, School of Medicine, Jichi Medical University, Shimotsuke, Tochigi, Japan
- Yukihiro Nomura
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
  - Center for Frontier Medical Engineering, Chiba University, Inage-ku, Chiba, Japan
- Shouhei Hanaoka
  - Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Takeharu Yoshikawa
  - Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan
- Osamu Abe
  - Department of Radiology, The University of Tokyo Hospital, Bunkyo-ku, Tokyo, Japan

23
Davis J, Van Bulck L, Durieux BN, Lindvall C. The Temperature Feature of ChatGPT: Modifying Creativity for Clinical Research. JMIR Hum Factors 2024; 11:e53559. [PMID: 38457221 PMCID: PMC10960206 DOI: 10.2196/53559] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2023] [Revised: 12/11/2023] [Accepted: 01/24/2024] [Indexed: 03/09/2024] Open
Abstract
More clinicians and researchers are exploring uses for large language model chatbots, such as ChatGPT, for research, dissemination, and educational purposes. Therefore, it becomes increasingly relevant to consider the full potential of this tool, including the special features that are currently available through the application programming interface. One of these features is a variable called temperature, which changes the degree to which randomness is involved in the model's generated output. This is of particular interest to clinicians and researchers. By lowering this variable, one can generate more consistent outputs; by increasing it, one can receive more creative responses. For clinicians and researchers who are exploring these tools for a variety of tasks, the ability to tailor outputs to be less creative may be beneficial for work that demands consistency. Additionally, access to more creative text generation may enable scientific authors to describe their research in more general language and potentially connect with a broader public through social media. In this viewpoint, we present the temperature feature, discuss potential uses, and provide some examples.
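To make the temperature variable concrete, the sketch below samples the same prompt at a low and a high temperature through the OpenAI API; the model name and prompt are illustrative assumptions.

```python
# Sketch of the temperature setting discussed in this viewpoint: low values
# yield more consistent output, high values more varied ("creative") output.
from openai import OpenAI

client = OpenAI()
prompt = "Summarize why medication adherence matters, in one sentence."

for temperature in (0.0, 1.5):  # OpenAI accepts values from 0 to 2
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"T={temperature}: {reply.choices[0].message.content.strip()}")
```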
Affiliation(s)
- Joshua Davis
  - Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
  - Albany Medical College, Albany, NY, United States
- Liesbet Van Bulck
  - KU Leuven Department of Public Health and Primary Care, KU Leuven-University of Leuven, Leuven, Belgium
  - Research Foundation Flanders (FWO), Brussels, Belgium
- Brigitte N Durieux
  - Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
- Charlotta Lindvall
  - Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, Boston, MA, United States
  - Department of Medicine, Brigham and Women's Hospital, Boston, MA, United States
  - Harvard Medical School, Harvard University, Boston, MA, United States

24
Roster K, Kann RB, Farabi B, Gronbeck C, Brownstone N, Lipner SR. Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions. JMIR Dermatol 2024; 7:e50163. [PMID: 38446502 PMCID: PMC10955394 DOI: 10.2196/50163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Revised: 01/02/2024] [Accepted: 02/06/2024] [Indexed: 03/07/2024] Open
Affiliation(s)
- Katie Roster
  - New York Medical College, New York, NY, United States
- Banu Farabi
  - Dermatology Department, NYC Health + Hospitals/Metropolitan, New York, NY, United States
- Christian Gronbeck
  - Department of Dermatology, University of Connecticut Health Center, Farmington, CT, United States
- Nicholas Brownstone
  - Department of Dermatology, Temple University Hospital, Philadelphia, PA, United States
- Shari R Lipner
  - Department of Dermatology, Weill Cornell Medicine, New York, NY, United States

25
Rodriguez DV, Lawrence K, Gonzalez J, Brandfield-Harvey B, Xu L, Tasneem S, Levine DL, Mann D. Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study. JMIR Hum Factors 2024; 11:e52885. [PMID: 38446539 PMCID: PMC10955400 DOI: 10.2196/52885] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 11/27/2023] [Accepted: 12/15/2023] [Indexed: 03/07/2024] Open
Abstract
BACKGROUND Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. OBJECTIVE This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. METHODS We examined the capacity, advantages, and limitations of ChatGPT to support digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. RESULTS Most metrics received positive scores. We identified that ChatGPT can (1) support developers to achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid and easy-to-build computational solutions for medical technologies. CONCLUSIONS ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. TRIAL REGISTRATION ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500.
Affiliation(s)
- Danissa V Rodriguez
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Katharine Lawrence
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
  - Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
- Javier Gonzalez
  - Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States
- Beatrix Brandfield-Harvey
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Lynn Xu
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Sumaiya Tasneem
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Defne L Levine
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
- Devin Mann
  - Department of Population Health, New York University Grossman School of Medicine, New York, NY, United States
  - Medical Center Information Technology, Department of Health Informatics, New York University Langone Health, New York, NY, United States

26
Li J, Dada A, Puladi B, Kleesiek J, Egger J. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245:108013. [PMID: 38262126 DOI: 10.1016/j.cmpb.2024.108013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Revised: 12/29/2023] [Accepted: 01/08/2024] [Indexed: 01/25/2024]
Abstract
The recent release of ChatGPT, a chatbot research project/product of natural language processing (NLP) by OpenAI, stirred up a sensation among both the general public and medical professionals, amassing a phenomenally large user base in a short time. This is a typical example of the 'productization' of cutting-edge technologies, which allows the general public without a technical background to gain firsthand experience in artificial intelligence (AI), similar to the AI hype created by AlphaGo (DeepMind Technologies, UK) and self-driving cars (Google, Tesla, etc.). However, it is crucial, especially for healthcare researchers, to remain prudent amidst the hype. This work provides a systematic review of existing publications on the use of ChatGPT in healthcare, elucidating the 'status quo' of ChatGPT in medical applications for general readers, healthcare professionals, and NLP scientists. The large biomedical literature database PubMed is used to retrieve published works on this topic using the keyword 'ChatGPT'. An inclusion criterion and a taxonomy are further proposed to filter the search results and categorize the selected publications, respectively. The review finds that the current release of ChatGPT has achieved only moderate or 'passing' performance in a variety of tests and is unreliable for actual clinical deployment, since it is not intended for clinical applications by design. We conclude that specialized NLP models trained on (bio)medical datasets still represent the right direction to pursue for critical clinical applications.
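The PubMed retrieval step described here can be approximated with Biopython's Entrez interface; the sketch below is a minimal illustration, with a placeholder contact address as required by NCBI's usage policy.

```python
# Hedged sketch of a PubMed keyword search like the one described in the
# review, using Biopython's Entrez module.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # placeholder contact address

handle = Entrez.esearch(db="pubmed", term="ChatGPT", retmax=100)
record = Entrez.read(handle)
handle.close()

print("Total matches:", record["Count"])
print("First PMIDs:", record["IdList"][:10])
```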
Affiliation(s)
- Jianning Li
  - Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Amin Dada
  - Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
- Behrus Puladi
  - Institute of Medical Informatics, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
  - Department of Oral and Maxillofacial Surgery, University Hospital RWTH Aachen, Pauwelsstraße 30, 52074 Aachen, Germany
- Jens Kleesiek
  - Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
  - Department of Physics, TU Dortmund University, Otto-Hahn-Straße 4, 44227 Dortmund, Germany
- Jan Egger
  - Institute for Artificial Intelligence in Medicine, University Hospital Essen (AöR), Girardetstraße 2, 45131 Essen, Germany
  - Center for Virtual and Extended Reality in Medicine (ZvRM), University Hospital Essen, University Medicine Essen, Hufelandstraße 55, 45147 Essen, Germany

27
Blease C, Worthen A, Torous J. Psychiatrists' experiences and opinions of generative artificial intelligence in mental healthcare: An online mixed methods survey. Psychiatry Res 2024; 333:115724. [PMID: 38244285 DOI: 10.1016/j.psychres.2024.115724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Revised: 01/02/2024] [Accepted: 01/05/2024] [Indexed: 01/22/2024]
Abstract
Following the launch of ChatGPT in November 2022, interest in large language model (LLM)-powered chatbots has surged, with increasing focus on the clinical potential of these tools. Missing from this discussion, however, are the perspectives of physicians. The current study aimed to explore psychiatrists' experiences and opinions on this new generation of chatbots in mental health care. An online survey including both quantitative and qualitative responses was distributed to a non-probability sample of psychiatrists affiliated with the American Psychiatric Association. Findings revealed that 44% of psychiatrists had used OpenAI's ChatGPT-3.5 and 33% had used GPT-4.0 "to assist with answering clinical questions." Administrative tasks were cited as a major benefit of these tools: 70% somewhat agreed/agreed that "documentation will be/is more efficient". Three in four psychiatrists (75%) somewhat agreed/agreed that "the majority of their patients will consult these tools before first seeing a doctor". Nine in ten somewhat agreed/agreed that clinicians need more support/training in understanding these tools. Open-ended responses reflected these opinions, but respondents also expressed divergent views on the value of generative AI in clinical practice, including its impact on the future of the profession.
Affiliation(s)
- Charlotte Blease
  - Participatory eHealth and Health Data Research Group, Department of Women's and Children's Health, Uppsala University, Uppsala, Sweden
  - Digital Psychiatry, Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA 02115, USA
- John Torous
  - Digital Psychiatry, Department of Psychiatry, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA 02115, USA

28
Kantor J. Best practices for implementing ChatGPT, large language models, and artificial intelligence in qualitative and survey-based research. JAAD Int 2024; 14:22-23. [PMID: 38054196 PMCID: PMC10694559 DOI: 10.1016/j.jdin.2023.10.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/07/2023] Open
Affiliation(s)
- Jonathan Kantor
  - Correspondence to: Jonathan Kantor, MD, MSc, MSCE, 1301 Plantation Island Dr S, St Augustine, FL 32080

29
Hakam HT, Prill R, Korte L, Lovreković B, Ostojić M, Ramadanov N, Muehlensiepen F. Human-Written vs AI-Generated Texts in Orthopedic Academic Literature: Comparative Qualitative Analysis. JMIR Form Res 2024; 8:e52164. [PMID: 38363631 PMCID: PMC10907945 DOI: 10.2196/52164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Revised: 11/09/2023] [Accepted: 12/13/2023] [Indexed: 02/17/2024] Open
Abstract
BACKGROUND As large language models (LLMs) are becoming increasingly integrated into different aspects of health care, questions about the implications for medical academic literature have begun to emerge. Key aspects such as authenticity in academic writing are at stake with artificial intelligence (AI) generating highly linguistically accurate and grammatically sound texts. OBJECTIVE The objective of this study is to compare human-written with AI-generated scientific literature in orthopedics and sports medicine. METHODS Five original abstracts were selected from the PubMed database. These abstracts were subsequently rewritten with the assistance of 2 LLMs with different degrees of proficiency. Subsequently, researchers with varying degrees of expertise and with different areas of specialization were asked to rank the abstracts according to linguistic and methodological parameters. Finally, researchers had to classify the articles as AI generated or human written. RESULTS Neither the researchers nor the AI-detection software could successfully identify the AI-generated texts. Furthermore, the criteria previously suggested in the literature did not correlate with whether the researchers deemed a text to be AI generated or whether they judged the article correctly based on these parameters. CONCLUSIONS The primary finding of this study was that researchers were unable to distinguish between LLM-generated and human-written texts. However, due to the small sample size, it is not possible to generalize the results of this study. As is the case with any tool used in academic research, the potential to cause harm can be mitigated by relying on the transparency and integrity of the researchers. With scientific integrity at stake, further research with a similar study design should be conducted to determine the magnitude of this issue.
Affiliation(s)
- Hassan Tarek Hakam
  - Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Brandenburg an der Havel, Germany
  - Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
  - Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
- Robert Prill
  - Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
  - Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
- Lisa Korte
  - Center of Health Services Research, Faculty of Health Sciences, University Clinic of Brandenburg, Rüdersdorf bei Berlin, Germany
- Bruno Lovreković
  - Faculty of Orthopaedics, University Hospital Merkur, Zagreb, Croatia
- Marko Ostojić
  - Department of Orthopaedics, University Hospital Mostar, Mostar, Bosnia and Herzegovina
- Nikolai Ramadanov
  - Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Brandenburg an der Havel, Germany
  - Faculty of Health Sciences, University Clinic of Brandenburg, Brandenburg an der Havel, Germany
- Felix Muehlensiepen
  - Center of Evidence Based Practice in Brandenburg, a JBI Affiliated Group, Brandenburg an der Havel, Germany
  - Center of Health Services Research, Faculty of Health Sciences, University Clinic of Brandenburg, Rüdersdorf bei Berlin, Germany

30
Kanoulas D, Khattak S, Loianno G. Editorial: Rising stars in field robotics: 2022. Front Robot AI 2024; 11:1379661. [PMID: 38419614 PMCID: PMC10900065 DOI: 10.3389/frobt.2024.1379661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Accepted: 02/05/2024] [Indexed: 03/02/2024] Open
31
Shah A, Wahood S, Guermazi D, Brem CE, Saliba E. Skin and Syntax: Large Language Models in Dermatopathology. Dermatopathology (Basel) 2024; 11:101-111. [PMID: 38390851 PMCID: PMC10885095 DOI: 10.3390/dermatopathology11010009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 02/04/2024] [Accepted: 02/08/2024] [Indexed: 02/24/2024] Open
Abstract
This literature review introduces the integration of Large Language Models (LLMs) in the field of dermatopathology, outlining their potential benefits, challenges, and prospects. It discusses the changing landscape of dermatopathology with the emergence of LLMs. The potential advantages of LLMs include streamlined generation of pathology reports, the ability to learn and provide up-to-date information, and simplified patient education. Existing applications of LLMs encompass diagnostic support, research acceleration, and trainee education. Challenges involve biases, data privacy and quality, and establishing a balance between AI and dermatopathological expertise. Prospects include the integration of LLMs with other AI technologies to improve diagnostics and the development of multimodal LLMs that can handle both text and image input. Our implementation guidelines highlight the importance of model transparency and interpretability, data quality, and continuous oversight. The transformative potential of LLMs in dermatopathology is underscored, with an emphasis on a dynamic collaboration between artificial intelligence (AI) experts (technical specialists) and dermatopathologists (clinicians) for improved patient outcomes.
Affiliation(s)
- Asghar Shah
  - Department of Dermatology, The Warren Alpert Medical School, Brown University, Providence, RI 02903, USA
- Samer Wahood
  - Department of Dermatology, The Warren Alpert Medical School, Brown University, Providence, RI 02903, USA
- Dorra Guermazi
  - Department of Dermatology, The Warren Alpert Medical School, Brown University, Providence, RI 02903, USA
- Candice E Brem
  - Section of Dermatopathology, Department of Dermatology, Chobanian and Avedisian School of Medicine, Boston University, Boston, MA 02118, USA
- Elie Saliba
  - Department of Dermatology, The Warren Alpert Medical School, Brown University, Providence, RI 02903, USA
  - Department of Dermatology, Gilbert and Rose-Marie Chagoury School of Medicine, Lebanese American University, Beirut 13-5053, Lebanon

32
Meyer A, Riese J, Streichert T. Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study. JMIR Med Educ 2024; 10:e50965. [PMID: 38329802 PMCID: PMC10884900 DOI: 10.2196/50965] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Revised: 11/14/2023] [Accepted: 12/11/2023] [Indexed: 02/09/2024]
Abstract
BACKGROUND The potential of artificial intelligence (AI)-based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. OBJECTIVE This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. METHODS To assess GPT-3.5's and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. RESULTS GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which passed only 1 of the 3 examinations. While GPT-3.5 performed well on psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. CONCLUSIONS The study results highlight ChatGPT's remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While its predecessor, GPT-3.5, was imprecise and inconsistent, GPT-4 demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population.
Affiliation(s)
- Annika Meyer
  - Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany
- Janik Riese
  - Department of General Surgery, Visceral, Thoracic and Vascular Surgery, University Hospital Greifswald, Greifswald, Germany
- Thomas Streichert
  - Institute for Clinical Chemistry, University Hospital Cologne, Cologne, Germany

33
Elyoseph Z, Refoua E, Asraf K, Lvovsky M, Shimoni Y, Hadar-Shoval D. Capacity of Generative AI to Interpret Human Emotions From Visual and Textual Data: Pilot Evaluation Study. JMIR Ment Health 2024; 11:e54369. [PMID: 38319707 PMCID: PMC10879976 DOI: 10.2196/54369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 11/07/2023] [Revised: 12/09/2023] [Accepted: 12/25/2023] [Indexed: 02/07/2024] Open
Abstract
BACKGROUND Mentalization, which is integral to human cognitive processes, pertains to the interpretation of one's own and others' mental states, including emotions, beliefs, and intentions. With the advent of artificial intelligence (AI) and the prominence of large language models in mental health applications, questions persist about their aptitude in emotional comprehension. The prior iteration of the large language model from OpenAI, ChatGPT-3.5, demonstrated an advanced capacity to interpret emotions from textual data, surpassing human benchmarks. Given the introduction of ChatGPT-4, with its enhanced visual processing capabilities, and considering Google Bard's existing visual functionalities, a rigorous assessment of their proficiency in visual mentalizing is warranted. OBJECTIVE The aim of the research was to critically evaluate the capabilities of ChatGPT-4 and Google Bard with regard to their competence in discerning visual mentalizing indicators as contrasted with their textual-based mentalizing abilities. METHODS The Reading the Mind in the Eyes Test developed by Baron-Cohen and colleagues was used to assess the models' proficiency in interpreting visual emotional indicators. Simultaneously, the Levels of Emotional Awareness Scale was used to evaluate the large language models' aptitude in textual mentalizing. Collating data from both tests provided a holistic view of the mentalizing capabilities of ChatGPT-4 and Bard. RESULTS ChatGPT-4, displaying a pronounced ability in emotion recognition, secured scores of 26 and 27 in 2 distinct evaluations, significantly deviating from a random response paradigm (P<.001). These scores align with established benchmarks from the broader human demographic. Notably, ChatGPT-4 exhibited consistent responses, with no discernible biases pertaining to the sex of the model or the nature of the emotion. In contrast, Google Bard's performance aligned with random response patterns, securing scores of 10 and 12 and rendering further detailed analysis redundant. In the domain of textual analysis, both ChatGPT and Bard surpassed established benchmarks from the general population, with their performances being remarkably congruent. CONCLUSIONS ChatGPT-4 proved its efficacy in the domain of visual mentalizing, aligning closely with human performance standards. Although both models displayed commendable acumen in textual emotion interpretation, Bard's capabilities in visual emotion interpretation necessitate further scrutiny and potential refinement. This study stresses the criticality of ethical AI development for emotional recognition, highlighting the need for inclusive data, collaboration with patients and mental health experts, and stringent governmental oversight to ensure transparency and protect patient privacy.
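Whether a score such as 26 deviates significantly from random responding can be checked with a simple binomial test. The sketch below assumes the standard 36-item Reading the Mind in the Eyes Test with 4 response options (chance = 0.25); those instrument details are assumptions on our part, and the score mirrors the reported result only for illustration.

```python
# Sketch: test an RMET score against chance-level responding.
from scipy.stats import binomtest

# 26 correct out of an assumed 36 items, chance probability 1/4 per item.
result = binomtest(k=26, n=36, p=0.25, alternative="greater")
print(f"p-value vs chance responding: {result.pvalue:.2e}")
```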
Affiliation(s)
- Zohar Elyoseph
  - Department of Educational Psychology, The Center for Psychobiological Research, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
  - Imperial College London, London, United Kingdom
- Elad Refoua
  - Department of Psychology, Bar-Ilan University, Ramat Gan, Israel
- Kfir Asraf
  - Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Maya Lvovsky
  - Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Yoav Shimoni
  - Boston Children's Hospital, Boston, MA, United States
- Dorit Hadar-Shoval
  - Department of Psychology, The Max Stern Yezreel Valley College, Emek Yezreel, Israel

34
Kavadella A, Dias da Silva MA, Kaklamanos EG, Stamatopoulos V, Giannakopoulos K. Evaluation of ChatGPT's Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study. JMIR Med Educ 2024; 10:e51344. [PMID: 38111256 PMCID: PMC10867750 DOI: 10.2196/51344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/28/2023] [Revised: 10/28/2023] [Accepted: 12/11/2023] [Indexed: 12/20/2023]
Abstract
BACKGROUND The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, to our knowledge, no real-life implementation of ChatGPT in the educational process has been reported so far. OBJECTIVE This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. METHODS In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on "Radiation Biology and Radiation Protection in the Dental Office," working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. One group searched the internet for scientific resources to perform the task, and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, the grades of the 2 groups were compared statistically, and the free-text comments of the questionnaires were thematically analyzed. RESULTS Of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple-choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students in the ChatGPT group performed significantly better (P=.045) than students in the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing prompts to get a relevant answer, general content, false citations, and incapability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT. CONCLUSIONS Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Regarding implications for practice, the study underscores the adaptability of students to technological innovations, including ChatGPT, and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use.
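The grade comparison above uses a Mann-Whitney U test; the sketch below shows the computation on fabricated toy grades on the study's 0-10 scale, not the actual results.

```python
# Toy Mann-Whitney U comparison of examination grades between two groups.
from scipy.stats import mannwhitneyu

chatgpt_group = [8, 9, 7, 10, 8, 9, 8, 7, 9, 10]      # fabricated grades
literature_group = [7, 8, 6, 9, 7, 8, 7, 6, 8, 7]     # fabricated grades

stat, p = mannwhitneyu(chatgpt_group, literature_group, alternative="two-sided")
print(f"U={stat}, p={p:.3f}")  # p < .05 would indicate a group difference
```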
Affiliation(s)
- Argyro Kavadella
  - School of Dentistry, European University Cyprus, Nicosia, Cyprus
- Marco Antonio Dias da Silva
  - Research Group of Teleducation and Teledentistry, Federal University of Campina Grande, Campina Grande, Brazil
- Eleftherios G Kaklamanos
  - School of Dentistry, European University Cyprus, Nicosia, Cyprus
  - School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
  - Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates
- Vasileios Stamatopoulos
  - Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece

35
Herrmann-Werner A, Festl-Wietek T, Holderried F, Herschbach L, Griewatz J, Masters K, Zipfel S, Mahling M. Assessing ChatGPT's Mastery of Bloom's Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study. J Med Internet Res 2024; 26:e52113. [PMID: 38261378 PMCID: PMC10848129 DOI: 10.2196/52113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 09/15/2023] [Accepted: 12/07/2023] [Indexed: 01/24/2024] Open
Abstract
BACKGROUND Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to "hallucinations" (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom's taxonomy. OBJECTIVE This study aims to explore how GPT-4 performs in terms of Bloom's taxonomy using psychosomatic medicine exam questions. METHODS We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using both quantitative and qualitative approaches. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom's taxonomy. RESULTS GPT-4's performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. Questions answered correctly by GPT-4 had a significantly higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4's lowest exam performance was 78.9% (15/19), thereby always surpassing the "pass" threshold. Our qualitative analysis of incorrect answers, based on Bloom's taxonomy, showed that errors were primarily at the "remember" (29/68) and "understand" (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. CONCLUSIONS GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom's taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood.
Affiliation(s)
- Anne Herrmann-Werner
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
  - Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
- Teresa Festl-Wietek
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Friederike Holderried
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
  - University Department of Anesthesiology and Intensive Care Medicine, University Hospital Tübingen, Tübingen, Germany
- Lea Herschbach
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Jan Griewatz
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
- Ken Masters
  - Medical Education and Informatics Department, College of Medicine and Health Sciences, Sultan Qaboos University, Muscat, Oman
- Stephan Zipfel
  - Department of Psychosomatic Medicine and Psychotherapy, University Hospital Tübingen, Tübingen, Germany
- Moritz Mahling
  - Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Tübingen, Germany
  - Department of Diabetology, Endocrinology, Nephrology, Section of Nephrology and Hypertension, University Hospital Tübingen, Tübingen, Germany

36
Long C, Lowe K, Zhang J, Santos AD, Alanazi A, O'Brien D, Wright ED, Cote D. A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: Performance Study. JMIR Med Educ 2024; 10:e49970. [PMID: 38227351 PMCID: PMC10828939 DOI: 10.2196/49970] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/16/2023] [Revised: 11/04/2023] [Accepted: 11/07/2023] [Indexed: 01/17/2024]
Abstract
BACKGROUND ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology-head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported. OBJECTIVE We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model's performance on open-ended medical board examination questions. METHODS Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada's sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance. RESULTS In the open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed. CONCLUSIONS ChatGPT achieved a passing score on the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation.
Affiliation(s)
- Cai Long
  - Division of Otolaryngology-Head and Neck Surgery, University of Alberta, Edmonton, AB, Canada
- Kayle Lowe
  - Faculty of Medicine, University of Alberta, Edmonton, AB, Canada
- Jessica Zhang
  - Faculty of Medicine, University of Alberta, Edmonton, AB, Canada
- Alaa Alanazi
  - Division of Otolaryngology-Head and Neck Surgery, University of Alberta, Edmonton, AB, Canada
- Daniel O'Brien
  - Department of Surgery, Creighton University, Omaha, NE, United States
- Erin D Wright
  - Division of Otolaryngology-Head and Neck Surgery, University of Alberta, Edmonton, AB, Canada
- David Cote
  - Division of Otolaryngology-Head and Neck Surgery, University of Alberta, Edmonton, AB, Canada
37
Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. J Med Internet Res 2024; 26:e48996. [PMID: 38214966 PMCID: PMC10818236 DOI: 10.2196/48996] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/14/2023] [Revised: 08/30/2023] [Accepted: 09/28/2023] [Indexed: 01/13/2024] Open
Abstract
BACKGROUND The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources. OBJECTIVE This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and comparing their performance against ground truth labeling by 2 independent human reviewers. METHODS We introduce a novel workflow using the ChatGPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts. RESULTS Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications. CONCLUSIONS Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research.
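The screening workflow described here is straightforward to sketch. Below is a minimal, illustrative version of such an API call, assuming the current OpenAI Python client; the model name, criteria text, and record format are placeholders, not the authors' actual script.

```python
# Illustrative LLM screening call in the spirit of this workflow.
# Assumptions: OpenAI Python client >= 1.0, OPENAI_API_KEY in the
# environment; model name and criteria are placeholders.
from openai import OpenAI

client = OpenAI()

CRITERIA = (
    "Include randomized controlled trials in adult patients; "
    "exclude animal studies, case reports, and conference abstracts."
)

def screen(title: str, abstract: str, model: str = "gpt-4") -> bool:
    """Return True if the model votes to include the record."""
    prompt = (
        f"Screening criteria: {CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("INCLUDE")

# e.g. decisions = [screen(r["title"], r["abstract"]) for r in records]
```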
Affiliation(s)
- Eddie Guo
  - Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Mehul Gupta
  - Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Jiawen Deng
  - Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Ye-Jean Park
  - Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
- Michael Paget
  - Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
38
Kollitsch L, Eredics K, Marszalek M, Rauchenwald M, Brookman-May SD, Burger M, Körner-Riffard K, May M. How does artificial intelligence master urological board examinations? A comparative analysis of different Large Language Models' accuracy and reliability in the 2022 In-Service Assessment of the European Board of Urology. World J Urol 2024; 42:20. [PMID: 38197996 DOI: 10.1007/s00345-023-04749-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2023] [Accepted: 11/02/2023] [Indexed: 01/11/2024] Open
Abstract
PURPOSE This study is a comparative analysis of three Large Language Models (LLMs) evaluating their rate of correct answers (RoCA) and the reliability of generated answers on a set of urological knowledge-based questions spanning different levels of complexity. METHODS ChatGPT-3.5, ChatGPT-4, and Bing AI underwent two testing rounds, with a 48-h gap in between, using the 100 multiple-choice questions from the 2022 European Board of Urology (EBU) In-Service Assessment (ISA). For conflicting responses, an additional consensus round was conducted to establish conclusive answers. RoCA was compared across various question complexities. Ten weeks after the consensus round, a subsequent testing round was conducted to assess potential knowledge gain and improvement in RoCA. RESULTS Over three testing rounds, ChatGPT-3.5 achieved RoCA scores of 58%, 62%, and 59%. In contrast, ChatGPT-4 achieved RoCA scores of 63%, 77%, and 77%, while Bing AI yielded scores of 81%, 73%, and 77%, respectively. Agreement rates between rounds 1 and 2 were 84% (κ = 0.67, p < 0.001) for ChatGPT-3.5, 74% (κ = 0.40, p < 0.001) for ChatGPT-4, and 76% (κ = 0.33, p < 0.001) for Bing AI. In the consensus round, ChatGPT-4 and Bing AI significantly outperformed ChatGPT-3.5 (77% and 77% vs. 59%, both p = 0.010). All LLMs demonstrated decreasing RoCA scores with increasing question complexity (p < 0.001). In the fourth round, no significant improvement in RoCA was observed across all three LLMs. CONCLUSIONS The performance of the tested LLMs in addressing urological specialist inquiries warrants further refinement. Moreover, the deficiency in response reliability contributes to existing challenges related to their current utility for educational purposes.
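The reported reliability statistics are standard: percent agreement plus Cohen's kappa over paired answer vectors from two testing rounds. A minimal sketch, assuming scikit-learn and invented five-question answer lists rather than the study's data:

```python
# Percent agreement and Cohen's kappa between two testing rounds.
# The five-question answer vectors are invented placeholders.
from sklearn.metrics import cohen_kappa_score

round1 = ["A", "C", "B", "D", "A"]  # answers from testing round 1
round2 = ["A", "C", "D", "D", "A"]  # same questions, second round

agreement = sum(a == b for a, b in zip(round1, round2)) / len(round1)
kappa = cohen_kappa_score(round1, round2)
print(f"agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```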
Affiliation(s)
- Lisa Kollitsch
  - Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Klaus Eredics
  - Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
  - Department of Urology, Paracelsus Medical University, Salzburg, Austria
- Martin Marszalek
  - Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
- Michael Rauchenwald
  - Department of Urology and Andrology, Klinik Donaustadt, Vienna, Austria
  - European Board of Urology, Arnhem, The Netherlands
- Sabine D Brookman-May
  - Department of Urology, University of Munich, LMU, Munich, Germany
  - Johnson and Johnson Innovative Medicine, Research and Development, Spring House, PA, USA
- Maximilian Burger
  - Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Katharina Körner-Riffard
  - Department of Urology, Caritas St. Josef Medical Centre, University of Regensburg, Regensburg, Germany
- Matthias May
  - Department of Urology, St. Elisabeth Hospital Straubing, Brothers of Mercy Hospital, Straubing, Germany
39
Hellström T. AI and its consequences for the written word. Front Artif Intell 2024; 6:1326166. [PMID: 38239498 PMCID: PMC10794589 DOI: 10.3389/frai.2023.1326166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/22/2023] [Accepted: 12/07/2023] [Indexed: 01/22/2024] Open
Abstract
The latest developments of chatbots driven by Large Language Models (LLMs), more specifically ChatGPT, have shaken the foundations of how text is created, and may drastically reduce and change the need, ability, and valuation of human writing. Furthermore, our trust in the written word is likely to decrease, as an increasing proportion of all written text will be AI-generated - and potentially incorrect. In this essay, I discuss these implications and possible scenarios for us humans, and for AI itself.
40
Wu G, Zhao W, Wong A, Lee DA. Patients with floaters: Answers from virtual assistants and large language models. Digit Health 2024; 10:20552076241229933. [PMID: 38362238 PMCID: PMC10868475 DOI: 10.1177/20552076241229933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 12/28/2023] [Indexed: 02/17/2024] Open
Abstract
Objectives "Floaters," a common complaint among patients of all ages, was used as a query term because it affects 30% of all people searching for eye care. The American Academy of Ophthalmology website's "floaters" section was used as a source for questions and answers (www.aao.org). Floaters is a visual obstruction that moves with the movement of the eye. They can be associated with retinal detachment, which can lead to vision loss. With the advent of large language model (LLM) chatbots ChatGPT, Bard versus virtual assistants (VA), Google Assistant, and Alexa, we analyzed their responses to "floaters." Methods Using AAO.org, "Public & Patients," and its related subsection, "EyeHealth A-Z": Floaters and Flashes link, we asked four questions: (1) What are floaters? (2) What are flashes? (3) Flashes and Migraines? (4) Floaters and Flashes Treatment? to ChatGPT, Bard, Google Assistant, and Alexa. The American Academy of Ophthalmology (AAO) keywords were identified if they were highlighted. The "Flesch-Kincaid Grade Level" formula approved by the U.S. Department of Education, was used to evaluate the reading comprehension level for the responses. Results Of the chatbots and virtual assistants, Google Assistant is the only one that uses the term "ophthalmologist." There is no mention of the urgency or emergency nature of floaters. AAO.org shows a lower reading level vs the LLMs and VA (p = .11). The reading comprehension levels of ChatGPT, Bard, Google Assistant, and Alexa are higher (12.3, 9.7, 13.1, 8.1 grade) vs the AAO.org (7.3 grade). There is a higher word count for LLMs vs VA (p < .0286). Conclusion Currently, ChatGPT, Bard, Google Assistant, and Alexa are similar. Factual information is present but all miss the urgency of the diagnosis of a retinal detachment. Translational relevance: Both the LLM and virtual assistants are free and our patients will use them to obtain "floaters" information. There may be errors of omission with ChatGPT and a lack of urgency to seek a physician's care.
Affiliation(s)
- Gloria Wu
  - Department of Ophthalmology, University of California San Francisco School of Medicine, San Francisco, California, USA
- Weichen Zhao
  - University of California, Davis, Davis, California, USA
- Adrial Wong
  - University of California, Davis, Davis, California, USA
- David A Lee
  - University of Texas Health Science Center at Houston, McGovern Medical School, Houston, Texas, USA
41
Lee KH, Lee RW, Kwon YE. Validation of a Deep Learning Chest X-ray Interpretation Model: Integrating Large-Scale AI and Large Language Models for Comparative Analysis with ChatGPT. Diagnostics (Basel) 2023; 14:90. [PMID: 38201398 PMCID: PMC10795741 DOI: 10.3390/diagnostics14010090] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2023] [Revised: 12/28/2023] [Accepted: 12/29/2023] [Indexed: 01/12/2024] Open
Abstract
This study evaluates the diagnostic accuracy and clinical utility of two artificial intelligence (AI) techniques: Kakao Brain Artificial Neural Network for Chest X-ray Reading (KARA-CXR), an assistive technology developed using large-scale AI and large language models (LLMs), and ChatGPT, a well-known LLM. The study was conducted to validate the performance of the two technologies in chest X-ray reading and explore their potential applications in the medical imaging diagnosis domain. The study methodology consisted of randomly selecting 2000 chest X-ray images from a single institution's patient database, and two radiologists evaluated the readings provided by KARA-CXR and ChatGPT. The study used five qualitative factors to evaluate the readings generated by each model: accuracy, false findings, location inaccuracies, count inaccuracies, and hallucinations. Statistical analysis showed that KARA-CXR achieved significantly higher diagnostic accuracy compared to ChatGPT. In the 'Acceptable' accuracy category, KARA-CXR was rated at 70.50% and 68.00% by two observers, while ChatGPT achieved 40.50% and 47.00%. Interobserver agreement was moderate for both systems, with KARA at 0.74 and GPT4 at 0.73. For 'False Findings', KARA-CXR scored 68.00% and 68.50%, while ChatGPT scored 37.00% for both observers, with high interobserver agreements of 0.96 for KARA and 0.97 for GPT4. In 'Location Inaccuracy' and 'Hallucinations', KARA-CXR outperformed ChatGPT with significant margins. KARA-CXR demonstrated a non-hallucination rate of 75%, which is significantly higher than ChatGPT's 38%. The interobserver agreement was high for KARA (0.91) and moderate to high for GPT4 (0.85) in the hallucination category. In conclusion, this study demonstrates the potential of AI and large-scale language models in medical imaging and diagnostics. It also shows that in the chest X-ray domain, KARA-CXR has relatively higher accuracy than ChatGPT.
Affiliation(s)
- Ro Woon Lee
  - Department of Radiology, College of Medicine, Inha University, Incheon 22212, Republic of Korea
42
Cheng SL, Tsai SJ, Bai YM, Ko CH, Hsu CW, Yang FC, Tsai CK, Tu YK, Yang SN, Tseng PT, Hsu TW, Liang CS, Su KP. Comparisons of Quality, Correctness, and Similarity Between ChatGPT-Generated and Human-Written Abstracts for Basic Research: Cross-Sectional Study. J Med Internet Res 2023; 25:e51229. [PMID: 38145486 PMCID: PMC10760418 DOI: 10.2196/51229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2023] [Revised: 10/17/2023] [Accepted: 11/20/2023] [Indexed: 12/26/2023] Open
Abstract
BACKGROUND ChatGPT may act as a research assistant to help organize the direction of thinking and summarize research findings. However, few studies have examined the quality, similarity (abstracts being similar to the original one), and accuracy of the abstracts generated by ChatGPT when researchers provide full-text basic research papers. OBJECTIVE We aimed to assess the applicability of an artificial intelligence (AI) model in generating abstracts for basic preclinical research. METHODS We selected 30 basic research papers from Nature, Genome Biology, and Biological Psychiatry. Excluding abstracts, we inputted the full text into ChatPDF, an application of a language model based on ChatGPT, and we prompted it to generate abstracts with the same style as used in the original papers. A total of 8 experts were invited to evaluate the quality of these abstracts (based on a Likert scale of 0-10) and identify which abstracts were generated by ChatPDF, using a blind approach. These abstracts were also evaluated for their similarity to the original abstracts and the accuracy of the AI content. RESULTS The quality of ChatGPT-generated abstracts was lower than that of the actual abstracts (10-point Likert scale: mean 4.72, SD 2.09 vs mean 8.09, SD 1.03; P<.001). The difference in quality was significant in the unstructured format (mean difference -4.33; 95% CI -4.79 to -3.86; P<.001) but minimal in the 4-subheading structured format (mean difference -2.33; 95% CI -2.79 to -1.86). Among the 30 ChatGPT-generated abstracts, 3 showed wrong conclusions, and 10 were identified as AI content. The mean percentage of similarity between the original and the generated abstracts was not high (2.10%-4.40%). The blinded reviewers achieved a 93% (224/240) accuracy rate in guessing which abstracts were written using ChatGPT. CONCLUSIONS Using ChatGPT to generate a scientific abstract may not lead to issues of similarity when using real full texts written by humans. However, the quality of the ChatGPT-generated abstracts was suboptimal, and their accuracy was not 100%.
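The abstract does not specify how the similarity percentages were computed. As a generic illustration of one such measure, the sketch below reports what fraction of a generated abstract's word 5-grams also appear in the original; it is a stand-in, not the study's method.

```python
# A generic text-overlap measure: the share of the generated abstract's
# word 5-grams that also occur in the original. A stand-in illustration,
# not the similarity tool used in the study.
def ngram_overlap(generated: str, original: str, n: int = 5) -> float:
    def shingles(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    gen, orig = shingles(generated), shingles(original)
    return 100 * len(gen & orig) / max(1, len(gen))

# e.g. print(f"{ngram_overlap(generated_abstract, original_abstract):.1f}% overlap")
```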
Affiliation(s)
- Shu-Li Cheng
  - Department of Nursing, Mackay Medical College, Taipei, Taiwan
- Shih-Jen Tsai
  - Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan
  - Division of Psychiatry, School of Medicine, National Yang-Ming University, Taipei, Taiwan
- Ya-Mei Bai
  - Department of Psychiatry, Taipei Veterans General Hospital, Taipei, Taiwan
  - Division of Psychiatry, School of Medicine, National Yang-Ming University, Taipei, Taiwan
- Chih-Hung Ko
  - Department of Psychiatry, Kaohsiung Medical University Hospital, Kaohsiung, Taiwan
  - Department of Psychiatry, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
  - Department of Psychiatry, Kaohsiung Municipal Siaogang Hospital, Kaohsiung Medical University, Kaohsiung, Taiwan
- Chih-Wei Hsu
  - Department of Psychiatry, Kaohsiung Chang Gung Memorial Hospital, Kaohsiung, Taiwan
- Fu-Chi Yang
  - Department of Neurology, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Chia-Kuang Tsai
  - Department of Neurology, Tri-Service General Hospital, National Defense Medical Center, Taipei, Taiwan
- Yu-Kang Tu
  - Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
  - Department of Dentistry, National Taiwan University Hospital, Taipei, Taiwan
- Szu-Nian Yang
  - Department of Psychiatry, Tri-Service Hospital, Beitou Branch, Taipei, Taiwan
  - Department of Psychiatry, Armed Forces Taoyuan General Hospital, Taoyuan, Taiwan
  - Graduate Institute of Health and Welfare Policy, National Yang Ming Chiao Tung University, Taipei, Taiwan
- Ping-Tao Tseng
  - Institute of Biomedical Sciences, Institute of Precision Medicine, National Sun Yat-sen University, Kaohsiung, Taiwan
  - Department of Psychology, College of Medical and Health Science, Asia University, Taichung, Taiwan
  - Prospect Clinic for Otorhinolaryngology and Neurology, Kaohsiung, Taiwan
- Tien-Wei Hsu
  - Department of Psychiatry, E-Da Dachang Hospital, I-Shou University, Kaohsiung, Taiwan
  - Department of Psychiatry, E-Da Hospital, I-Shou University, Kaohsiung, Taiwan
- Chih-Sung Liang
  - Department of Psychiatry, Tri-Service Hospital, Beitou Branch, Taipei, Taiwan
  - Department of Psychiatry, National Defense Medical Center, Taipei, Taiwan
- Kuan-Pin Su
  - College of Medicine, China Medical University, Taichung, Taiwan
  - Mind-Body Interface Laboratory, China Medical University and Hospital, Taichung, Taiwan
  - An-Nan Hospital, China Medical University, Tainan, Taiwan
43
Ali S, Chourasia P, Patterson M. When Protein Structure Embedding Meets Large Language Models. Genes (Basel) 2023; 15:25. [PMID: 38254915 PMCID: PMC10815811 DOI: 10.3390/genes15010025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2023] [Revised: 12/16/2023] [Accepted: 12/21/2023] [Indexed: 01/24/2024] Open
Abstract
Protein structure analysis is essential in various bioinformatics domains such as drug discovery, disease diagnosis, and evolutionary studies. Within structural biology, the classification of protein structures is pivotal, employing machine learning algorithms to categorize structures based on data from databases like the Protein Data Bank (PDB). To predict protein functions, embeddings based on protein sequences have been employed. Creating numerical embeddings that preserve vital information while considering protein structure and sequence presents several challenges. The existing literature lacks a comprehensive and effective approach that combines structural and sequence-based features to achieve efficient protein classification. While large language models (LLMs) have exhibited promising outcomes for protein function prediction, their focus primarily lies on protein sequences, disregarding the 3D structures of proteins. The quality of embeddings heavily relies on how well the geometry of the embedding space aligns with the underlying data structure, posing a critical research question. Traditionally, Euclidean space has served as a widely utilized framework for embeddings. In this study, we propose a novel method for designing numerical embeddings in Euclidean space for proteins by leveraging 3D structure information, specifically employing the concept of contact maps. These embeddings are synergistically combined with features extracted from LLMs and traditional feature engineering techniques to enhance the performance of embeddings in supervised protein analysis. Experimental results on benchmark datasets, including PDB Bind and STCRDAB, demonstrate the superior performance of the proposed method for protein function prediction.
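A contact map, the structural representation the authors build on, is simple to compute from C-alpha coordinates: a binary matrix marking residue pairs closer than a distance threshold. A minimal sketch, using random coordinates in place of a parsed PDB structure and a common (not canonical) 8 Å cutoff:

```python
# Binary contact map from C-alpha coordinates: residues i and j are in
# contact if their atoms lie within a distance threshold. Random
# coordinates stand in for a parsed PDB structure.
import numpy as np

def contact_map(ca_coords: np.ndarray, threshold: float = 8.0) -> np.ndarray:
    """ca_coords: (n_residues, 3) array of C-alpha positions."""
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    return (dist < threshold).astype(np.uint8)

coords = np.random.default_rng(0).random((50, 3)) * 30.0  # placeholder coordinates
cmap = contact_map(coords)
print(cmap.shape)  # (50, 50) symmetric binary matrix, usable as an embedding input
```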
Affiliation(s)
- Murray Patterson
  - Department of Computer Science, Georgia State University, Atlanta, GA 30303, USA
44
Pasquiou A, Lakretz Y, Thirion B, Pallier C. Information-Restricted Neural Language Models Reveal Different Brain Regions' Sensitivity to Semantics, Syntax, and Context. Neurobiol Lang (Camb) 2023; 4:611-636. [PMID: 38144237 PMCID: PMC10745090 DOI: 10.1162/nol_a_00125] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/23/2023] [Accepted: 09/28/2023] [Indexed: 12/26/2023]
Abstract
A fundamental question in neurolinguistics concerns the brain regions involved in syntactic and semantic processing during speech comprehension, both at the lexical (word processing) and supra-lexical levels (sentence and discourse processing). To what extent are these regions separated or intertwined? To address this question, we introduce a novel approach exploiting neural language models to generate high-dimensional feature sets that separately encode semantic and syntactic information. More precisely, we train a lexical language model, GloVe, and a supra-lexical language model, GPT-2, on a text corpus from which we selectively removed either syntactic or semantic information. We then assess to what extent the features derived from these information-restricted models are still able to predict the fMRI time courses of humans listening to naturalistic text. Furthermore, to determine the windows of integration of brain regions involved in supra-lexical processing, we manipulate the size of contextual information provided to GPT-2. The analyses show that, while most brain regions involved in language comprehension are sensitive to both syntactic and semantic features, the relative magnitudes of these effects vary across these regions. Moreover, regions that are best fitted by semantic or syntactic features are more spatially dissociated in the left hemisphere than in the right one, and the right hemisphere shows sensitivity to longer contexts than the left. The novelty of our approach lies in the ability to control for the information encoded in the models' embeddings by manipulating the training set. These "information-restricted" models complement previous studies that used language models to probe the neural bases of language, and shed new light on its spatial organization.
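The underlying encoding-model recipe, regressing voxel time courses on model-derived features and scoring held-out prediction, can be sketched generically; the arrays below are synthetic placeholders, and the real pipeline (GloVe/GPT-2 feature extraction, alignment to scan times, cross-validation) is considerably more involved.

```python
# Generic encoding-model recipe: ridge-regress voxel time courses on
# language-model features and score held-out prediction. All arrays are
# synthetic placeholders; shuffle=False keeps the temporal split honest.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 768))  # model-derived features, one row per scan
y = rng.standard_normal((300, 10))   # BOLD signal for 10 voxels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)
model = RidgeCV(alphas=np.logspace(-2, 4, 7)).fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))
```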
Affiliation(s)
- Alexandre Pasquiou
  - Cognitive Neuroimaging Unit (UNICOG), NeuroSpin, National Institute of Health and Medical Research (Inserm) and French Alternative Energies and Atomic Energy Commission (CEA), Frédéric Joliot Life Sciences Institute, Paris-Saclay University, Gif-sur-Yvette, France
  - Models and Inference for Neuroimaging Data (MIND), NeuroSpin, French Alternative Energies and Atomic Energy Commission (CEA), Inria Saclay, Frédéric Joliot Life Sciences Institute, Paris-Saclay University, Gif-sur-Yvette, France
- Yair Lakretz
  - Cognitive Neuroimaging Unit (UNICOG), NeuroSpin, National Institute of Health and Medical Research (Inserm) and French Alternative Energies and Atomic Energy Commission (CEA), Frédéric Joliot Life Sciences Institute, Paris-Saclay University, Gif-sur-Yvette, France
- Bertrand Thirion
  - Models and Inference for Neuroimaging Data (MIND), NeuroSpin, French Alternative Energies and Atomic Energy Commission (CEA), Inria Saclay, Frédéric Joliot Life Sciences Institute, Paris-Saclay University, Gif-sur-Yvette, France
- Christophe Pallier
  - Cognitive Neuroimaging Unit (UNICOG), NeuroSpin, National Institute of Health and Medical Research (Inserm) and French Alternative Energies and Atomic Energy Commission (CEA), Frédéric Joliot Life Sciences Institute, Paris-Saclay University, Gif-sur-Yvette, France
45
Watari T, Takagi S, Sakaguchi K, Nishizaki Y, Shimizu T, Yamamoto Y, Tokuda Y. Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study. JMIR Med Educ 2023; 9:e52202. [PMID: 38055323 PMCID: PMC10733815 DOI: 10.2196/52202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/25/2023] [Revised: 10/22/2023] [Accepted: 11/03/2023] [Indexed: 12/07/2023]
Abstract
BACKGROUND The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. OBJECTIVE This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). METHODS We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents' correct response rates. RESULTS Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the "specific diseases," 30.9 points higher in "obstetrics and gynecology," and 26.1 points higher in "internal medicine." In contrast, GPT-4 scores in "medical interviewing and professionalism," "general practice," and "psychiatry" were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). CONCLUSIONS In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice.
Affiliation(s)
- Takashi Watari
  - General Medicine Center, Shimane University Hospital, Izumo, Japan
  - Department of Medicine, University of Michigan Medical School, Ann Arbor, MI, United States
  - Medicine Service, VA Ann Arbor Healthcare System, Ann Arbor, MI, United States
- Soshi Takagi
  - Faculty of Medicine, Shimane University, Izumo, Japan
- Kota Sakaguchi
  - General Medicine Center, Shimane University Hospital, Izumo, Japan
- Yuji Nishizaki
  - Division of Medical Education, Juntendo University School of Medicine, Tokyo, Japan
- Taro Shimizu
  - Department of Diagnostic and Generalist Medicine, Dokkyo Medical University Hospital, Tochigi, Japan
- Yu Yamamoto
  - Division of General Medicine, Center for Community Medicine, Jichi Medical University, Tochigi, Japan
- Yasuharu Tokuda
  - Muribushi Okinawa Project for Teaching Hospitals, Okinawa, Japan
46
Buhr CR, Smith H, Huppertz T, Bahr-Hamm K, Matthias C, Blaikie A, Kelsey T, Kuhn S, Eckrich J. ChatGPT Versus Consultants: Blinded Evaluation on Answering Otorhinolaryngology Case-Based Questions. JMIR Med Educ 2023; 9:e49183. [PMID: 38051578 PMCID: PMC10731554 DOI: 10.2196/49183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/20/2023] [Revised: 07/20/2023] [Accepted: 10/20/2023] [Indexed: 12/07/2023]
Abstract
BACKGROUND Large language models (LLMs), such as ChatGPT (OpenAI), are increasingly used in medicine and supplement standard search engines as information sources. This leads to more "consultations" of LLMs about personal medical symptoms. OBJECTIVE This study aims to evaluate ChatGPT's performance in answering clinical case-based questions in otorhinolaryngology (ORL) in comparison to ORL consultants' answers. METHODS We used 41 case-based questions from established ORL study books and past German state examinations for doctors. The questions were answered by both ORL consultants and ChatGPT 3. ORL consultants rated all responses, except their own, on medical adequacy, conciseness, coherence, and comprehensibility using a 6-point Likert scale. They also identified (in a blinded setting) whether the answer was created by an ORL consultant or ChatGPT. Additionally, the character count was compared. Given the rapidly evolving pace of the technology, a comparison between responses generated by ChatGPT 3 and ChatGPT 4 was included to give an insight into the evolving potential of LLMs. RESULTS Ratings in all categories were significantly higher for ORL consultants (P<.001). Although inferior to the scores of the ORL consultants, ChatGPT's scores were relatively higher in the semantic categories (conciseness, coherence, and comprehensibility) than in medical adequacy. ORL consultants correctly identified ChatGPT as the source in 98.4% (121/123) of cases. ChatGPT's answers had a significantly higher character count than those of the ORL consultants (P<.001). The comparison between responses generated by ChatGPT 3 and ChatGPT 4 showed a slight improvement in medical accuracy as well as better coherence of the answers provided. In contrast, neither conciseness (P=.06) nor comprehensibility (P=.08) improved significantly, despite a significant 52.5% increase in mean character count ((1470-964)/964; P<.001). CONCLUSIONS While ChatGPT provided longer answers to medical problems, medical adequacy and conciseness were significantly lower compared with ORL consultants' answers. LLMs have potential as augmentative tools for medical care, but their "consultation" for medical problems carries a high risk of misinformation, as their high semantic quality may mask contextual deficits.
Affiliation(s)
- Christoph Raphael Buhr
  - Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
  - School of Medicine, University of St Andrews, St Andrews, United Kingdom
- Harry Smith
  - School of Computer Science, University of St Andrews, St Andrews, United Kingdom
- Tilman Huppertz
  - Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Katharina Bahr-Hamm
  - Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Christoph Matthias
  - Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
- Andrew Blaikie
  - School of Medicine, University of St Andrews, St Andrews, United Kingdom
- Tom Kelsey
  - School of Computer Science, University of St Andrews, St Andrews, United Kingdom
- Sebastian Kuhn
  - Institute of Digital Medicine, Philipps-University Marburg and University Hospital of Giessen and Marburg, Marburg, Germany
- Jonas Eckrich
  - Department of Otorhinolaryngology, University Medical Center of the Johannes Gutenberg-University Mainz, Mainz, Germany
47
Le Mens G, Kovács B, Hannan MT, Pros G. Uncovering the semantics of concepts using GPT-4. Proc Natl Acad Sci U S A 2023; 120:e2309350120. [PMID: 38032930 DOI: 10.1073/pnas.2309350120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2023] [Accepted: 10/13/2023] [Indexed: 12/02/2023] Open
Abstract
The ability of recent Large Language Models (LLMs) such as GPT-3.5 and GPT-4 to generate human-like texts suggests that social scientists could use these LLMs to construct measures of semantic similarity that match human judgment. In this article, we provide an empirical test of this intuition. We use GPT-4 to construct a measure of typicality-the similarity of a text document to a concept. We evaluate its performance against other model-based typicality measures in terms of the correlation with human typicality ratings. We conduct this comparative analysis in two domains: the typicality of books in literary genres (using an existing dataset of book descriptions) and the typicality of tweets authored by US Congress members in the Democratic and Republican parties (using a novel dataset). The typicality measure produced with GPT-4 meets or exceeds the performance of the previous state-of-the-art typicality measure we introduced in a recent paper [G. Le Mens, B. Kovács, M. T. Hannan, G. Pros Rius, Sociol. Sci. 2023, 82-117 (2023)]. It accomplishes this without any training with the research data (it is zero-shot learning). This is a breakthrough because the previous state-of-the-art measure required fine-tuning an LLM on hundreds of thousands of text documents to achieve its performance.
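Zero-shot measurement of this kind amounts to a carefully worded prompt. A minimal sketch, assuming the OpenAI Python client; the prompt wording and model name are illustrative, not the authors' exact protocol:

```python
# Zero-shot typicality scoring with a single prompt. Model name and
# prompt wording are illustrative assumptions; parsing assumes the
# model complies with the "number only" instruction.
from openai import OpenAI

client = OpenAI()

def typicality(text: str, concept: str) -> float:
    prompt = (
        f"On a scale from 0 to 100, how typical is the following text of "
        f"the concept '{concept}'? Reply with a number only.\n\n{text}"
    )
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(reply.choices[0].message.content.strip())

# e.g. correlate [typicality(d, "science fiction") for d in docs] with human ratings
```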
Affiliation(s)
- Gaël Le Mens
  - Department of Economics and Business, Universitat Pompeu Fabra (UPF), Barcelona School of Economics (BSE), UPF-Barcelona School of Management, Barcelona 08005, Spain
- Balázs Kovács
  - Yale School of Management, Yale University, New Haven, CT 06520
- Michael T Hannan
  - Stanford Graduate School of Business, Stanford University, Stanford, CA 94305
- Guillem Pros
  - Department of Economics and Business, Universitat Pompeu Fabra (UPF), Barcelona School of Economics (BSE), UPF-Barcelona School of Management, Barcelona 08005, Spain
48
Thirunavukarasu AJ. How Can the Clinical Aptitude of AI Assistants Be Assayed? J Med Internet Res 2023; 25:e51603. [PMID: 38051572 PMCID: PMC10731545 DOI: 10.2196/51603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Revised: 09/28/2023] [Accepted: 11/20/2023] [Indexed: 12/07/2023] Open
Abstract
Large language models (LLMs) are exhibiting remarkable performance in clinical contexts, with exemplar results ranging from expert-level attainment in medical examination questions to superior accuracy and relevance when responding to patient queries compared to real doctors replying to queries on social media. The deployment of LLMs in conventional health care settings is yet to be reported, and there remains an open question as to what evidence should be required before such deployment is warranted. Early validation studies use unvalidated surrogate variables to represent clinical aptitude, and it may be necessary to conduct prospective randomized controlled trials to justify the use of an LLM for clinical advice or assistance, as potential pitfalls and pain points cannot be exhaustively predicted. This viewpoint states that as LLMs continue to revolutionize the field, there is an opportunity to improve the rigor of artificial intelligence (AI) research to reward innovation, conferring real benefits to real patients.
Affiliation(s)
- Arun James Thirunavukarasu
  - Oxford University Clinical Academic Graduate School, University of Oxford, Oxford, United Kingdom
  - School of Clinical Medicine, University of Cambridge, Cambridge, United Kingdom
49
Mishra C, Verdonschot R, Hagoort P, Skantze G. Real-time emotion generation in human-robot dialogue using large language models. Front Robot AI 2023; 10:1271610. [PMID: 38106543 PMCID: PMC10722897 DOI: 10.3389/frobt.2023.1271610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Accepted: 11/08/2023] [Indexed: 12/19/2023] Open
Abstract
Affective behaviors enable social robots to not only establish better connections with humans but also serve as a tool for the robots to express their internal states. It is well established that emotions are important for signaling understanding in Human-Robot Interaction (HRI). This work aims to harness the power of Large Language Models (LLMs) and proposes an approach to control the affective behavior of robots. By interpreting emotion appraisal as an Emotion Recognition in Conversation (ERC) task, we used GPT-3.5 to predict the emotion of a robot's turn in real-time, using the dialogue history of the ongoing conversation. The robot signaled the predicted emotion using facial expressions. The model was evaluated in a within-subjects user study (N = 47) in which the model-driven emotion generation was compared against conditions where the robot did not display any emotions and where it displayed incongruent emotions. The participants interacted with the robot by playing a card sorting game that was specifically designed to evoke emotions. The results indicated that the emotions were reliably generated by the LLM and that the participants were able to perceive the robot's emotions. The robot was perceived as significantly more human-like and emotionally appropriate when it expressed congruent, model-driven facial expressions, and it elicited a more positive impression. Participants also scored significantly better in the card sorting game when the robot displayed congruent facial expressions. From a technical perspective, the study shows that LLMs can be used to control the affective behavior of robots reliably in real-time. Additionally, our results could inform novel human-robot interactions, making robots more effective in roles where emotional interaction is important, such as therapy, companionship, or customer service.
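Framing emotion appraisal as ERC reduces, at its simplest, to prompting the model with the dialogue history and asking for one label per turn. A minimal sketch, with an assumed label set and prompt wording rather than the study's actual setup:

```python
# Emotion Recognition in Conversation as a single prompt per turn.
# Label set and prompt wording are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()
LABELS = ["happy", "sad", "surprised", "angry", "neutral"]

def next_turn_emotion(dialogue_history: list[str]) -> str:
    prompt = (
        "Dialogue so far:\n" + "\n".join(dialogue_history) +
        "\n\nWhich emotion should the robot display on its next turn? "
        f"Answer with one word from: {', '.join(LABELS)}."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = reply.choices[0].message.content.strip().lower().rstrip(".")
    return label if label in LABELS else "neutral"  # safe fallback
```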
Affiliation(s)
- Chinmaya Mishra
  - Furhat Robotics AB, Stockholm, Sweden
  - Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
- Peter Hagoort
  - Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands
  - Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, Netherlands
- Gabriel Skantze
  - Furhat Robotics AB, Stockholm, Sweden
  - Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden
50
de Zarzà I, de Curtò J, Roig G, Calafate CT. LLM Multimodal Traffic Accident Forecasting. Sensors (Basel) 2023; 23:9225. [PMID: 38005612 PMCID: PMC10674612 DOI: 10.3390/s23229225] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/20/2023] [Revised: 11/10/2023] [Accepted: 11/15/2023] [Indexed: 11/26/2023]
Abstract
With the rise in traffic congestion in urban centers, predicting accidents has become paramount for city planning and public safety. This work comprehensively studied the efficacy of modern deep learning (DL) methods in forecasting traffic accidents and enhancing Level-4 and Level-5 (L-4 and L-5) driving assistants with actionable visual and language cues. Using a rich dataset detailing accident occurrences, we juxtaposed the Transformer model against traditional time series models like ARIMA and the more recent Prophet model. Additionally, through detailed analysis, we delved deep into feature importance using principal component analysis (PCA) loadings, uncovering key factors contributing to accidents. We introduce the idea of using real-time interventions with large language models (LLMs) in autonomous driving, with the use of lightweight compact LLMs like LLaMA-2 and Zephyr-7b-α. Our exploration extends to the realm of multimodality through the use of the Large Language-and-Vision Assistant (LLaVA), a bridge between visual and linguistic cues by means of a Visual Language Model (VLM), in conjunction with deep probabilistic reasoning, enhancing the real-time responsiveness of autonomous driving systems. In this study, we elucidate the advantages of employing large multimodal models within DL and deep probabilistic programming for enhancing the performance and usability of time series forecasting and feature weight importance, particularly in a self-driving scenario. This work paves the way for safer, smarter cities, underpinned by data-driven decision making.
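Of the baselines compared, ARIMA is the most compact to demonstrate. The sketch below fits one to a synthetic monthly accident-count series and produces a six-step forecast, assuming statsmodels; the study's real data, Transformer and Prophet baselines, and PCA-based feature analysis are beyond this illustration.

```python
# ARIMA baseline on a synthetic monthly accident-count series.
# Assumes statsmodels; order (1, 0, 1) is an arbitrary illustrative choice.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
counts = np.maximum(0, 20 + 5 * np.sin(np.arange(120) / 12) + rng.normal(0, 3, 120))
series = pd.Series(counts, index=pd.date_range("2014-01-01", periods=120, freq="MS"))

fit = ARIMA(series, order=(1, 0, 1)).fit()
print(fit.forecast(steps=6))  # six-month-ahead forecast
```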
Affiliation(s)
- I. de Zarzà
  - Informatik und Mathematik, GOETHE-University Frankfurt am Main, 60323 Frankfurt am Main, Germany
  - Departamento de Informática de Sistemas y Computadores, Universitat Politècnica de València, 46022 València, Spain
  - Estudis d’Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, 08018 Barcelona, Spain
- J. de Curtò
  - Informatik und Mathematik, GOETHE-University Frankfurt am Main, 60323 Frankfurt am Main, Germany
  - Departamento de Informática de Sistemas y Computadores, Universitat Politècnica de València, 46022 València, Spain
  - Estudis d’Informàtica, Multimèdia i Telecomunicació, Universitat Oberta de Catalunya, 08018 Barcelona, Spain
- Gemma Roig
  - Informatik und Mathematik, GOETHE-University Frankfurt am Main, 60323 Frankfurt am Main, Germany
  - HESSIAN Center for AI (hessian.AI), 64289 Darmstadt, Germany
- Carlos T. Calafate
  - Departamento de Informática de Sistemas y Computadores, Universitat Politècnica de València, 46022 València, Spain