1. Wang L, Wan Z, Ni C, Song Q, Li Y, Clayton EW, Malin BA, Yin Z. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv [Preprint] 2024:2024.04.26.24306390. PMID: 38712148; PMCID: PMC11071576; DOI: 10.1101/2024.04.26.24306390.
Abstract
Background The launch of the Chat Generative Pre-trained Transformer (ChatGPT) in November 2022 has drawn public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including healthcare, and numerous studies have since examined how to employ state-of-the-art LLMs in health-related scenarios to assist patients, doctors, and public health administrators. Objective This review aims to summarize the applications of, and concerns about, conversational LLMs in healthcare and to provide an agenda for future research on LLMs in healthcare. Methods We used the PubMed, ACM, and IEEE digital libraries as primary sources for this review. Following the guidance of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), we screened and selected peer-reviewed research articles that (1) were related to both healthcare applications and conversational LLMs and (2) were published before September 1st, 2023, the date when paper collection and screening began. We investigated these papers and classified them according to their applications and concerns. Results Our search initially identified 820 papers matching the targeted keywords, of which 65 met our criteria and were included in the review. The most commonly studied conversational LLM was ChatGPT from OpenAI (60), followed by Bard from Google (1), Large Language Model Meta AI (LLaMA) from Meta (1), and other LLMs (5). The papers fell into four application categories: 1) summarization, 2) medical knowledge inquiry, 3) prediction, and 4) administration, and four categories of concerns: 1) reliability, 2) bias, 3) privacy, and 4) public acceptability. Forty-nine (75%) of the papers used LLMs for summarization and/or medical knowledge inquiry, and 58 (89%) expressed concerns about reliability and/or bias. We found that conversational LLMs show promising results in summarization and in providing medical knowledge to patients with relatively high accuracy. However, conversational LLMs such as ChatGPT are not able to provide reliable answers to complex health-related tasks that require specialized domain expertise. Additionally, none of the reviewed papers conducted experiments to rigorously examine how conversational LLMs lead to bias or privacy issues in healthcare research. Conclusions Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms by which LLM applications introduce bias and privacy issues. Given the broad accessibility of LLMs, legal, social, and technical efforts are all needed to address these concerns and to promote, improve, and regulate the application of LLMs in healthcare.
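To make the screening workflow concrete, here is a minimal sketch of the keyword-identification step, assuming a PubMed query through the NCBI E-utilities API; the query terms and date window are illustrative placeholders, not the authors' actual search strategy.

```python
import requests

# NCBI E-utilities search endpoint (public API).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical keyword query; the review's actual search string is not reproduced here.
query = (
    '("large language model" OR ChatGPT OR "conversational agent") '
    'AND (healthcare OR clinical OR medicine)'
)

params = {
    "db": "pubmed",
    "term": query,
    "datetype": "pdat",
    "mindate": "2022/11/01",   # illustrative lower bound (ChatGPT launch)
    "maxdate": "2023/09/01",   # papers published before screening began
    "retmax": 1000,
    "retmode": "json",
}

resp = requests.get(ESEARCH, params=params, timeout=30)
resp.raise_for_status()
pmids = resp.json()["esearchresult"]["idlist"]
print(f"{len(pmids)} candidate records to screen against the inclusion criteria")
```

Records returned this way would then be screened manually against the inclusion criteria and classified by application and concern.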
Affiliation(s)
- Leyao Wang: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Zhiyu Wan: Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Congning Ni: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Qingyuan Song: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Yang Li: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Ellen Wright Clayton: Department of Pediatrics and Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Bradley A. Malin: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Departments of Biomedical Informatics and Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Zhijun Yin: Department of Computer Science, Vanderbilt University, Nashville, TN 37212, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
2. Dyckhoff-Shen S, Koedel U, Brouwer MC, Bodilsen J, Klein M. ChatGPT fails challenging the recent ESCMID brain abscess guideline. J Neurol 2024;271:2086-2101. PMID: 38279999; PMCID: PMC10972965; DOI: 10.1007/s00415-023-12168-1.
Abstract
BACKGROUND With artificial intelligence (AI) on the rise, it remains unclear whether AI can professionally evaluate medical research and give scientifically valid recommendations. AIM This study aimed to assess the accuracy of ChatGPT's responses to ten key questions on brain abscess diagnostics and treatment in comparison with the guideline recently published by the European Society for Clinical Microbiology and Infectious Diseases (ESCMID). METHODS All ten PECO (Population, Exposure, Comparator, Outcome) questions that had been developed during the guideline process were presented directly to ChatGPT. Next, ChatGPT was additionally provided with data from the studies selected for each PECO question by the ESCMID committee. ChatGPT's responses were then compared with the recommendations of the ESCMID guideline. RESULTS For 17 out of 20 challenges, ChatGPT was able to give recommendations on the management of patients with brain abscess, including a grade of evidence and strength of recommendation. Without data prompting, 70% of the questions were answered very similarly to the guideline recommendations, and none of the answers that differed from the guideline posed a hazard to patients. Data input slightly improved the clarity of ChatGPT's recommendations but led to fewer correct answers, including two recommendations that directly contradicted the guideline and carried a potential hazard to the patient. CONCLUSION ChatGPT appears able to rapidly gather information on brain abscesses and, in most cases, give recommendations on key questions about their management. Nevertheless, single responses could potentially harm patients. Thus, the expertise of an expert committee remains indispensable.
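As a rough illustration of the two prompting conditions described above (question alone versus question plus study data), the sketch below uses the OpenAI Python client; the model name, system prompt, and PECO wording are assumptions for illustration, since the study queried ChatGPT interactively rather than through code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_chatgpt(peco_question: str, study_data: str | None = None) -> str:
    """Ask one PECO question, optionally prepending the selected study data."""
    user_content = (
        peco_question
        if study_data is None
        else f"Relevant studies:\n{study_data}\n\nQuestion:\n{peco_question}"
    )
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study used the ChatGPT web interface
        messages=[
            {"role": "system",
             "content": "Answer as a clinical guideline panel: give a recommendation, "
                        "a grade of evidence, and a strength of recommendation."},
            {"role": "user", "content": user_content},
        ],
    )
    return response.choices[0].message.content

# Hypothetical PECO question, not the guideline's actual wording.
question = ("In adult patients with brain abscess (P), does adjunctive glucocorticoid therapy (E) "
            "compared with no glucocorticoids (C) improve outcome (O)?")
without_data = ask_chatgpt(question)
with_data = ask_chatgpt(question, study_data="(abstracts of the studies selected by the committee)")
```

Comparing the two returned answers against the published recommendation mirrors the study's evaluation of prompting with and without data input.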
Affiliation(s)
- Susanne Dyckhoff-Shen: Department of Neurology with Friedrich-Baur-Institute, LMU University Hospital, Klinikum Grosshadern of the Ludwig Maximilians University of Munich, Marchioninistr. 15, 81377 Munich, Germany
- Uwe Koedel: Department of Neurology with Friedrich-Baur-Institute, LMU University Hospital, Klinikum Grosshadern of the Ludwig Maximilians University of Munich, Marchioninistr. 15, 81377 Munich, Germany
- Matthijs C Brouwer: Department of Neurology, Amsterdam UMC, University of Amsterdam, Amsterdam Neuroscience, Amsterdam, The Netherlands; European Society for Clinical Microbiology and Infectious Diseases (ESCMID) Study Group for Infections of the Brain (ESGIB), Basel, Switzerland
- Jacob Bodilsen: Department of Infectious Diseases, Aalborg University Hospital, Aalborg, Denmark; ESCMID Study Group for Infections of the Brain (ESGIB), Basel, Switzerland
- Matthias Klein: Department of Neurology with Friedrich-Baur-Institute, LMU University Hospital, Klinikum Grosshadern of the Ludwig Maximilians University of Munich, Marchioninistr. 15, 81377 Munich, Germany; Emergency Department, LMU University Hospital, Munich, Germany; ESCMID Study Group for Infections of the Brain (ESGIB), Basel, Switzerland
3. Wong M, Lim ZW, Pushpanathan K, Cheung CY, Wang YX, Chen D, Tham YC. Review of emerging trends and projection of future developments in large language models research in ophthalmology. Br J Ophthalmol 2023:bjo-2023-324734. PMID: 38164563; DOI: 10.1136/bjo-2023-324734.
Abstract
BACKGROUND Large language models (LLMs) are fast emerging as potent tools in healthcare, including ophthalmology. This systematic review offers a twofold contribution: it summarises current trends in ophthalmology-related LLM research and projects future directions for this burgeoning field. METHODS We systematically searched various databases (PubMed, Europe PMC, Scopus and Web of Science) for articles related to LLM use in ophthalmology, published between 1 January 2022 and 31 July 2023. Selected articles were summarised and categorised by type (editorial, commentary, original research, etc) and by research focus (eg, evaluating ChatGPT's performance in ophthalmology examinations or clinical tasks). FINDINGS We identified 32 articles meeting our criteria, published between January and July 2023, with a peak in June (n=12). Most were original research evaluating LLMs' proficiency in clinically related tasks (n=9). Studies demonstrated that ChatGPT-4.0 outperformed its predecessor, ChatGPT-3.5, in ophthalmology exams. Furthermore, ChatGPT excelled in constructing discharge notes (n=2), evaluating diagnoses (n=2) and answering general medical queries (n=6). However, it struggled with generating scientific articles or abstracts (n=3) and answering specific subdomain questions, especially those regarding specific treatment options (n=2). ChatGPT's performance relative to other LLMs (Google's Bard, Microsoft's Bing) varied by study design. Ethical concerns such as data hallucination (n=27), authorship (n=5) and data privacy (n=2) were frequently cited. INTERPRETATION While LLMs hold transformative potential for healthcare and ophthalmology, concerns over accountability, accuracy and data security remain. Future research should focus on application programming interface integration, comparative assessments of popular LLMs, their ability to interpret image-based data and the establishment of standardised evaluation frameworks.
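For illustration only, the snippet below shows one way the categorisation and counting step described above could be reproduced in code; the records and category labels are fabricated placeholders, not data from the review.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    article_type: str     # e.g. "original research", "editorial", "commentary"
    research_focus: str   # e.g. "exam performance", "clinical tasks", "scientific writing"

# Placeholder records purely for illustration; not the 32 articles from the review.
articles = [
    Article("LLMs on ophthalmology board-style questions", "original research", "exam performance"),
    Article("ChatGPT-drafted discharge notes", "original research", "clinical tasks"),
    Article("Authorship and LLM-generated text", "editorial", "ethics and authorship"),
]

counts_by_type = Counter(a.article_type for a in articles)
counts_by_focus = Counter(a.research_focus for a in articles)
print("By type:", counts_by_type.most_common())
print("By focus:", counts_by_focus.most_common())
```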
Affiliation(s)
- Zhi Wei Lim: Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Krithi Pushpanathan: Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
- Carol Y Cheung: Ophthalmology and Visual Sciences, The Chinese University of Hong Kong, Hong Kong
- Ya Xing Wang: Beijing Institute of Ophthalmology, Beijing Tongren Hospital, Capital University of Medical Science, Beijing, China
- David Chen: Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Department of Ophthalmology, National University Hospital, Singapore
- Yih Chung Tham: Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Centre for Innovation and Precision Eye Health & Department of Ophthalmology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore