1.
Wu J, Ma Y, Wang J, Xiao M. The Application of ChatGPT in Medicine: A Scoping Review and Bibliometric Analysis. J Multidiscip Healthc 2024; 17:1681-1692. [PMID: 38650670 PMCID: PMC11034560 DOI: 10.2147/jmdh.s463128] [Received: 02/14/2024] [Accepted: 03/25/2024]
Abstract
Purpose: ChatGPT has a wide range of applications in the medical field. This review therefore aims to define the key issues and provide a comprehensive view of the literature on the application of ChatGPT in medicine.
Methods: This scoping review follows Arksey and O'Malley's five-stage framework. A comprehensive literature search of publications (30 November 2022 to 16 August 2023) was conducted across six databases, and relevant references were systematically catalogued. Attention was focused on the general characteristics of the articles, their fields of application, and the advantages and disadvantages of using ChatGPT. Descriptive statistics and narrative synthesis were used for data analysis.
Results: Of 3,426 studies, 247 met the inclusion criteria for this review. The largest share of articles (31.17%) came from the United States. Editorials (43.32%) ranked first, followed by experimental studies (11.74%). The potential applications of ChatGPT in medicine are varied: the largest number of studies (45.75%) explored clinical practice, including support for clinical decision-making and the provision of disease information and medical advice, followed by medical education (27.13%) and scientific research (16.19%). In the discipline-level statistics, radiology, surgery and dentistry topped the list. However, ChatGPT in medicine also faces issues of data privacy, inaccuracy and plagiarism.
Conclusion: The application of ChatGPT in medicine spans multiple disciplines and general application scenarios. ChatGPT has a paradoxical nature: it offers significant advantages while raising serious concerns about its use in healthcare settings. It is therefore imperative to develop theoretical frameworks that not only address its widespread use in healthcare but also facilitate comprehensive assessment. These frameworks should also inform the development of strict and effective guidelines and regulatory measures.
Affiliation(s)
- Jie Wu: Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
- Yingzhuo Ma: Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
- Jun Wang: Department of Nursing, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
- Mingzhao Xiao: Department of Urology, the First Affiliated Hospital of Chongqing Medical University, Chongqing, People’s Republic of China
2.
Du X, Novoa-Laurentiev J, Plasek JM, Chuang YW, Wang L, Chang F, Datta S, Paek H, Lin B, Wei Q, Wang X, Wang J, Ding H, Manion FJ, Du J, Zhou L. Enhancing Early Detection of Cognitive Decline in the Elderly through Ensemble of NLP Techniques: A Comparative Study Utilizing Large Language Models in Clinical Notes. medRxiv 2024:2024.04.03.24305298. [PMID: 38633810 PMCID: PMC11023645 DOI: 10.1101/2024.04.03.24305298]
Abstract
Background: Early detection of cognitive decline in elderly individuals facilitates clinical trial enrollment and timely medical intervention. This study aims to apply, evaluate, and compare advanced natural language processing (NLP) techniques for identifying signs of cognitive decline in clinical notes.
Methods: This study, conducted at Mass General Brigham (MGB), Boston, MA, included clinical notes from the 4 years prior to an initial mild cognitive impairment (MCI) diagnosis in 2019 for patients aged ≥ 50 years. Note sections regarding cognitive decline were labeled manually. A random sample of 4,949 note sections filtered with keywords related to cognitive function was used for traditional AI model development, and a random subset of 200 was used for LLM and prompt development; another random sample of 1,996 note sections without keyword filtering was used for testing. Prompt templates for two large language models (LLMs), Llama 2 on Amazon Web Services and GPT-4 on Microsoft Azure, were developed with multiple prompting approaches to select the optimal LLM-based method. Baseline comparisons were made with XGBoost and a hierarchical attention-based deep neural network model. An ensemble of the three models was then constructed using majority vote.
Results: GPT-4 demonstrated superior accuracy and efficiency to Llama 2. The ensemble model outperformed the individual models, achieving a precision of 90.3%, recall of 94.2%, and F1-score of 92.2%. Notably, the ensemble model markedly improved precision (from a 70%-79% range to above 90%) over the best-performing single model. Error analysis revealed that 63 samples were wrongly predicted by at least one model; however, only 2 cases (3.2%) were mutual errors across all models, indicating diverse error profiles among them.
Conclusion: Our findings indicate that LLMs and traditional machine learning models exhibit diverse error profiles. An ensemble of LLMs and locally trained machine learning models on EHR data proved complementary, enhancing performance and improving diagnostic accuracy.
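The majority-vote ensembling described in this abstract can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the model names and binary labels below are hypothetical stand-ins for the three classifiers' predictions.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by most models for each sample.

    `predictions` is a list of per-model label lists, e.g. three
    models' binary cognitive-decline labels for the same note sections.
    """
    voted = []
    for labels in zip(*predictions):          # one tuple of votes per sample
        winner, _ = Counter(labels).most_common(1)[0]
        voted.append(winner)
    return voted

# Three hypothetical models disagree on two samples; the ensemble
# follows the two-vote majority in each case.
gpt4_preds   = [1, 0, 1, 0]
llama2_preds = [1, 0, 0, 0]
xgb_preds    = [1, 1, 0, 0]
print(majority_vote([gpt4_preds, llama2_preds, xgb_preds]))  # → [1, 0, 0, 0]
```

With an odd number of models and binary labels, a strict majority always exists, which is why three-model ensembles are a common choice for this scheme.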
Affiliation(s)
- Xinsong Du: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- John Novoa-Laurentiev: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115
- Joseph M. Plasek: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Ya-Wen Chuang: Division of Nephrology, Taichung Veterans General Hospital, Taichung, Taiwan, 407219
- Liqin Wang: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
- Frank Chang: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115
- Surabhi Datta: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Hunki Paek: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Bin Lin: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Qiang Wei: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Xiaoyan Wang: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Jingqi Wang: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Hao Ding: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Jingcheng Du: Intelligent Medical Objects, Rosemont, Illinois, 60018
- Li Zhou: Division of General Internal Medicine and Primary Care, Brigham and Women's Hospital, Boston, Massachusetts 02115; Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115
3.
Carlà MM, Gambini G, Baldascino A, Boselli F, Giannuzzi F, Margollicci F, Rizzo S. Large language models as assistance for glaucoma surgical cases: a ChatGPT vs. Google Gemini comparison. Graefes Arch Clin Exp Ophthalmol 2024:10.1007/s00417-024-06470-5. [PMID: 38573349 DOI: 10.1007/s00417-024-06470-5] [Received: 01/22/2024] [Revised: 03/11/2024] [Accepted: 03/20/2024]
Abstract
PURPOSE: The aim of this study was to assess the capability of ChatGPT-4 and Google Gemini to analyze detailed glaucoma case descriptions and suggest an accurate surgical plan.
METHODS: A retrospective analysis of 60 medical records of surgical glaucoma cases, divided into "ordinary" (n = 40) and "challenging" (n = 20) scenarios. Case descriptions were entered into the ChatGPT and Gemini interfaces with the question "What kind of surgery would you perform?", repeated three times to analyze the consistency of the answers. After collecting the answers, we assessed the level of agreement with the unified opinion of three glaucoma surgeons. Moreover, we graded the quality of the responses on a scale from 1 (poor quality) to 5 (excellent quality) according to the Global Quality Score (GQS) and compared the results.
RESULTS: ChatGPT's surgical choice was consistent with that of the glaucoma specialists in 35/60 cases (58%), compared to 19/60 (32%) for Gemini (p = 0.0001). Gemini was unable to complete the task in 16 cases (27%). Trabeculectomy was the most frequent choice for both chatbots (53% and 50% for ChatGPT and Gemini, respectively). In "challenging" cases, ChatGPT agreed with the specialists in 9/20 choices (45%), outperforming Google Gemini (4/20, 20%). Overall, GQS scores were 3.5 ± 1.2 for ChatGPT and 2.1 ± 1.5 for Gemini (p = 0.002). This difference was even more marked when focusing only on "challenging" cases (3.0 ± 1.5 vs. 1.5 ± 1.4, p = 0.001).
CONCLUSION: ChatGPT-4 analyzed glaucoma surgical cases well, whether ordinary or challenging. Google Gemini, by contrast, showed strong limitations in this setting, with high rates of imprecise or missing answers.
Affiliation(s)
- Matteo Mario Carlà: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Gloria Gambini: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Antonio Baldascino: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Francesco Boselli: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Federico Giannuzzi: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Fabio Margollicci: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
- Stanislao Rizzo: Ophthalmology Department, Fondazione Policlinico Universitario A. Gemelli, IRCCS, 00168, Rome, Italy; Ophthalmology Department, Catholic University "Sacro Cuore", Largo A. Gemelli, 8, Rome, Italy
4.
Wu L, Xu J, Thakkar S, Gray M, Qu Y, Li D, Tong W. A framework enabling LLMs into regulatory environment for transparency and trustworthiness and its application to drug labeling document. Regul Toxicol Pharmacol 2024; 149:105613. [PMID: 38570021 DOI: 10.1016/j.yrtph.2024.105613] [Received: 01/02/2024] [Revised: 03/18/2024] [Accepted: 03/26/2024]
Abstract
Regulatory agencies routinely deal with extensive document reviews, ranging from product submissions to internal and external communications. Large language models (LLMs) such as ChatGPT can be invaluable tools for these tasks, but they present several challenges, particularly around proprietary information, combining customized functions with specific review needs, and the transparency and explainability of the model's output. Hence, a localized and customized solution is imperative. To tackle these challenges, we formulated a framework named AskFDALabel, built on FDA drug labeling documents, a crucial resource in the FDA drug review process. AskFDALabel operates within a secure IT environment and comprises two key modules: a semantic search module and a Q&A/text-generation module. Module S is built on word embeddings to enable comprehensive semantic queries within labeling documents. Module T utilizes a tuned LLM to generate responses based on references from Module S. As a result, our framework enabled small LLMs to perform comparably to ChatGPT as a computationally inexpensive solution for regulatory applications. In conclusion, through AskFDALabel, we have showcased a pathway that harnesses LLMs to support agency operations within a secure environment, offering functions tailored to the needs of regulatory research.
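The embedding-based semantic search at the heart of a module like Module S can be illustrated with cosine similarity over section vectors. This is only a sketch: the section names and toy 3-d vectors below are invented, and a real system would use learned word embeddings of far higher dimension.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_search(query_vec, section_vecs, top_k=1):
    """Rank labeling-document sections by similarity to the query vector."""
    ranked = sorted(section_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [section_id for section_id, _ in ranked[:top_k]]

# Toy embeddings standing in for real word-embedding vectors of
# hypothetical labeling sections.
sections = {
    "warnings": [0.9, 0.1, 0.0],
    "dosage":   [0.1, 0.9, 0.2],
    "adverse":  [0.8, 0.2, 0.1],
}
print(semantic_search([1.0, 0.0, 0.0], sections, top_k=2))  # → ['warnings', 'adverse']
```

The top-ranked sections would then be passed to the generation module as grounding references, which is what keeps the generated answer traceable to the source document.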
Affiliation(s)
- Leihong Wu: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
- Joshua Xu: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
- Shraddha Thakkar: Office of Translational Sciences, Center for Drug Evaluation and Research (CDER), US FDA, 10903 New Hampshire Avenue, Silver Spring, MD 20993, USA
- Magnus Gray: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
- Yanyan Qu: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
- Dongying Li: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
- Weida Tong: Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, US FDA, 3900 NCTR Rd, Jefferson, AR 72211, USA
5.
Koga S. The double-edged nature of ChatGPT in self-diagnosis. Wien Klin Wochenschr 2024; 136:243-244. [PMID: 38504058 DOI: 10.1007/s00508-024-02343-3] [Received: 02/24/2024] [Accepted: 02/27/2024]
Affiliation(s)
- Shunsuke Koga: Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, USA
6.
Elyoseph Z, Levkovich I. Comparing the Perspectives of Generative AI, Mental Health Experts, and the General Public on Schizophrenia Recovery: Case Vignette Study. JMIR Ment Health 2024; 11:e53043. [PMID: 38533615 PMCID: PMC11004608 DOI: 10.2196/53043] [Received: 09/24/2023] [Revised: 01/24/2024] [Accepted: 02/11/2024]
Abstract
Background: The current paradigm in mental health care focuses on clinical recovery and symptom remission. This model's efficacy is influenced by the therapist's trust in the patient's recovery potential and by the depth of the therapeutic relationship. Schizophrenia is a chronic illness with severe symptoms in which the possibility of recovery is a matter of debate. As artificial intelligence (AI) becomes integrated into health care, it is important to examine its ability to assess recovery potential in major psychiatric disorders such as schizophrenia.
Objective: This study aimed to evaluate the ability of large language models (LLMs), in comparison with mental health professionals, to assess the prognosis of schizophrenia with and without professional treatment, as well as its long-term positive and negative outcomes.
Methods: Vignettes were input into the interfaces of 4 AI platforms (ChatGPT-3.5, ChatGPT-4, Google Bard, and Claude) and assessed 10 times each. A total of 80 evaluations were collected and benchmarked against existing norms capturing what mental health professionals (general practitioners, psychiatrists, clinical psychologists, and mental health nurses) and the general public think about schizophrenia prognosis with and without professional treatment and about the positive and negative long-term outcomes of schizophrenia interventions.
Results: For the prognosis of schizophrenia with professional treatment, ChatGPT-3.5 was notably pessimistic, whereas ChatGPT-4, Claude, and Bard aligned with professional views but differed from the general public. All LLMs held that untreated schizophrenia would remain static or worsen. For long-term outcomes, ChatGPT-4 and Claude predicted more negative outcomes than Bard and ChatGPT-3.5. For positive outcomes, ChatGPT-3.5 and Claude were more pessimistic than Bard and ChatGPT-4.
Conclusions: The finding that 3 of the 4 LLMs aligned closely with the predictions of mental health professionals in the "with treatment" condition demonstrates the technology's potential for providing professional clinical prognoses. The pessimistic assessments of ChatGPT-3.5 are a disturbing finding, as they could reduce patients' motivation to start or persist with treatment for schizophrenia. Overall, although LLMs hold promise for augmenting health care, their application requires rigorous validation and a harmonious blend with human expertise.
Affiliation(s)
- Zohar Elyoseph: Department of Brain Sciences, Faculty of Medicine, Imperial College London, London, United Kingdom; The Center for Psychobiological Research, Department of Psychology and Educational Counseling, Max Stern Yezreel Valley College, Emek Yezreel, Israel
- Inbar Levkovich: Faculty of Graduate Studies, Oranim Academic College, Kiryat Tiv'on, Israel
7.
Raman R, Venugopalan M, Kamal A. Evaluating human resources management literacy: A performance analysis of ChatGPT and Bard. Heliyon 2024; 10:e27026. [PMID: 38486738 PMCID: PMC10937570 DOI: 10.1016/j.heliyon.2024.e27026] [Received: 09/24/2023] [Revised: 02/16/2024] [Accepted: 02/22/2024]
Abstract
This study presents a comprehensive analysis comparing the literacy levels of two generative artificial intelligence (GAI) tools, ChatGPT and Bard, using a dataset of 134 questions from the human resources (HR) domain. The generated responses are evaluated for accuracy, relevance, and clarity. We find that ChatGPT outperforms Bard in overall accuracy (84.3% vs. 82.8%). This difference in performance suggests that ChatGPT could serve as a robotic advisor in transactional HR roles. In contrast, Bard may have additional safeguards against misuse in the HR function, making it less willing to generate responses to certain types of questions. Statistical tests reveal that although the two systems differ in their mean accuracy, relevance, and clarity, the observed differences are not always statistically significant, implying that the tools may be more complementary than competitive. The Pearson correlation coefficients further support this, showing weak to non-existent relationships between the two tools' performance metrics. Confirmation queries do not improve the response accuracy of either ChatGPT or Bard. The study thus contributes to emerging research on the utility of GAI tools in human resources management and suggests that involving certified HR professionals in the design phase could enhance underlying language model performance.
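The "weak to non-existent relationships" reported via Pearson correlation can be illustrated with a small sketch. The per-question scores below are hypothetical, not the study's data; they merely show how near-zero correlation looks when two tools' correct/incorrect patterns are unrelated.

```python
import math

def pearson(x, y):
    """Pearson correlation between two tools' per-question scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-question correctness (1 = accurate, 0 = not):
# the two tools succeed on largely unrelated questions, so the
# correlation comes out near zero.
chatgpt_scores = [1, 1, 0, 1, 0, 1]
bard_scores    = [1, 0, 1, 1, 0, 0]
print(round(pearson(chatgpt_scores, bard_scores), 3))
```

A coefficient near zero, as here, is the statistical signature of the complementarity the authors describe: neither tool's successes predict the other's.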
Affiliation(s)
- Raghu Raman: Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Murale Venugopalan: Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India
- Anju Kamal: Amrita School of Business, Amrita Vishwa Vidyapeetham, Amritapuri, India
8.
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, D'Onofrio NC, Rizzo S. Exploring AI-chatbots' capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol 2024:bjo-2023-325143. [PMID: 38448201 DOI: 10.1136/bjo-2023-325143] [Received: 12/31/2023] [Accepted: 02/16/2024]
Abstract
BACKGROUND: We aimed to assess the capability of three publicly available large language models, Chat Generative Pretrained Transformer (ChatGPT-3.5), ChatGPT-4 and Google Gemini, to analyze retinal detachment cases and suggest the best possible surgical planning.
METHODS: Analysis of 54 retinal detachment records entered into the ChatGPT and Gemini interfaces. After asking 'Specify what kind of surgical planning you would suggest and the eventual intraocular tamponade.' and collecting the answers, we assessed the level of agreement with the common opinion of three expert vitreoretinal surgeons. Moreover, ChatGPT and Gemini answers were graded 1-5 (from poor to excellent quality) according to the Global Quality Score (GQS).
RESULTS: After excluding 4 controversial cases, 50 cases were included. Overall, the surgical choices of ChatGPT-3.5, ChatGPT-4 and Google Gemini agreed with those of the vitreoretinal surgeons in 40/50 (80%), 42/50 (84%) and 35/50 (70%) of cases, respectively. Google Gemini was unable to respond in five cases. Contingency analysis showed significant differences between ChatGPT-4 and Gemini (p=0.03). ChatGPT's GQS scores were 3.9±0.8 and 4.2±0.7 for versions 3.5 and 4, while Gemini scored 3.5±1.1. There was no statistical difference between the two ChatGPT versions (p=0.22), while both outperformed Gemini (p=0.03 and p=0.002, respectively). The main source of error was the choice of endotamponade (14% for ChatGPT-3.5 and 4, and 12% for Google Gemini). Only ChatGPT-4 was able to suggest a combined phacovitrectomy approach.
CONCLUSION: Google Gemini and ChatGPT evaluated vitreoretinal patients' records coherently, showing a good level of agreement with expert surgeons. According to the GQS, ChatGPT's recommendations were more accurate and precise.
Affiliation(s)
- Matteo Mario Carlà: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Gloria Gambini: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Antonio Baldascino: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Federico Giannuzzi: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Francesco Boselli: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Emanuele Crincoli: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Nicola Claudio D'Onofrio: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
- Stanislao Rizzo: Ophthalmology Department, Catholic University "Sacro Cuore", Rome, Italy; Ophthalmology Department, Fondazione Policlinico Universitario "A. Gemelli", IRCCS, Rome, Italy
9.
Koga S, Du W. Integrating AI in medicine: Lessons from Chat-GPT's limitations in medical imaging. Dig Liver Dis 2024:S1590-8658(24)00275-5. [PMID: 38429138 DOI: 10.1016/j.dld.2024.02.014] [Received: 02/17/2024] [Accepted: 02/19/2024]
Affiliation(s)
- Shunsuke Koga: Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
- Wei Du: Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA 19104, United States
10.
Wang L, Chen X, Deng X, Wen H, You M, Liu W, Li Q, Li J. Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs. NPJ Digit Med 2024; 7:41. [PMID: 38378899 PMCID: PMC10879172 DOI: 10.1038/s41746-024-01029-4] [Received: 09/08/2023] [Accepted: 02/05/2024]
Abstract
The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to clinical application is crucial, and prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering with LLMs and to examine their reliability, prompts in different styles were designed and used to ask different LLMs about their agreement with the American Academy of Orthopaedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with the guidelines across evidence levels for the different prompts and assessed the reliability of each prompt by asking the same question 5 times. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and strong performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs under different prompts was not stable (Fleiss' kappa ranged from -0.002 to 0.984). This study revealed that different prompts had variable effects across models, with gpt-4-Web plus the ROT prompt being the most consistent. An appropriate prompt can improve the accuracy of responses to professional medical questions.
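The reliability statistic cited here, Fleiss' kappa over 5 repeated answers per question, can be computed as follows. This is a generic implementation of the standard formula, not the authors' code, and the answer categories and counts in the example are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N items, each rated n times.

    `counts[i][j]` is how many of the n repetitions of question i
    produced answer category j (e.g. agree / disagree / uncertain).
    """
    N = len(counts)                 # number of questions
    n = sum(counts[0])              # repetitions per question (5 here)
    k = len(counts[0])              # number of answer categories
    # Marginal proportion of each category across all ratings.
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # Per-question observed agreement among the n repetitions.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N            # mean observed agreement
    P_e = sum(p * p for p in p_j)   # chance agreement
    return (P_bar - P_e) / (1 - P_e)

# Perfect repeatability: every question gets the same answer 5/5 times,
# so kappa is 1.0; mixed answers drive it toward (or below) zero.
print(fleiss_kappa([[5, 0, 0], [0, 5, 0], [5, 0, 0]]))  # → 1.0
```

Values near 1 (like the 0.984 reported for some prompt/model pairs) mean the model answers almost identically on every repeat, while values near 0 (like -0.002) mean its repeated answers agree no better than chance.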
Affiliation(s)
- Li Wang: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Xi Chen: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- XiangWen Deng: Shenzhen International Graduate School, Tsinghua University, Beijing, China
- Hao Wen: Shenzhen International Graduate School, Tsinghua University, Beijing, China
- MingKe You: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- WeiZhi Liu: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Qi Li: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
- Jian Li: Sports Medicine Center, West China Hospital, Sichuan University, Chengdu, China; Department of Orthopedics and Orthopedic Research Institute, West China Hospital, Sichuan University, Chengdu, China
11.
García-Méndez S, de Arriba-Pérez F. Large Language Models and Healthcare Alliance: Potential and Challenges of Two Representative Use Cases. Ann Biomed Eng 2024:10.1007/s10439-024-03454-8. [PMID: 38310159 DOI: 10.1007/s10439-024-03454-8] [Received: 11/23/2023] [Accepted: 01/15/2024]
Abstract
Large language models (LLMs) have emerged as the most promising natural language processing approach for accelerating clinical practice (i.e., diagnosis, prevention and treatment procedures). Similarly, intelligent conversational systems that leverage LLMs have disruptively become the future of therapy in the era of ChatGPT. Accordingly, this research addresses the application of LLMs in healthcare, paying particular attention to two relevant use cases: cognitive decline and depression, more specifically postpartum depression. Finally, the most promising opportunities these systems represent (e.g., augmentation of clinical tasks, personalized healthcare) and related concerns (e.g., data privacy and quality, fairness) are discussed as a contribution to the global debate on their integration into the healthcare system.
12.
Iannantuono GM, Bracken-Clarke D, Karzai F, Choo-Wosoba H, Gulley JL, Floudas CS. Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study. medRxiv 2023:2023.10.31.23297825. [PMID: 38076813 PMCID: PMC10705618 DOI: 10.1101/2023.10.31.23297825]
Abstract
Background: The capability of large language models (LLMs) to understand and generate human-readable text has prompted investigation of their potential as educational and management tools for cancer patients and healthcare providers.
Materials and Methods: We conducted a cross-sectional study evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to four domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended questions (15 per domain). Questions were manually submitted to the LLMs, and responses were collected on June 30, 2023. Two reviewers evaluated the answers independently.
Results: ChatGPT-4 and ChatGPT-3.5 answered all questions, whereas Google Bard answered only 53.3% (p < 0.0001). The number of questions with reproducible answers was higher for ChatGPT-4 (95%) and ChatGPT-3.5 (88.3%) than for Google Bard (50%) (p < 0.0001). In terms of accuracy, the proportion of answers deemed fully correct was 75.4%, 58.5%, and 43.8% for ChatGPT-4, ChatGPT-3.5, and Google Bard, respectively (p = 0.03). Furthermore, the proportion of responses deemed highly relevant was 71.9%, 77.4%, and 43.8%, respectively (p = 0.04). Regarding readability, the proportion of highly readable answers was higher for ChatGPT-4 (98.1%) and ChatGPT-3.5 (100%) than for Google Bard (87.5%) (p = 0.02).
Conclusion: ChatGPT-4 and ChatGPT-3.5 are potentially powerful tools in immuno-oncology, whereas Google Bard demonstrated relatively poor performance. However, the risk of inaccuracy or incompleteness was evident in all three LLMs, highlighting the importance of expert-driven verification of the outputs returned by these technologies.
Affiliation(s)
- Giovanni Maria Iannantuono: Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Dara Bracken-Clarke: Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Fatima Karzai: Genitourinary Malignancies Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Hyoyoung Choo-Wosoba: Biostatistics and Data Management Section, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- James L. Gulley: Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States
- Charalampos S. Floudas: Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, United States