1
Siepmann R, Huppertz M, Rastkhiz A, Reen M, Corban E, Schmidt C, Wilke S, Schad P, Yüksel C, Kuhl C, Truhn D, Nebelung S. The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation. Eur Radiol 2024; 34:6652-6666. PMID: 38627289; PMCID: PMC11399201; DOI: 10.1007/s00330-024-10727-2.
Abstract
OBJECTIVES Large language models (LLMs) have shown potential in radiology, but their ability to aid radiologists in interpreting imaging studies remains unexplored. We investigated the effects of a state-of-the-art LLM (GPT-4) on radiologists' diagnostic workflow. MATERIALS AND METHODS In this retrospective study, six radiologists of different experience levels read 40 selected radiographic [n = 10], CT [n = 10], MRI [n = 10], and angiographic [n = 10] studies unassisted (session one) and assisted by GPT-4 (session two). Each imaging study was presented with demographic data, the chief complaint, and associated symptoms, and diagnoses were registered using an online survey tool. The impact of artificial intelligence (AI) on diagnostic accuracy, confidence, user experience, input prompts, and generated responses was assessed. False information was registered. Linear mixed-effects models were used to quantify the factors (fixed: experience, modality, AI assistance; random: radiologist) influencing diagnostic accuracy and confidence. RESULTS When assessing if the correct diagnosis was among the top-3 differential diagnoses, diagnostic accuracy improved slightly from 181/240 (75.4%, unassisted) to 188/240 (78.3%, AI-assisted). Similar improvements were found when only the top differential diagnosis was considered. AI assistance was used in 77.5% of the readings. Three hundred nine prompts were generated, primarily involving differential diagnoses (59.1%) and imaging features of specific conditions (27.5%). Diagnostic confidence was significantly higher when readings were AI-assisted (p < 0.001). Twenty-three responses (7.4%) were classified as hallucinations, while two (0.6%) were misinterpretations. CONCLUSION Integrating GPT-4 in the diagnostic process improved diagnostic accuracy slightly and diagnostic confidence significantly. Potentially harmful hallucinations and misinterpretations call for caution and highlight the need for further safeguarding measures. CLINICAL RELEVANCE STATEMENT Using GPT-4 as a virtual assistant when reading images made six radiologists of different experience levels feel more confident and provide more accurate diagnoses; yet, GPT-4 gave factually incorrect and potentially harmful information in 7.4% of its responses.
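The linear mixed-effects setup described above can be sketched with statsmodels; the snippet below is a minimal illustration on toy data (column names and values are hypothetical, not the study's dataset), with fixed effects for experience, modality, and AI assistance and a random intercept per radiologist.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy long-format data: one row per reading (all values hypothetical)
readings = pd.DataFrame({
    "confidence": [3.2, 4.1, 3.8, 4.5, 2.9, 3.7, 4.0, 4.6, 3.1, 3.9, 4.2, 4.8],
    "experience": [2] * 4 + [6] * 4 + [12] * 4,      # reader experience, years
    "modality":   ["CT", "CT", "MRI", "MRI"] * 3,    # imaging modality
    "assisted":   [0, 1, 0, 1] * 3,                  # 1 = GPT-4 assistance used
    "reader":     ["r1"] * 4 + ["r2"] * 4 + ["r3"] * 4,
})

# Fixed effects: experience, modality, AI assistance; random intercept: radiologist
model = smf.mixedlm("confidence ~ experience + C(modality) + assisted",
                    data=readings, groups=readings["reader"])
print(model.fit().summary())
```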
Affiliation(s)
- Robert Siepmann, Marc Huppertz, Annika Rastkhiz, Matthias Reen, Eric Corban, Christian Schmidt, Stephan Wilke, Philipp Schad, Can Yüksel, Christiane Kuhl, Daniel Truhn, Sven Nebelung: Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
2
Soleimani M, Seyyedi N, Ayyoubzadeh SM, Kalhori SRN, Keshavarz H. Practical Evaluation of ChatGPT Performance for Radiology Report Generation. Acad Radiol 2024:S1076-6332(24)00454-9. PMID: 39142976; DOI: 10.1016/j.acra.2024.07.020.
Abstract
RATIONALE AND OBJECTIVES The process of generating radiology reports is often time-consuming and labor-intensive, and prone to incompleteness, heterogeneity, and errors. By employing natural language processing (NLP)-based techniques, this study explores the potential for enhancing the efficiency of radiology report generation through the capabilities of ChatGPT (Generative Pre-trained Transformer), a prominent large language model (LLM). MATERIALS AND METHODS Using a sample of 1000 records from the Medical Information Mart for Intensive Care (MIMIC) Chest X-ray Database, this investigation employed Claude.ai to extract initial radiological report keywords. ChatGPT then generated radiology reports using a consistent 3-step prompt template outline. Various lexical and sentence similarity techniques were employed to evaluate the correspondence between the AI assistant-generated reports and reference reports authored by medical professionals. RESULTS Results showed varying performance among NLP models, with BART (Bidirectional and Auto-Regressive Transformers) and XLM (Cross-lingual Language Model) displaying high proficiency (mean similarity scores up to 99.3%), closely mirroring physician reports. Conversely, DeBERTa (Decoding-enhanced BERT with disentangled attention) and sequence-matching models scored lower, indicating less alignment with medical language. In the Impression section, the word-embedding model excelled with a mean similarity of 84.4%, while others like the Jaccard index showed lower performance. CONCLUSION Overall, the study highlights significant variations across NLP models in their ability to generate radiology reports consistent with medical professionals' language. Pairwise comparisons and Kruskal-Wallis tests confirmed these differences, emphasizing the need for careful selection and evaluation of NLP models in radiology report generation. This research underscores the potential of ChatGPT to streamline and improve the radiology reporting process, with implications for enhancing efficiency and accuracy in clinical practice.
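As a concrete illustration of the lexical similarity techniques named above, the sketch below (illustrative report sentences, not study data) computes a TF-IDF cosine similarity and a Jaccard index between a reference report and a generated one using scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference = "Heart size is normal. No focal consolidation, pleural effusion, or pneumothorax."
generated = "The heart is normal in size. No consolidation, effusion, or pneumothorax is seen."

# Cosine similarity over TF-IDF vectors of the two reports
tfidf = TfidfVectorizer().fit_transform([reference, generated])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Jaccard index over lowercase word sets
ref_set, gen_set = set(reference.lower().split()), set(generated.lower().split())
jaccard = len(ref_set & gen_set) / len(ref_set | gen_set)

print(f"cosine={cos:.3f}, jaccard={jaccard:.3f}")
```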
Affiliation(s)
- Mohsen Soleimani, Navisa Seyyedi: Department of Health Information Management and Medical Informatics, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran
- Seyed Mohammad Ayyoubzadeh: Department of Health Information Management and Medical Informatics, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran; Health Information Management Research Centre, Tehran University of Medical Sciences, Tehran, Iran
- Sharareh Rostam Niakan Kalhori: Department of Health Information Management and Medical Informatics, School of Allied Medical Sciences, Tehran University of Medical Sciences, Tehran, Iran; Peter L. Reichertz Institute for Medical Informatics, TU Braunschweig and Hannover Medical School, Braunschweig, Germany
- Hamidreza Keshavarz: Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran
3
Chow JC, Cheng TY, Chien TW, Chou W. Assessing ChatGPT's Capability for Multiple Choice Questions Using RaschOnline: Observational Study. JMIR Form Res 2024; 8:e46800. PMID: 39115919; PMCID: PMC11346125; DOI: 10.2196/46800.
Abstract
BACKGROUND ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, few studies have assessed ChatGPT's competence in answering multiple-choice questions (MCQs) using the KIDMAP of Rasch analysis, a web-based tool for evaluating MCQ performance. OBJECTIVE This study aims to (1) showcase the utility of the website (Rasch analysis, specifically RaschOnline) and (2) determine the grade achieved by ChatGPT when compared with a normal sample. METHODS ChatGPT's capability was evaluated using 10 items from the English test of the 2023 Taiwan college entrance examinations. Under a Rasch model, 300 students with normally distributed abilities were simulated to compete with ChatGPT's responses. RaschOnline was used to generate 5 visual presentations, including item difficulties, differential item functioning, item characteristic curves, a Wright map, and a KIDMAP, to address the research objectives. RESULTS The findings revealed the following: (1) the difficulty of the 10 items increased monotonically from easier to harder, represented by logits (-2.43, -1.78, -1.48, -0.64, -0.1, 0.33, 0.59, 1.34, 1.7, and 2.47); (2) evidence of differential item functioning was observed between gender groups for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT's capability was graded as A, surpassing grades B to E. CONCLUSIONS By using RaschOnline, this study provides evidence that ChatGPT can achieve a grade of A when compared with a normal sample. It exhibits excellent proficiency in answering MCQs from the English test of the 2023 Taiwan college entrance examinations.
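For readers unfamiliar with the dichotomous Rasch model behind RaschOnline, the probability that an examinee of ability theta answers an item of difficulty b correctly is P = exp(theta - b) / (1 + exp(theta - b)). The sketch below evaluates this curve at the item difficulties reported above; the ability value is a hypothetical illustration, not ChatGPT's estimated measure.

```python
import numpy as np

def rasch_p(theta, b):
    """Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Item difficulties (logits) reported in the abstract
difficulties = np.array([-2.43, -1.78, -1.48, -0.64, -0.10,
                          0.33, 0.59, 1.34, 1.70, 2.47])

theta = 2.0  # hypothetical high-ability examinee, in logits
print(rasch_p(theta, difficulties).round(2))  # expected success probability per item
```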
Affiliation(s)
- Julie Chi Chow: Department of Pediatrics, Chi Mei Medical Center, Tainan, Taiwan; Department of Pediatrics, School of Medicine, College of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Teng Yun Cheng: Department of Emergency Medicine, Chi Mei Medical Center, Tainan, Taiwan
- Tsair-Wei Chien: Department of Statistics, Coding Data Analytics, Tainan, Taiwan
- Willy Chou: Department of Physical Medicine and Rehabilitation, Chi Mei Medical Center, Tainan, Taiwan; Department of Leisure and Sports Management, Far East University, Tainan, Taiwan
4
Young CC, Enichen E, Rao A, Hilker S, Butler A, Laird-Gion J, Succi MD. Pilot Study of Large Language Models as an Age-Appropriate Explanatory Tool for Chronic Pediatric Conditions. medRxiv [preprint] 2024:2024.08.06.24311544. PMID: 39148860; PMCID: PMC11326333; DOI: 10.1101/2024.08.06.24311544.
Abstract
There is a gap in existing patient education resources for children with chronic conditions. This pilot study assesses large language models' (LLMs) capacity to deliver developmentally appropriate explanations of chronic conditions to pediatric patients. Two commonly used LLMs generated responses that accurately, appropriately, and effectively communicated complex medical information, making them potentially valuable tools for enhancing patient understanding and engagement in clinical settings.
Affiliation(s)
- Cameron C. Young, Elizabeth Enichen, Arya Rao: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA
- Sidney Hilker, Alex Butler, Jessica Laird-Gion: Harvard Medical School, Boston, MA; Boston Children's Hospital, Boston, MA
- Marc D. Succi: Harvard Medical School, Boston, MA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center, Mass General Brigham, Boston, MA; Department of Radiology, Massachusetts General Hospital, Boston, MA
5
Garg N, Campbell DJ, Yang A, McCann A, Moroco AE, Estephan LE, Palmer WJ, Krein H, Heffelfinger R. Chatbots as Patient Education Resources for Aesthetic Facial Plastic Surgery: Evaluation of ChatGPT and Google Bard Responses. Facial Plast Surg Aesthet Med 2024. PMID: 38946595; DOI: 10.1089/fpsam.2023.0368.
Abstract
Background: ChatGPT and Google Bard™ are popular artificial intelligence chatbots with utility for patients, including those undergoing aesthetic facial plastic surgery. Objective: To compare the accuracy and readability of chatbot-generated responses to patient education questions regarding aesthetic facial plastic surgery using a response accuracy scale and readability testing. Method: ChatGPT and Google Bard™ were asked 28 identical questions using four prompts: none, patient friendly, eighth-grade level, and references. Accuracy was assessed using the Global Quality Scale (range: 1-5). Flesch-Kincaid grade level was calculated, and chatbot-provided references were analyzed for veracity. Results: Although 59.8% of responses were good quality (Global Quality Scale ≥4), ChatGPT generated more accurate responses than Google Bard™ on patient-friendly prompting (p < 0.001). Google Bard™ responses were of a significantly lower grade level than ChatGPT for all prompts (p < 0.05). Despite eighth-grade prompting, response grade level for both chatbots was high: ChatGPT (10.5 ± 1.8) and Google Bard™ (9.6 ± 1.3). Prompting for references yielded references in 108/108 chatbot responses. Forty-one (38.0%) citations were legitimate. Twenty (18.5%) accurately reported information from the cited reference. Conclusion: Although ChatGPT produced more accurate responses and at a higher education level than Google Bard™, both chatbots provided responses above recommended grade levels for patients and failed to provide accurate references.
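Grade-level figures like those above come from standard readability formulas; a minimal sketch follows, assuming the third-party textstat package and an illustrative response snippet (not one of the study's chatbot answers):

```python
import textstat  # assumed available: pip install textstat

response = ("Rhinoplasty reshapes the nose. Most swelling goes down within a "
            "few weeks, and you can usually return to work in one to two weeks. "
            "Your surgeon will discuss risks such as bleeding and infection.")

print("Flesch-Kincaid grade level:", textstat.flesch_kincaid_grade(response))
print("Flesch reading ease:", textstat.flesch_reading_ease(response))
```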
Affiliation(s)
- Neha Garg, Daniel J Campbell, Adam McCann, Annie E Moroco, Leonard E Estephan, William J Palmer, Howard Krein, Ryan Heffelfinger: Department of Otolaryngology - Head and Neck Surgery, Thomas Jefferson University Hospitals, Philadelphia, Pennsylvania, USA
- Angela Yang: Sidney Kimmel Medical College, Philadelphia, Pennsylvania, USA
6
Law S, Oldfield B, Yang W. ChatGPT/GPT-4 (large language models): Opportunities and challenges of perspective in bariatric healthcare professionals. Obes Rev 2024; 25:e13746. PMID: 38613164; DOI: 10.1111/obr.13746.
Abstract
ChatGPT/GPT-4 is a conversational large language model (LLM) based on artificial intelligence (AI). The potential application of LLMs as virtual assistants for bariatric healthcare professionals in education and practice may be promising if relevant and valid issues are actively examined and addressed. In general medical terms, it is possible that AI models like ChatGPT/GPT-4 will be deeply integrated into medical scenarios, improving medical efficiency and quality, and allowing doctors more time to communicate with patients and implement personalized health management. Chatbots based on AI have great potential in bariatric healthcare and may play an important role in predicting and intervening in weight loss and obesity-related complications. However, given its potential limitations, we should carefully consider the medical, legal, ethical, data security, privacy, and liability issues arising from medical errors caused by ChatGPT/GPT-4. This concern also extends to ChatGPT/GPT-4's ability to justify wrong decisions, and there is an urgent need for appropriate guidelines and regulations to ensure the safe and responsible use of ChatGPT/GPT-4.
Affiliation(s)
- Saikam Law: Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China; School of Medicine, Jinan University, Guangzhou, China
- Brian Oldfield: Department of Physiology, Monash Biomedicine Discovery Institute, Monash University, Melbourne, Australia
- Wah Yang: Department of Metabolic and Bariatric Surgery, The First Affiliated Hospital of Jinan University, Guangzhou, China
7
Hasani AM, Singh S, Zahergivar A, Ryan B, Nethala D, Bravomontenegro G, Mendhiratta N, Ball M, Farhadi F, Malayeri A. Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol 2024; 34:3566-3574. PMID: 37938381; DOI: 10.1007/s00330-023-10384-x.
Abstract
OBJECTIVE Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4 AI-generated radiology reports. METHODS A comparative study design was employed: 100 anonymized radiology reports were randomly selected and analyzed. Each report was processed by GPT-4, resulting in the generation of a corresponding AI-generated report. Quantitative and qualitative analysis techniques were utilized to assess similarities and differences between the two sets of reports. RESULTS The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775. CONCLUSION The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice. CLINICAL RELEVANCE STATEMENT The findings of this study suggest that GPT-4, an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice. KEY POINTS • Large language model-generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports. • Performance metrics highlighted strong matching of word selection and order, as well as high semantic similarity between AI- and radiologist-generated reports. • Large language models demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.
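The overlap metrics quoted above are standard text-similarity measures; the sketch below (illustrative report fragments, not the study's data) computes a Sequence Matcher similarity and a smoothed BLEU score:

```python
from difflib import SequenceMatcher
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  # pip install nltk

radiologist = "no acute intracranial hemorrhage or mass effect".split()
generated   = "no acute hemorrhage and no mass effect identified".split()

# Sequence Matcher: ratio of matching token subsequences
sm_ratio = SequenceMatcher(None, radiologist, generated).ratio()

# BLEU with smoothing (short texts otherwise score zero on missing n-grams)
bleu = sentence_bleu([radiologist], generated,
                     smoothing_function=SmoothingFunction().method1)

print(f"sequence-matcher={sm_ratio:.2f}, bleu={bleu:.4f}")
```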
Affiliation(s)
- Amir M Hasani: Laboratory of Translational Research, National Heart, Lung, and Blood Institute, NIH, Bethesda, MD, USA
- Shiva Singh, Aryan Zahergivar, Faraz Farhadi, Ashkan Malayeri: Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA
- Beth Ryan, Daniel Nethala, Neil Mendhiratta, Mark Ball: Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA
8
Le KDR, Tay SBP, Choy KT, Verjans J, Sasanelli N, Kong JCH. Applications of natural language processing tools in the surgical journey. Front Surg 2024; 11:1403540. PMID: 38826809; PMCID: PMC11140056; DOI: 10.3389/fsurg.2024.1403540.
Abstract
Background Natural language processing tools are being increasingly adopted across industries worldwide. They have shown promising results; however, their use in the field of surgery remains under-recognised. Many small trials have assessed their benefits with promising results, but larger-scale evaluation is needed before broad adoption in surgery can be considered. This study aims to review the current research and insights into the potential for implementation of natural language processing tools in surgery. Methods A narrative review was conducted following a computer-assisted literature search of the Medline, EMBASE and Google Scholar databases. Papers related to natural language processing tools and considerations for their use in surgery were included. Results Current applications of natural language processing tools within surgery are limited. From the literature, there is evidence of potential improvement in surgical capability and service delivery, such as through the use of these technologies to streamline processes including surgical triaging, data collection and auditing, surgical communication and documentation. Additionally, there is potential to extend these capabilities to surgical academia to improve processes in surgical research and allow innovation in the development of educational resources. Despite these outcomes, the evidence supporting these findings is challenged by small sample sizes with limited applicability to broader settings. Conclusion With the increasing adoption of natural language processing technology, such as in popular forms like ChatGPT, there has been increasing research into the use of these tools within surgery to improve surgical workflow and efficiency. This review highlights multifaceted applications of natural language processing within surgery, albeit with clear limitations due to the infancy of the infrastructure available to leverage these technologies. There remains room for more rigorous research into the broader capability of natural language processing technology within the field of surgery, and a need for cross-sectoral collaboration to understand the ways in which these algorithms can best be integrated.
Affiliation(s)
- Khang Duy Ricky Le: Department of General Surgical Specialties, The Royal Melbourne Hospital, Melbourne, VIC, Australia; Department of Surgical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Geelong Clinical School, Deakin University, Geelong, VIC, Australia; Department of Medical Education, The University of Melbourne, Melbourne, VIC, Australia
- Samuel Boon Ping Tay: Department of Anaesthesia and Pain Medicine, Eastern Health, Box Hill, VIC, Australia
- Kay Tai Choy: Department of Surgery, Austin Health, Melbourne, VIC, Australia
- Johan Verjans: Australian Institute for Machine Learning (AIML), University of Adelaide, Adelaide, SA, Australia; Lifelong Health Theme (Platform AI), South Australian Health and Medical Research Institute, Adelaide, SA, Australia
- Nicola Sasanelli: Division of Information Technology, Engineering and the Environment, University of South Australia, Adelaide, SA, Australia; Department of Operations (Strategic and International Partnerships), SmartSAT Cooperative Research Centre, Adelaide, SA, Australia; Agora High Tech, Adelaide, SA, Australia
- Joseph C. H. Kong: Department of Surgical Oncology, Peter MacCallum Cancer Centre, Melbourne, VIC, Australia; Monash University Department of Surgery, Alfred Hospital, Melbourne, VIC, Australia; Department of Colorectal Surgery, Alfred Hospital, Melbourne, VIC, Australia; Sir Peter MacCallum Department of Oncology, The University of Melbourne, Melbourne, VIC, Australia
9
Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology. Eye (Lond) 2024; 38:1252-1261. PMID: 38172581; PMCID: PMC11076576; DOI: 10.1038/s41433-023-02915-z.
Abstract
ChatGPT, an artificial intelligence (AI) chatbot built on large language models (LLMs), has rapidly gained popularity. The benefits and limitations of this transformative technology have been discussed across various fields, including medicine. The widespread availability of ChatGPT has enabled clinicians to study how these tools could be used for a variety of tasks such as generating differential diagnosis lists, organizing patient notes, and synthesizing literature for scientific research. LLMs have shown promising capabilities in ophthalmology by performing well on the Ophthalmic Knowledge Assessment Program, providing fairly accurate responses to questions about retinal diseases, and generating differential diagnosis lists. There are current limitations to this technology, including the propensity of LLMs to "hallucinate", or confidently generate false information; their potential role in perpetuating biases in medicine; and the challenges of incorporating LLMs into research without allowing "AI plagiarism" or publication of false information. In this paper, we provide a balanced overview of what LLMs are and introduce some of the LLMs that have been developed in the past few years. We discuss recent literature evaluating the role of these language models in medicine, with a focus on ChatGPT. The field of AI is fast-paced, and new applications based on LLMs are being generated rapidly; therefore, it is important for ophthalmologists to be aware of how this technology works and how it may impact patient care. Here, we discuss the benefits, limitations, and future advancements of LLMs in patient care and research.
Affiliation(s)
- Nikita Kedia, Jay Chhablani: Department of Ophthalmology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Joshua Ong: Department of Ophthalmology and Visual Sciences, University of Michigan Kellogg Eye Center, Ann Arbor, MI, USA
10
Allan P, Knight M, Evans R, Narayanan A. Artificial intelligence is poised to usher in a paradigm shift in surgery: application of ChatGPT in Aotearoa New Zealand and Australia. ANZ J Surg 2024; 94:780-781. PMID: 38616527; DOI: 10.1111/ans.19000.
Affiliation(s)
- Philip Allan: Vascular, Endovascular & Transplantation Service, Wellington Regional Hospital, Wellington, New Zealand; Vascular & Endovascular Surgery, Waikato Hospital, Waikato, New Zealand
- Michael Knight, Richard Evans: Vascular, Endovascular & Transplantation Service, Wellington Regional Hospital, Wellington, New Zealand
- Anantha Narayanan: Vascular, Endovascular & Transplantation Service, Wellington Regional Hospital, Wellington, New Zealand; Vascular & Endovascular Surgery, Waikato Hospital, Waikato, New Zealand; Department of Surgery, Faculty of Medical and Health Sciences, University of Auckland, Auckland, New Zealand
11
Kasapovic A, Ali T, Babasiz M, Bojko J, Gathen M, Kaczmarczyk R, Roos J. Does the Information Quality of ChatGPT Meet the Requirements of Orthopedics and Trauma Surgery? Cureus 2024; 16:e60318. PMID: 38882956; PMCID: PMC11177007; DOI: 10.7759/cureus.60318.
Abstract
BACKGROUND The integration of artificial intelligence (AI) in medicine, particularly through AI-based language models like ChatGPT, offers a promising avenue for enhancing patient education and healthcare delivery. This study aims to evaluate the quality of medical information provided by Chat Generative Pre-trained Transformer (ChatGPT) regarding common orthopedic and trauma surgical procedures, assess its limitations, and explore its potential as a supplementary source for patient education. METHODS Using the GPT-3.5-Turbo version of ChatGPT, simulated patient information was generated for 20 orthopedic and trauma surgical procedures. The study utilized standardized information forms as a reference for evaluating ChatGPT's responses. The accuracy and quality of the provided information were assessed using a modified DISCERN instrument, and a global medical assessment was conducted to categorize the information's usefulness and reliability. RESULTS ChatGPT mentioned an average of 47% of relevant keywords across procedures, with mention rates ranging from 30.5% to 68.6%. The average modified DISCERN (mDISCERN) score was 2.4 out of 5, indicating moderate to low information quality. None of the ChatGPT-generated fact sheets were rated as "very useful," with 45% deemed "somewhat useful," 35% "not useful," and 20% classified as "dangerous." A positive correlation was found between higher mDISCERN scores and better physician ratings, suggesting that information quality directly impacts perceived utility. CONCLUSION While AI-based language models like ChatGPT hold significant promise for medical education and patient care, the current quality of information provided in the field of orthopedics and trauma surgery is suboptimal. Further development and refinement of AI sources and algorithms are necessary to improve the accuracy and reliability of medical information. This study underscores the need for ongoing research and development in AI applications in healthcare, emphasizing the critical role of accurate, high-quality information in patient education and informed consent processes.
Affiliation(s)
- Adnan Kasapovic, Thaer Ali, Mari Babasiz, Jessica Bojko, Martin Gathen, Jonas Roos: Department of Orthopedics and Trauma Surgery, University Hospital of Bonn, Bonn, Germany
- Robert Kaczmarczyk: Department of Dermatology and Allergy, School of Medicine, Technical University of Munich, Munich, Germany
12
Ha LT, Kelley KD. Artificial Intelligence: Promise or Pitfalls? A Clinical Vignette of Real-Life ChatGPT Implementation in Perioperative Medicine. J Gen Intern Med 2024; 39:1063-1067. PMID: 38252252; DOI: 10.1007/s11606-024-08611-2.
Affiliation(s)
- Leslie Thienly Ha: Department of Internal Medicine, University of California, Davis, Davis, USA; Sacramento, USA
- Kristen D Kelley: Department of Internal Medicine, University of California, Davis, Davis, USA
13
Mert S, Stoerzer P, Brauer J, Fuchs B, Haas-Lützenberger EM, Demmer W, Giunta RE, Nuernberger T. Diagnostic power of ChatGPT 4 in distal radius fracture detection through wrist radiographs. Arch Orthop Trauma Surg 2024; 144:2461-2467. PMID: 38578309; PMCID: PMC11093861; DOI: 10.1007/s00402-024-05298-2.
Abstract
Distal radius fractures rank among the most prevalent fractures in humans, necessitating accurate radiological imaging and interpretation for optimal diagnosis and treatment. In addition to human radiologists, artificial intelligence systems are increasingly employed for radiological assessments. Since 2023, ChatGPT 4 has offered image analysis capabilities, which can also be used for the analysis of wrist radiographs. This study evaluates the diagnostic power of ChatGPT 4 in identifying distal radius fractures, comparing it with a board-certified radiologist, a hand surgery resident, a medical student, and the well-established AI Gleamer BoneView™. Results demonstrate good diagnostic accuracy for ChatGPT 4 (sensitivity 0.88, specificity 0.98, diagnostic power (AUC) 0.93), significantly surpassing the medical student (sensitivity 0.98, specificity 0.72, AUC 0.85; p = 0.04). Nevertheless, the diagnostic power of ChatGPT 4 lags behind the hand surgery resident (sensitivity 0.99, specificity 0.98, AUC 0.985; p = 0.014) and Gleamer BoneView™ (sensitivity 1.00, specificity 0.98, AUC 0.99; p = 0.006). This study highlights the utility and potential applications of artificial intelligence in modern medicine, emphasizing ChatGPT 4 as a valuable tool for enhancing diagnostic capabilities in the field of medical imaging.
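Sensitivity, specificity, and AUC of the kind reported above follow directly from binary fracture calls against ground truth; a minimal sketch with hypothetical labels follows (note that for binary predictions the ROC AUC reduces to the mean of sensitivity and specificity):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical per-radiograph labels: 1 = distal radius fracture present
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])  # a reader's binary calls

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_true, y_pred)  # equals (sensitivity + specificity) / 2 here

print(f"sens={sensitivity:.2f}, spec={specificity:.2f}, auc={auc:.2f}")
```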
Affiliation(s)
- Sinan Mert, Patrick Stoerzer, Johannes Brauer, Benedikt Fuchs, Wolfram Demmer, Riccardo E Giunta, Tim Nuernberger: Division of Hand, Plastic and Aesthetic Surgery, LMU University Hospital, LMU Munich, 80336, München, Germany
14
Rao A, Kim J, Lie W, Pang M, Fuh L, Dreyer KJ, Succi MD. Proactive Polypharmacy Management Using Large Language Models: Opportunities to Enhance Geriatric Care. J Med Syst 2024; 48:41. PMID: 38632172; DOI: 10.1007/s10916-024-02058-y.
Abstract
Polypharmacy remains an important challenge for patients with extensive medical complexity. Given the primary care shortage and the increasing aging population, effective polypharmacy management is crucial to manage the increasing burden of care. The capacity of large language model (LLM)-based artificial intelligence to aid in polypharmacy management has yet to be evaluated. Here, we evaluate ChatGPT's performance in polypharmacy management via its deprescribing decisions in standardized clinical vignettes. We entered clinical vignettes, originally from a study of general practitioners' deprescribing decisions, into ChatGPT 3.5, a publicly available LLM, and evaluated its capacity for yes/no binary deprescribing decisions as well as list-based prompts in which the model was prompted to choose which of several medications to deprescribe. We recorded ChatGPT responses to yes/no binary deprescribing prompts and the number and types of medications deprescribed. In yes/no binary deprescribing decisions, ChatGPT universally recommended deprescribing medications regardless of activities of daily living (ADL) status in patients with no overlying cardiovascular disease (CVD) history; in patients with CVD history, ChatGPT's answers varied by technical replicate. The total number of medications deprescribed ranged from 2.67 to 3.67 (out of 7) and did not vary with CVD status, but increased linearly with severity of ADL impairment. Among medication types, ChatGPT preferentially deprescribed pain medications. ChatGPT's deprescribing decisions vary along the axes of ADL status, CVD history, and medication type, indicating some concordance of internal logic between general practitioners and the model. These results indicate that specifically trained LLMs may provide useful clinical support in polypharmacy management for primary care physicians.
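A hedged reproducibility sketch of the yes/no deprescribing prompt described above, using the openai Python client (v1 API); the vignette wording, model choice, and temperature setting are illustrative assumptions, not the study's exact protocol:

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative vignette, not one of the study's standardized cases
vignette = ("An 80-year-old patient with impaired activities of daily living and "
            "no cardiovascular history takes 7 regular medications, including "
            "two analgesics. Would you deprescribe any medication? Answer yes "
            "or no, then list which ones.")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": vignette}],
    temperature=0,  # reduce run-to-run variability across technical replicates
)
print(response.choices[0].message.content)
```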
Affiliation(s)
- Arya Rao, John Kim, Winston Lie, Michael Pang, Marc D Succi: Harvard Medical School, Boston, MA, USA; Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, USA; Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Lanting Fuh: Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, USA
- Keith J Dreyer: Harvard Medical School, Boston, MA, USA; Data Science Office, Mass General Brigham, Boston, MA, USA
15
Bajaj S, Gandhi D, Nayar D. Potential Applications and Impact of ChatGPT in Radiology. Acad Radiol 2024; 31:1256-1261. PMID: 37802673; DOI: 10.1016/j.acra.2023.08.039.
Abstract
Radiology has always gone hand-in-hand with technology, and artificial intelligence (AI) is not new to the field. While various AI devices and algorithms have already been integrated into the daily clinical practice of radiology, with applications ranging from scheduling patient appointments to detecting and diagnosing certain clinical conditions on imaging, the use of natural language processing and large language model-based software has been under discussion for a long time. Algorithms like ChatGPT can help improve patient outcomes, increase the efficiency of radiology interpretation, and aid the overall workflow of radiologists; here we discuss some of their potential applications.
Affiliation(s)
- Suryansh Bajaj: Department of Radiology, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205
- Darshan Gandhi: Department of Diagnostic Radiology, University of Tennessee Health Science Center, Memphis, Tennessee 38103
- Divya Nayar: Department of Neurology, University of Arkansas for Medical Sciences, Little Rock, Arkansas 72205
16
Karimov Z, Allahverdiyev I, Agayarov OY, Demir D, Almuradova E. ChatGPT vs UpToDate: comparative study of usefulness and reliability of Chatbot in common clinical presentations of otorhinolaryngology-head and neck surgery. Eur Arch Otorhinolaryngol 2024; 281:2145-2151. PMID: 38217726; PMCID: PMC10942922; DOI: 10.1007/s00405-023-08423-w.
Abstract
PURPOSE The use of chatbots, a form of artificial intelligence, in medicine has been increasing in recent years. UpToDate® is a well-known search tool built on evidence-based knowledge and used daily by doctors worldwide. In this study, we aimed to investigate the usefulness and reliability of ChatGPT compared with UpToDate in otorhinolaryngology and head and neck surgery (ORL-HNS). MATERIALS AND METHODS ChatGPT-3.5 and UpToDate were queried on the management of 25 common clinical case scenarios (13 male/12 female) drawn from the literature and reflecting daily practice at the Department of Otorhinolaryngology of Ege University Faculty of Medicine. Scientific references for the management were requested for each clinical case. The accuracy of the references in the ChatGPT answers was assessed on a 0-2 scale, and the usefulness of the ChatGPT and UpToDate answers was scored from 1 to 3 by reviewers. UpToDate and ChatGPT-3.5 responses were compared. RESULTS ChatGPT did not give references for some questions, in contrast to UpToDate. ChatGPT's information was limited to 2021. UpToDate supported its answers with subheadings, tables, figures, and algorithms. The mean accuracy score of references in ChatGPT answers was 0.25 (weak/unrelated). The median (Q1-Q3) usefulness score was 1.00 (1.25-2.00) for ChatGPT and 2.63 (2.75-3.00) for UpToDate; the difference was statistically significant (p < 0.001). UpToDate was observed to be more useful and reliable than ChatGPT. CONCLUSIONS ChatGPT has the potential to help physicians find information, but our results suggest that ChatGPT needs to be improved to increase the usefulness and reliability of its medical evidence-based knowledge.
Affiliation(s)
- Ziya Karimov: Medicine Program, Ege University Faculty of Medicine, 35100, Izmir, Türkiye
- Irshad Allahverdiyev: Medicine Program, Istanbul University, Istanbul Faculty of Medicine, Istanbul, Türkiye
- Ozlem Yagiz Agayarov, Dogukan Demir: Department of Otolaryngology-Head and Neck Surgery, Izmir Tepecik Education and Research Hospital, Health Sciences University, Izmir, Türkiye
- Elvina Almuradova: Department of Medical Oncology, Ege University Faculty of Medicine, Izmir, Türkiye; Department of Oncology, Medicana International Hospital, Izmir, Türkiye
17
Temperley HC, O'Sullivan NJ, Mac Curtain BM, Corr A, Meaney JF, Kelly ME, Brennan I. Current applications and future potential of ChatGPT in radiology: A systematic review. J Med Imaging Radiat Oncol 2024; 68:257-264. PMID: 38243605; DOI: 10.1111/1754-9485.13621.
Abstract
This study aimed to comprehensively evaluate the current utilization and future potential of ChatGPT, an AI-based chat model, in the field of radiology. The primary focus is on its role in enhancing decision-making processes, optimizing workflow efficiency, and fostering interdisciplinary collaboration and teaching within healthcare. A systematic search was conducted in the PubMed, EMBASE and Web of Science databases. Key aspects, such as its impact on complex decision-making, workflow enhancement and collaboration, were assessed. Limitations and challenges associated with ChatGPT implementation were also examined. Overall, six studies met the inclusion criteria and were included in our analysis. All studies were prospective in nature. A total of 551 ChatGPT (versions 3.0 to 4.0) assessment events were included in our analysis. When generating academic papers, ChatGPT output data inaccuracies 80% of the time. When asked questions regarding common interventional radiology procedures, it provided entirely incorrect information 45% of the time. ChatGPT answered US board-style questions better when lower-order thinking was required (P = 0.002). Improvements were seen between ChatGPT 3.5 and 4.0 with regard to imaging questions, with accuracy rates of 61% versus 85% (P = 0.009). ChatGPT had an average translational ability score of 4.27/5 on a Likert scale regarding CT and MRI findings. ChatGPT demonstrates substantial potential to augment decision-making and optimize workflow. While ChatGPT's promise is evident, thorough evaluation and validation are imperative before widespread adoption in the field of radiology.
Affiliation(s)
- Hugo C Temperley: Department of Radiology, St. James's Hospital, Dublin, Ireland; Department of Surgery, St. James's Hospital, Dublin, Ireland
- Alison Corr, James F Meaney, Ian Brennan: Department of Radiology, St. James's Hospital, Dublin, Ireland
- Michael E Kelly: Department of Surgery, St. James's Hospital, Dublin, Ireland
18
Zampatti S, Peconi C, Megalizzi D, Calvino G, Trastulli G, Cascella R, Strafella C, Caltagirone C, Giardina E. Innovations in Medicine: Exploring ChatGPT's Impact on Rare Disorder Management. Genes (Basel) 2024; 15:421. PMID: 38674356; PMCID: PMC11050022; DOI: 10.3390/genes15040421.
Abstract
Artificial intelligence (AI) is rapidly transforming the field of medicine, heralding a new era of innovation and efficiency. Among AI programs designed for general use, ChatGPT holds a prominent position, using an innovative language model developed by OpenAI. Thanks to the use of deep learning techniques, ChatGPT stands out as an exceptionally viable tool, renowned for generating human-like responses to queries. Various medical specialties, including rheumatology, oncology, psychiatry, internal medicine, and ophthalmology, have been explored for ChatGPT integration, with pilot studies and trials revealing each field's potential benefits and challenges. However, the fields of genetics, genetic counseling, and rare disorders remain areas ripe for exploration, with their complex datasets and the need for personalized patient care. In this review, we synthesize the wide range of potential applications for ChatGPT in the medical field, highlighting its benefits and limitations. We pay special attention to rare and genetic disorders, aiming to shed light on the future roles of AI-driven chatbots in healthcare. Our goal is to pave the way for a healthcare system that is more knowledgeable, efficient, and centered around patient needs.
Affiliation(s)
- Stefania Zampatti, Cristina Peconi, Claudia Strafella: Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy
- Domenica Megalizzi, Giulia Calvino: Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy; Department of Science, Roma Tre University, 00146 Rome, Italy
- Giulia Trastulli: Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy; Department of System Medicine, Tor Vergata University, 00133 Rome, Italy
- Raffaella Cascella: Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy; Department of Chemical-Toxicological and Pharmacological Evaluation of Drugs, Catholic University Our Lady of Good Counsel, 1000 Tirana, Albania
- Carlo Caltagirone: Department of Clinical and Behavioral Neurology, IRCCS Fondazione Santa Lucia, 00179 Rome, Italy
- Emiliano Giardina: Genomic Medicine Laboratory UILDM, IRCCS Santa Lucia Foundation, 00179 Rome, Italy; Department of Biomedicine and Prevention, Tor Vergata University, 00133 Rome, Italy
19
Yim D, Khuntia J, Parameswaran V, Meyers A. Preliminary Evidence of the Use of Generative AI in Health Care Clinical Services: Systematic Narrative Review. JMIR Med Inform 2024; 12:e52073. PMID: 38506918; PMCID: PMC10993141; DOI: 10.2196/52073.
Abstract
BACKGROUND Generative artificial intelligence tools and applications (GenAI) are being increasingly used in health care. Physicians, specialists, and other providers have started using GenAI primarily as an aid or tool to gather knowledge, provide information, train, or generate suggestive dialogue between physicians and patients or between physicians and patients' families or friends. However, unless the use of GenAI is oriented to be helpful in clinical service encounters that can improve the accuracy of diagnosis, treatment, and patient outcomes, the expected potential will not be achieved. As adoption continues, it is essential to validate the effectiveness of the infusion of GenAI as an intelligent technology in service encounters to understand the gap in actual clinical service use of GenAI. OBJECTIVE This study synthesizes preliminary evidence on how GenAI assists, guides, and automates clinical service rendering and encounters in health care. The review scope was limited to articles published in peer-reviewed medical journals. METHODS We screened and selected 0.38% (161/42,459) of articles published between January 1, 2020, and May 31, 2023, identified from PubMed. We followed the protocols outlined in the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines to select highly relevant studies with at least 1 element on clinical use, evaluation, and validation to provide evidence of GenAI use in clinical services. The articles were classified based on their relevance to clinical service functions or activities using the descriptive and analytical information presented in the articles. RESULTS Of 161 articles, 141 (87.6%) reported using GenAI to assist services through knowledge access, collation, and filtering. GenAI was used for disease detection (19/161, 11.8%), diagnosis (14/161, 8.7%), and screening processes (12/161, 7.5%) in the areas of radiology (17/161, 10.6%), cardiology (12/161, 7.5%), gastrointestinal medicine (4/161, 2.5%), and diabetes (6/161, 3.7%). The literature synthesis suggests that GenAI is mainly used for diagnostic processes, improvement of diagnostic accuracy, and screening and diagnostic purposes using knowledge access. Although this solves the problem of knowledge access and may improve diagnostic accuracy, it is oriented toward higher value creation in health care. CONCLUSIONS GenAI informs rather than assists or automates clinical service functions in health care. There is potential in clinical services, but it has yet to be actualized for GenAI. More clinical service-level evidence is needed that GenAI streamlines some functions or provides more automated help than information retrieval alone. To transform health care as purported, more studies of GenAI applications must automate and guide human-performed services and keep up with the optimism that forward-thinking health care organizations will take advantage of GenAI.
Affiliation(s)
- Dobin Yim
- Loyola University Maryland, MD, United States
- Jiban Khuntia
- University of Colorado Denver, Denver, CO, United States
- Arlen Meyers
- University of Colorado Denver, Denver, CO, United States

20
Moreno AC, Bitterman DS. Toward Clinical-Grade Evaluation of Large Language Models. Int J Radiat Oncol Biol Phys 2024; 118:916-920. [PMID: 38401979 PMCID: PMC11221761 DOI: 10.1016/j.ijrobp.2023.11.012]
Affiliation(s)
- Amy C Moreno
- Department of Radiation Oncology, University of Texas MD Anderson Cancer Center, Houston, Texas
- Danielle S Bitterman
- Department of Radiation Oncology, Brigham and Women's Hospital/Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts; Artificial Intelligence in Medicine Program, Mass General Brigham, Harvard Medical School, Boston, Massachusetts

21
Caterson J, Ambler O, Cereceda-Monteoliva N, Horner M, Jones A, Poacher AT. Application of generative language models to orthopaedic practice. BMJ Open 2024; 14:e076484. [PMID: 38485486 PMCID: PMC10941106 DOI: 10.1136/bmjopen-2023-076484]
Abstract
OBJECTIVE To explore whether the large language models (LLMs) Generative Pre-trained Transformer (GPT)-3 and ChatGPT can write clinical letters and predict management plans for common orthopaedic scenarios. DESIGN Fifteen scenarios were generated, and ChatGPT and GPT-3 were prompted to write clinical letters and, separately, to generate management plans for identical scenarios with the plans removed. MAIN OUTCOME MEASURES Letters were assessed for readability using the Readable tool. The accuracy of letters and management plans was assessed by three independent orthopaedic surgery clinicians. RESULTS Both models generated complete letters for all scenarios after a single prompt. Readability was compared using the Flesch-Kincaid Grade Level (ChatGPT: 8.77 (SD 0.918); GPT-3: 8.47 (SD 0.982)), Flesch Reading Ease (ChatGPT: 58.2 (SD 4.00); GPT-3: 59.3 (SD 6.98)), Simple Measure of Gobbledygook (SMOG) Index (ChatGPT: 11.6 (SD 0.755); GPT-3: 11.4 (SD 1.01)), and reach (ChatGPT: 81.2%; GPT-3: 80.3%). ChatGPT produced more accurate letters (8.7/10 (SD 0.60) vs 7.3/10 (SD 1.41), p=0.024) and management plans (7.9/10 (SD 0.63) vs 6.8/10 (SD 1.06), p<0.001) than GPT-3. However, both LLMs sometimes omitted key information or added additional guidance that was at worst inaccurate. CONCLUSIONS This study shows that LLMs are effective for the generation of clinical letters. With little prompting, their output is readable and mostly accurate. However, they are not consistent and include inappropriate omissions or insertions. Furthermore, management plans produced by LLMs are generic but often accurate. In the future, a healthcare-specific language model trained on accurate and secure data could provide an excellent tool for increasing the efficiency of clinicians through the summarisation of large volumes of data into a single clinical letter.
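For readers who want to reproduce readability scoring of this kind, a minimal sketch follows. It uses the open-source textstat package rather than the Readable tool used in the study, and the sample letter text is invented for illustration.

```python
# Minimal sketch: the readability metrics reported above, computed with the
# open-source `textstat` package (pip install textstat). This is a stand-in
# for the commercial Readable tool used in the study; the letter is invented.
import textstat

letter = (
    "Dear colleague, thank you for referring this 54-year-old patient with "
    "right knee osteoarthritis. Examination showed a moderate effusion and "
    "medial joint-line tenderness. We will arrange weight-bearing radiographs "
    "and review the patient in clinic in six weeks."
)

print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(letter))
print("Flesch Reading Ease:", textstat.flesch_reading_ease(letter))
print("SMOG Index:", textstat.smog_index(letter))
```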
Affiliation(s)
- Olivia Ambler
- Plastic Surgery, Morriston Hospital, Swansea, Wales, UK
- Matthew Horner
- Trauma Department, University Hospital of Wales, Cardiff, UK
- Trauma and Orthopaedic Surgery, University Hospital of Wales, Cardiff, UK
- Andrew Jones
- Orthopaedic Surgery, University Hospital of Wales, Cardiff, UK
- Arwel Tomos Poacher
- Trauma Department, University Hospital of Wales, Cardiff, UK
- School of Biosciences, Cardiff University, Cardiff, UK

22
Park YJ, Pillai A, Deng J, Guo E, Gupta M, Paget M, Naugler C. Assessing the research landscape and clinical utility of large language models: a scoping review. BMC Med Inform Decis Mak 2024; 24:72. [PMID: 38475802 DOI: 10.1186/s12911-024-02459-6]
Abstract
IMPORTANCE Large language models (LLMs) like OpenAI's ChatGPT are powerful generative systems that rapidly synthesize natural language responses. Research on LLMs has revealed their potential and pitfalls, especially in clinical settings. However, the evolving landscape of LLM research in medicine has left several gaps regarding their evaluation, application, and evidence base. OBJECTIVE This scoping review aims to (1) summarize current research evidence on the accuracy and efficacy of LLMs in medical applications, (2) discuss the ethical, legal, logistical, and socioeconomic implications of LLM use in clinical settings, (3) explore barriers and facilitators to LLM implementation in healthcare, (4) propose a standardized evaluation framework for assessing LLMs' clinical utility, and (5) identify evidence gaps and propose future research directions for LLMs in clinical applications. EVIDENCE REVIEW We screened 4,036 records from MEDLINE, EMBASE, CINAHL, medRxiv, bioRxiv, and arXiv from January 2023 (inception of the search) to June 26, 2023 for English-language papers and analyzed findings from 55 worldwide studies. Quality of evidence was reported based on the Oxford Centre for Evidence-based Medicine recommendations. FINDINGS Our results demonstrate that LLMs show promise in compiling patient notes, assisting patients in navigating the healthcare system, and to some extent, supporting clinical decision-making when combined with human oversight. However, their utilization is limited by biases in training data that may harm patients, the generation of inaccurate but convincing information, and ethical, legal, socioeconomic, and privacy concerns. We also identified a lack of standardized methods for evaluating LLMs' effectiveness and feasibility. CONCLUSIONS AND RELEVANCE This review thus highlights potential future directions and questions to address these limitations and to further explore LLMs' potential in enhancing healthcare delivery.
Affiliation(s)
- Ye-Jean Park
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, Toronto, ON M5S 1A8, Canada
- Abhinav Pillai
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, Calgary, AB T2N 4N1, Canada
- Jiawen Deng
- Temerty Faculty of Medicine, University of Toronto, 1 King's College Cir, Toronto, ON M5S 1A8, Canada
- Eddie Guo
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, Calgary, AB T2N 4N1, Canada
- Mehul Gupta
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, Calgary, AB T2N 4N1, Canada
- Mike Paget
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, Calgary, AB T2N 4N1, Canada
- Christopher Naugler
- Cumming School of Medicine, University of Calgary, 3330 Hospital Dr NW, Calgary, AB T2N 4N1, Canada

23
Sheth S, Baker HP, Prescher H, Strelzow JA. Ethical Considerations of Artificial Intelligence in Health Care: Examining the Role of Generative Pretrained Transformer-4. J Am Acad Orthop Surg 2024; 32:205-210. [PMID: 38175996 DOI: 10.5435/jaaos-d-23-00787]
Abstract
The integration of artificial intelligence technologies, such as large language models (LLMs), in health care holds potential for improved efficiency and decision support. However, ethical concerns must be addressed before widespread adoption. This article focuses on the ethical principles surrounding the use of Generative Pretrained Transformer-4 and its conversational model, ChatGPT, in healthcare settings. One concern is potential inaccuracy in generated content. LLMs can produce believable yet incorrect information, risking errors in medical records. The opacity of training data exacerbates this problem by hindering the assessment of accuracy. To mitigate this, LLMs should be trained on precise, validated medical data sets. Model bias is another critical concern, because LLMs may perpetuate biases from their training, leading to medically inaccurate and discriminatory responses. Sampling, programming, and compliance biases all contribute, necessitating careful consideration to avoid perpetuating harmful stereotypes. Privacy is paramount in health care, and using public LLMs raises risks. Strict data-sharing agreements and Health Insurance Portability and Accountability Act (HIPAA)-compliant training protocols are necessary to protect patient privacy. Although artificial intelligence technologies offer promising opportunities in health care, careful consideration of ethical principles is crucial. Addressing concerns of inaccuracy, bias, and privacy will ensure responsible and patient-centered implementation, benefiting both healthcare professionals and patients.
Affiliation(s)
- Suraj Sheth
- Department of Orthopaedic Surgery, The University of Chicago, Chicago, IL

24
Tunçer G, Güçlü KG. How Reliable is ChatGPT as a Novel Consultant in Infectious Diseases and Clinical Microbiology? Infectious Diseases & Clinical Microbiology 2024; 6:55-59. [PMID: 38633442 PMCID: PMC11020004 DOI: 10.36519/idcm.2024.286]
Abstract
Objective The study aimed to investigate the reliability of ChatGPT's answers to medical questions, including those sourced from patients and guideline recommendations. The focus was on evaluating ChatGPT's accuracy in responding to various types of infectious disease questions. Materials and Methods The study was conducted using 200 questions sourced from social media, experts, and guidelines, relating to various infectious diseases such as urinary tract infection, pneumonia, HIV, various types of hepatitis, COVID-19, skin infections, and tuberculosis. The questions were edited for clarity and consistency by excluding repetitive or unclear ones. The answers were scored against guidelines from reputable sources such as the Infectious Diseases Society of America (IDSA), the Centers for Disease Control and Prevention (CDC), the European Association for the Study of the Liver (EASL), and the Joint United Nations Programme on HIV/AIDS (UNAIDS) AIDSinfo. According to the scoring system, completely correct answers were given 1 point and completely incorrect ones 4 points. To assess reproducibility, each question was posed twice on separate computers, and repeatability was determined by the consistency of the answers' scores. Results ChatGPT was posed 200 questions: 107 from social media platforms and 93 from guidelines. The questions covered a range of topics: urinary tract infections (n=18), pneumonia (n=22), HIV (n=39), hepatitis B and C (n=53), COVID-19 (n=11), skin and soft tissue infections (n=38), and tuberculosis (n=19). The lowest accuracy was 72%, for urinary tract infections. ChatGPT answered 92% of social media platform questions completely correctly (scored 1 point) versus 69% of guideline questions (p=0.001; OR=5.48, 95% CI=2.29-13.11). Conclusion Artificial intelligence is widely used in the medical field by both healthcare professionals and patients. Although ChatGPT answered questions from social media platforms quite accurately, we recommend that healthcare professionals exercise caution when using it.
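The odds ratio and confidence interval quoted above come from a standard 2x2 contingency-table calculation; a minimal sketch follows, with assumed counts standing in for the study's raw data (the abstract reports only percentages).

```python
# Minimal sketch: odds ratio with a 95% CI from a 2x2 table of completely
# correct vs other answers by question source. The counts are assumptions
# for illustration, not the study's raw data.
import math

a, b = 98, 9    # social media questions: correct, other (assumed counts)
c, d = 64, 29   # guideline questions:    correct, other (assumed counts)

odds_ratio = (a * d) / (b * c)
se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.2f}, 95% CI = {lo:.2f}-{hi:.2f}")
```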
Affiliation(s)
- Gülşah Tunçer
- Bilecik Training and Research Hospital, Bilecik, Türkiye

25
Ma H, Ma X, Yang C, Niu Q, Gao T, Liu C, Chen Y. Development and evaluation of a program based on a generative pre-trained transformer model from a public natural language processing platform for efficiency enhancement in post-procedural quality control of esophageal endoscopic submucosal dissection. Surg Endosc 2024; 38:1264-1272. [PMID: 38097750 DOI: 10.1007/s00464-023-10620-x]
Abstract
BACKGROUND Post-procedural quality control of endoscopic submucosal dissection (ESD) is emphasized in guidelines. However, this process can be tedious and time-consuming. Recently, a pre-trained model called the generative pre-trained transformer (GPT), available on a public natural language processing platform, has emerged and garnered significant attention; its capabilities align well with the post-procedural quality control process and have the potential to streamline it. Therefore, we developed a simple program utilizing this platform and evaluated its performance. METHODS Esophageal ESDs were retrospectively included. The manual quality control process was performed and acted as the reference standard. GPT's prompt was optimized through multiple iterations. A Python program was developed to automatically submit the prompt together with the pathological report of each ESD procedure and to collect the quality control information provided by GPT. Performance on quality control was evaluated with accuracy, precision, recall, and F1-score. RESULTS 165 cases were included in the dataset, of which 5 were utilized as the prompt optimization dataset and 160 as the validation dataset. The definitive prompt was achieved through seven iterations. Time spent on the validation dataset by GPT was 13.47 ± 2.43 min. Accuracies for pathological diagnosis, invasion depth, horizontal margin, vertical margin, vascular invasion, and lymphatic invasion of the quality control program were (0.940, 0.952) (95% CI), (0.925, 0.945) (95% CI), 0.931, 1.0, and 1.0, respectively. Precisions were (0.965, 0.969) (95% CI), (0.934, 0.954) (95% CI), and 0.957 for pathological diagnosis, invasion depth, and horizontal margin, respectively. Recalls were (0.940, 0.952) (95% CI), (0.925, 0.945) (95% CI), and 0.931 for the factors as mentioned, respectively. F1-scores were (0.945, 0.957) (95% CI), (0.928, 0.948) (95% CI), and 0.941 for the factors as mentioned, respectively. CONCLUSIONS This quality control program was qualified for the post-procedural quality control of esophageal ESDs. GPT can easily be applied to this quality control process and can reduce the workload of endoscopists.
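The authors' Python program is not reproduced in the abstract; the sketch below only illustrates the general pattern it describes (submit an optimized prompt plus each pathology report, collect structured quality-control fields). The SDK, model name, prompt wording, and report text are all assumptions, not the authors' actual code.

```python
# Minimal sketch of the automation pattern described above, assuming the
# OpenAI Python SDK as the "public natural language processing platform".
# Model name, prompt, and report text are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "From the esophageal ESD pathology report below, return JSON with the keys "
    "pathological_diagnosis, invasion_depth, horizontal_margin, vertical_margin, "
    "vascular_invasion, lymphatic_invasion."
)

def quality_control(report_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{report_text}"}],
    )
    return json.loads(response.choices[0].message.content)

reports = ["(pathology report text for one ESD case)"]  # placeholder dataset
results = [quality_control(r) for r in reports]
```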
Affiliation(s)
- Huaiyuan Ma
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Xingbin Ma
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Chunxiao Yang
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Qiong Niu
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Tao Gao
- Endoscopy Center of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Chengxia Liu
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Endoscopy Center of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Yan Chen
- Department of Gastroenterology and Hepatology, Binzhou Medical University Hospital, Binzhou, 256603, Shandong, China
- Digestive Disease Research Institute of Binzhou Medical University Hospital, Binzhou, Shandong, China
- Endoscopy Center of Binzhou Medical University Hospital, Binzhou, Shandong, China

26
Posner KM, Bakus C, Basralian G, Chester G, Zeiman M, O'Malley GR, Klein GR. Evaluating ChatGPT's Capabilities on Orthopedic Training Examinations: An Analysis of New Image Processing Features. Cureus 2024; 16:e55945. [PMID: 38601421 PMCID: PMC11005479 DOI: 10.7759/cureus.55945]
Abstract
Introduction The efficacy of integrating artificial intelligence (AI) models like ChatGPT into the medical field, specifically orthopedic surgery, has yet to be fully determined. The most recent addition to ChatGPT that has yet to be explored is its image analysis capability. This study assesses ChatGPT's performance in answering Orthopedic In-Training Examination (OITE) questions, including those that require image analysis. Methods Questions from the 2014, 2015, 2021, and 2022 AAOS OITE were screened for inclusion. All questions without images were entered into ChatGPT 3.5 and 4.0 twice. Questions that necessitated the use of images were entered only into ChatGPT 4.0 (twice), as this is the only version of the system that can analyze images. The responses were recorded and compared to the AAOS's correct answers, evaluating the AI's accuracy and precision. Results A total of 940 questions were included in the final analysis (457 questions with images and 483 questions without images). ChatGPT 4.0 performed significantly better on questions that did not require image analysis (67.81% vs 47.59%, p<0.001). Discussion While the use of AI in orthopedics is an intriguing possibility, this evaluation demonstrates that, even with the addition of image processing capabilities, ChatGPT still falls short in accuracy. As AI technology evolves, ongoing research is vital to harness AI's potential effectively, ensuring it complements rather than attempts to replace the nuanced skills of orthopedic surgeons.
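A gap in proportions like the one reported above is conventionally tested on the underlying 2x2 table; the sketch below uses SciPy, with correct-answer counts back-calculated approximately from the reported percentages.

```python
# Minimal sketch: chi-square test on correct/incorrect counts by question
# type. Counts are approximate back-calculations from the percentages above,
# for illustration only.
from scipy.stats import chi2_contingency

#        correct  incorrect
table = [[328, 155],   # no image required (~67.8% of 483)
         [218, 239]]   # image required    (~47.6% of 457)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")
```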
Affiliation(s)
- Kevin M Posner
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Cassandra Bakus
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Grace Basralian
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Grace Chester
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Mallery Zeiman
- Department of Orthopedic Surgery, Hackensack Meridian School of Medicine, Nutley, USA
- Geoffrey R O'Malley
- Department of Orthopedic Surgery, Hackensack University Medical Center, Hackensack, USA
- Gregg R Klein
- Department of Orthopedic Surgery, Hackensack University Medical Center, Hackensack, USA

27
Williams SC, Starup-Hansen J, Funnell JP, Hanrahan JG, Valetopoulou A, Singh N, Sinha S, Muirhead WR, Marcus HJ. Can ChatGPT outperform a neurosurgical trainee? A prospective comparative study. Br J Neurosurg 2024:1-10. [PMID: 38305239 DOI: 10.1080/02688697.2024.2308222]
Abstract
PURPOSE This study aimed to compare the performance of ChatGPT, a large language model (LLM), with that of human neurosurgical applicants in a neurosurgical national selection interview, to assess the potential of artificial intelligence (AI) and LLMs in healthcare and provide insights into their integration into the field. METHODS In a prospective comparative study, a set of neurosurgical national selection-style interview questions was posed to eight human participants and to ChatGPT in an online interview. All human participants were doctors currently practising in the UK who had applied for a neurosurgical National Training Number. Interviews were recorded, anonymised, and scored by three neurosurgical consultants with experience as interviewers for national selection. Answers provided by ChatGPT were used as a template for a virtual interview, and the interview transcripts were subsequently scored by neurosurgical consultants using the criteria utilised in real national selection interviews. The overall interview score and subdomain scores were compared between the human participants and ChatGPT. RESULTS For overall score, ChatGPT fell behind six human competitors and did not achieve a mean score higher than any individual who achieved a training position. Several factors, including factual inaccuracies and deviations from the expected structure and style, may have contributed to ChatGPT's underperformance. CONCLUSIONS LLMs such as ChatGPT have huge potential for integration in healthcare. However, this study emphasises the need for further development to address limitations and challenges. While LLMs have not yet surpassed human performance, collaboration between humans and AI systems holds promise for the future of healthcare.
Affiliation(s)
- Simon C Williams
- Department of Neurosurgery, St George's University Hospital, London, UK
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- Joachim Starup-Hansen
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, UK
- Jonathan P Funnell
- Department of Neurosurgery, St George's University Hospital, London, UK
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- John Gerrard Hanrahan
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, UK
- Navneet Singh
- Department of Neurosurgery, St George's University Hospital, London, UK
- Saurabh Sinha
- Department of Neurosurgery, Sheffield Teaching Hospitals, Sheffield, UK
- William R Muirhead
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, UK
- Hani J Marcus
- Wellcome/EPSRC Centre for Interventional and Surgical Sciences, University College London, London, UK
- Department of Neurosurgery, National Hospital for Neurology and Neurosurgery, London, UK

28
Gritti MN, AlTurki H, Farid P, Morgan CT. Progression of an Artificial Intelligence Chatbot (ChatGPT) for Pediatric Cardiology Educational Knowledge Assessment. Pediatr Cardiol 2024; 45:309-313. [PMID: 38170274 DOI: 10.1007/s00246-023-03385-6]
Abstract
Artificial intelligence chatbots like ChatGPT have become powerful tools that are disrupting how humans interact with technology. The potential uses within medicine are vast. In medical education, these chatbots have shown improvement, within a short time span, on generalized medical examinations. We evaluated the overall performance of, and the improvement between, ChatGPT 3.5 and 4.0 on a test of pediatric cardiology knowledge. ChatGPT 3.5 and ChatGPT 4.0 were used to answer text-based multiple-choice questions derived from a Pediatric Cardiology Board Review textbook. Each chatbot was given an 88-question test, subcategorized into 11 topics. We excluded questions with modalities other than text (sound clips or images). Statistical analysis was done using an unpaired two-tailed t-test. Of the same 88 questions, ChatGPT 4.0 answered 66% correctly (58/88), significantly more (p < 0.0001) than ChatGPT 3.5, which answered only 38% (33/88). ChatGPT 4.0 also did better than ChatGPT 3.5 on every subspecialty topic. While acknowledging that ChatGPT does not yet offer subspecialty-level knowledge in pediatric cardiology, its performance on pediatric cardiology educational assessments showed a considerable improvement in a short period of time between versions 3.5 and 4.0.
Affiliation(s)
- Michael N Gritti
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON M5G 1X8, Canada
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
- Hussain AlTurki
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada
- Department of Pediatrics, The Hospital for Sick Children, Toronto, ON, Canada
- Pedrom Farid
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON M5G 1X8, Canada
- Schulich School of Medicine and Dentistry, University of Western Ontario, London, ON, Canada
- Conall T Morgan
- Division of Cardiology, The Labatt Family Heart Centre, The Hospital for Sick Children, 555 University Ave, Toronto, ON M5G 1X8, Canada
- Department of Pediatrics, University of Toronto, Toronto, ON, Canada

29
Wagner MW, Ertl-Wagner BB. Accuracy of Information and References Using ChatGPT-3 for Retrieval of Clinical Radiological Information. Can Assoc Radiol J 2024; 75:69-73. [PMID: 37078489 DOI: 10.1177/08465371231171125]
Abstract
Purpose: To assess the accuracy of answers provided by ChatGPT-3 when prompted with questions from the daily routine of radiologists, and to evaluate the text response when ChatGPT-3 was prompted to provide references for a given answer. Methods: ChatGPT-3 (OpenAI, San Francisco) is an artificial intelligence chatbot based on a large language model (LLM) designed to generate human-like text. A total of 88 questions were submitted to ChatGPT-3 as textual prompts, dispersed equally across 8 subspecialty areas of radiology. The responses provided by ChatGPT-3 were assessed for correctness by cross-checking them with peer-reviewed, PubMed-listed references. In addition, the references provided by ChatGPT-3 were evaluated for authenticity. Results: A total of 59 of 88 responses (67%) to radiological questions were correct, while 29 responses (33%) contained errors. Of the 343 references provided, only 124 (36.2%) could be found through an internet search, while 219 (63.8%) appeared to have been generated by ChatGPT-3. Of the 124 identified references, only 47 (37.9%) were considered to provide enough background to correctly answer the corresponding questions (24 questions, 37.5%). Conclusion: In this pilot study, ChatGPT-3 provided correct responses to questions from the daily clinical routine of radiologists in only about two-thirds of cases, while the remainder of the responses contained errors. The majority of the provided references could not be found, and only a minority of them contained the correct information to answer the question. Caution is advised when using ChatGPT-3 to retrieve radiological information.
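Checking whether model-supplied references exist, as done here by internet search, can be partially automated whenever a DOI is quoted; a minimal sketch against the public Crossref REST API follows. The example DOI and the pass/fail logic are illustrative.

```python
# Minimal sketch: test whether a DOI resolves to a real record via the public
# Crossref REST API (https://api.crossref.org). A 404 suggests, but does not
# prove, a fabricated reference. Example DOI and handling are illustrative.
import requests

def doi_exists(doi: str) -> bool:
    r = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return r.status_code == 200

candidate_dois = ["10.1177/08465371231171125"]  # DOIs extracted from answers
for doi in candidate_dois:
    print(doi, "found" if doi_exists(doi) else "not found in Crossref")
```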
Affiliation(s)
- Matthias W Wagner
- Department of Diagnostic Imaging, Division of Neuroradiology, The Hospital for Sick Children, Toronto, Canada
- Department of Medical Imaging, University of Toronto, Toronto, Canada
- Birgit B Ertl-Wagner
- Department of Diagnostic Imaging, Division of Neuroradiology, The Hospital for Sick Children, Toronto, Canada
- Department of Medical Imaging, University of Toronto, Toronto, Canada

30
Eppler M, Ganjavi C, Ramacciotti LS, Piazza P, Rodler S, Checcucci E, Gomez Rivas J, Kowalewski KF, Belenchón IR, Puliatti S, Taratkin M, Veccia A, Baekelandt L, Teoh JYC, Somani BK, Wroclawski M, Abreu A, Porpiglia F, Gill IS, Murphy DG, Canes D, Cacciamani GE. Awareness and Use of ChatGPT and Large Language Models: A Prospective Cross-sectional Global Survey in Urology. Eur Urol 2024; 85:146-153. [PMID: 37926642 DOI: 10.1016/j.eururo.2023.10.014]
Abstract
BACKGROUND Since its release in November 2022, ChatGPT has captivated society and shown potential for various aspects of health care. OBJECTIVE To investigate the potential use of ChatGPT, a large language model (LLM), in urology by gathering opinions from urologists worldwide. DESIGN, SETTING, AND PARTICIPANTS An open web-based survey was distributed via social media and e-mail chains to urologists between April 20, 2023 and May 5, 2023. Participants were asked to answer questions related to their knowledge of and experience with artificial intelligence, as well as their opinions on the potential use of ChatGPT/LLMs in research and clinical practice. OUTCOME MEASUREMENTS AND STATISTICAL ANALYSIS Data are reported as the mean and standard deviation for continuous variables, and the frequency and percentage for categorical variables. Charts and tables are used as appropriate, with descriptions of the chart types and the measures used. The data are reported in accordance with the Checklist for Reporting Results of Internet E-Surveys (CHERRIES). RESULTS AND LIMITATIONS A total of 456 individuals completed the survey (64% completion rate). Nearly half (47.7%) reported that they use ChatGPT/LLMs in their academic practice, with fewer using the technology in clinical practice (19.8%). More than half (62.2%) believe there are potential ethical concerns when using ChatGPT for scientific or academic writing, and 53% reported that they have experienced limitations when using ChatGPT in academic practice. CONCLUSIONS Urologists recognise the potential of ChatGPT/LLMs in research but have concerns regarding ethics and patient acceptance. There is a desire for regulations and guidelines to ensure appropriate use, and measures should be taken to establish rules that maximise safety and efficiency when using this novel technology. PATIENT SUMMARY A survey asked 456 urologists from around the world about using an artificial intelligence tool called ChatGPT in their work. Almost half of them use ChatGPT for research, but not many use it for patient care. The respondents think ChatGPT could be helpful, but they worry about problems like ethics and want rules to make sure it is used safely.
Affiliation(s)
- Michael Eppler
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Conner Ganjavi
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Lorenzo Storino Ramacciotti
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Pietro Piazza
- Division of Urology, IRCCS Azienda Ospedaliero-Universitaria di Bologna, Bologna, Italy
- Severin Rodler
- Department of Urology, Klinikum der Universität München, Munich, Germany
- Enrico Checcucci
- Department of Surgery, FPO-IRCCS Candiolo Cancer Institute, Candiolo, Italy
- Juan Gomez Rivas
- Department of Urology, Clinico San Carlos University Hospital, Madrid, Spain
- Karl F Kowalewski
- Department of Urology, University Medical Center Mannheim, Heidelberg University, Mannheim, Germany
- Ines Rivero Belenchón
- Urology and Nephrology Department, Virgen del Rocío University Hospital, Seville, Spain
- Stefano Puliatti
- Urology Department, University of Modena and Reggio Emilia, Modena, Italy
- Mark Taratkin
- Institute for Urology and Reproductive Health, Sechenov University, Moscow, Russia
- Alessandro Veccia
- Department of Urology, Azienda Ospedaliera Universitaria Integrata Verona, Verona, Italy
- Loïc Baekelandt
- Department of Urology, University Hospitals Leuven, Leuven, Belgium
- Jeremy Y-C Teoh
- Department of Surgery, S.H. Ho Urology Centre, The Chinese University of Hong Kong, Hong Kong, China
- Bhaskar K Somani
- University Hospital Southampton NHS Foundation Trust, Southampton, UK
- Marcelo Wroclawski
- Hospital Israelita Albert Einstein, São Paulo, Brazil; Beneficência Portuguesa de São Paulo, São Paulo, Brazil; Faculdade de Medicina do ABC, Santo Andre, Brazil
- Andre Abreu
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Inderbir S Gill
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA
- Declan G Murphy
- Division of Cancer Surgery, Peter MacCallum Cancer Centre, University of Melbourne, Melbourne, Australia
- David Canes
- Division of Urology, Lahey Hospital & Medical Center, Burlington, MA, USA
- Giovanni E Cacciamani
- USC Institute of Urology and Catherine and Joseph Aresty Department of Urology, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA; AI Center, USC Institute of Urology, University of Southern California, Los Angeles, CA, USA

31
Kapsali MZ, Livanis E, Tsalikidis C, Oikonomou P, Voultsos P, Tsaroucha A. Ethical Concerns About ChatGPT in Healthcare: A Useful Tool or the Tombstone of Original and Reflective Thinking? Cureus 2024; 16:e54759. [PMID: 38523987 PMCID: PMC10961144 DOI: 10.7759/cureus.54759]
Abstract
Artificial intelligence (AI), the rising technology of computer science that aims to create digital systems with human-like behavior and intelligence, seems to have invaded almost every field of modern life. Launched in November 2022, ChatGPT (Chat Generative Pre-trained Transformer) is a textual AI application capable of creating human-like responses characterized by original language and high coherence. Although AI-based language models have demonstrated impressive capabilities in healthcare, ChatGPT has drawn controversy in the scientific and academic communities. This chatbot already appears to have a massive impact as an educational tool for healthcare professionals, has transformative potential for clinical practice, and could lead to dramatic changes in scientific research. Nevertheless, rational concerns have been raised regarding whether pre-trained, AI-generated text is a menace not only to original thinking and new scientific ideas but also to academic and research integrity, as it becomes more and more difficult to recognize its AI origin owing to the coherence and fluency of the produced text. This short review aims to summarize the potential applications and the consequential implications of ChatGPT in the three critical pillars of medicine: education, research, and clinical practice. In addition, this paper discusses whether the current use of this chatbot complies with the ethical principles for the safe use of AI in healthcare, as determined by the World Health Organization. Finally, this review highlights the need for an updated ethical framework and increased vigilance from healthcare stakeholders to harvest the potential benefits and limit the imminent dangers of this new and innovative technology.
Affiliation(s)
- Marina Z Kapsali
- Postgraduate Program on Bioethics, Laboratory of Bioethics, Democritus University of Thrace, Alexandroupolis, GRC
- Efstratios Livanis
- Department of Accounting and Finance, University of Macedonia, Thessaloniki, GRC
- Christos Tsalikidis
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Panagoula Oikonomou
- Laboratory of Experimental Surgery, Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC
- Polychronis Voultsos
- Laboratory of Forensic Medicine & Toxicology (Medical Law and Ethics), School of Medicine, Faculty of Health Sciences, Aristotle University of Thessaloniki, Thessaloniki, GRC
- Aleka Tsaroucha
- Department of General Surgery, Democritus University of Thrace, Alexandroupolis, GRC

32
Cobanaj M, Corti C, Dee EC, McCullum L, Boldrini L, Schlam I, Tolaney SM, Celi LA, Curigliano G, Criscitiello C. Advancing equitable and personalized cancer care: Novel applications and priorities of artificial intelligence for fairness and inclusivity in the patient care workflow. Eur J Cancer 2024; 198:113504. [PMID: 38141549 PMCID: PMC11362966 DOI: 10.1016/j.ejca.2023.113504]
Abstract
Patient care workflows are highly multimodal and intertwined: the intersection of data outputs provided from different disciplines and in different formats remains one of the main challenges of modern oncology. Artificial Intelligence (AI) has the potential to revolutionize the current clinical practice of oncology owing to advancements in digitalization, database expansion, computational technologies, and algorithmic innovations that facilitate discernment of complex relationships in multimodal data. Within oncology, radiation therapy (RT) represents an increasingly complex working procedure, involving many labor-intensive and operator-dependent tasks. In this context, AI has gained momentum as a powerful tool to standardize treatment performance and reduce inter-observer variability in a time-efficient manner. This review explores the hurdles associated with the development, implementation, and maintenance of AI platforms and highlights current measures in place to address them. In examining AI's role in oncology workflows, we underscore that a thorough and critical consideration of these challenges is the only way to ensure equitable and unbiased care delivery, ultimately serving patients' survival and quality of life.
Affiliation(s)
- Marisa Cobanaj
- National Center for Radiation Research in Oncology, OncoRay, Helmholtz-Zentrum Dresden-Rossendorf, Dresden, Germany
- Chiara Corti
- Breast Oncology Program, Dana-Farber Brigham Cancer Center, Boston, MA, USA; Harvard Medical School, Boston, MA, USA; Division of New Drugs and Early Drug Development for Innovative Therapies, European Institute of Oncology, IRCCS, Milan, Italy; Department of Oncology and Hematology-Oncology (DIPO), University of Milan, Milan, Italy
- Edward C Dee
- Department of Radiation Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Lucas McCullum
- Department of Radiation Oncology, MD Anderson Cancer Center, Houston, TX, USA
- Laura Boldrini
- Division of New Drugs and Early Drug Development for Innovative Therapies, European Institute of Oncology, IRCCS, Milan, Italy; Department of Oncology and Hematology-Oncology (DIPO), University of Milan, Milan, Italy
- Ilana Schlam
- Department of Hematology and Oncology, Tufts Medical Center, Boston, MA, USA; Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Sara M Tolaney
- Breast Oncology Program, Dana-Farber Brigham Cancer Center, Boston, MA, USA; Harvard Medical School, Boston, MA, USA; Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Leo A Celi
- Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA; Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Giuseppe Curigliano
- Division of New Drugs and Early Drug Development for Innovative Therapies, European Institute of Oncology, IRCCS, Milan, Italy; Department of Oncology and Hematology-Oncology (DIPO), University of Milan, Milan, Italy
- Carmen Criscitiello
- Division of New Drugs and Early Drug Development for Innovative Therapies, European Institute of Oncology, IRCCS, Milan, Italy; Department of Oncology and Hematology-Oncology (DIPO), University of Milan, Milan, Italy

33
Jain N, Gottlich C, Fisher J, Campano D, Winston T. Assessing ChatGPT's orthopedic in-service training exam performance and applicability in the field. J Orthop Surg Res 2024; 19:27. [PMID: 38167093 PMCID: PMC10762835 DOI: 10.1186/s13018-023-04467-0]
Abstract
BACKGROUND ChatGPT has gained widespread attention for its ability to understand and provide human-like responses to inputs. However, few works have focused on its use in orthopedics. This study assessed ChatGPT's performance on the Orthopedic In-Service Training Exam (OITE) and evaluated its decision-making process to determine whether adoption as a resource in the field is practical. METHODS ChatGPT's performance on three OITE exams was evaluated by inputting multiple-choice questions. Questions were classified by their orthopedic subject area. Yearly OITE technical reports were used to gauge scores against those of resident physicians. ChatGPT's rationales were compared with the test-makers' explanations using six groups denoting answer accuracy and logic consistency. Variables were analyzed using contingency tables and chi-squared analyses. RESULTS Of 635 questions, 360 were usable as inputs (56.7%). ChatGPT-3.5 scored 55.8%, 47.7%, and 54% for the years 2020, 2021, and 2022, respectively. Of 190 correct outputs, 179 provided consistent logic (94.2%). Of 170 incorrect outputs, 133 provided inconsistent logic (78.2%). Significant associations were found between test topic and correct answers (p = 0.011) and between the type of logic used and the tested topic (p < 0.001). Basic Science and Sports had adjusted residuals greater than 1.96, as did the cells for Basic Science with correct answers/no logic, Basic Science with incorrect answers/inconsistent logic, Sports with correct answers/no logic, and Sports with incorrect answers/inconsistent logic. CONCLUSIONS Based on annual OITE technical reports for resident physicians, ChatGPT-3.5 performed at around the PGY-1 level. When answering correctly, it displayed reasoning congruent with that of the test-makers. When answering incorrectly, it exhibited some understanding of the correct answer. It outperformed in Basic Science and Sports, likely due to its ability to output rote facts. These findings suggest that ChatGPT lacks the fundamental capabilities to be a comprehensive tool in orthopedic surgery in its current form. LEVEL OF EVIDENCE II.
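The adjusted residuals referred to above (|value| > 1.96 flagging cells that deviate from independence) are computed directly from the observed and expected counts of a contingency table; a minimal sketch on an invented 2x2 table follows.

```python
# Minimal sketch: adjusted (standardized) residuals for a contingency table,
# with |residual| > 1.96 marking cells that deviate from independence at
# roughly the 5% level. The table values are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

obs = np.array([[30, 10],    # e.g. Basic Science: correct, incorrect
                [25, 25]])   # e.g. Sports:        correct, incorrect

chi2, p, dof, expected = chi2_contingency(obs)
n = obs.sum()
row_prop = obs.sum(axis=1, keepdims=True) / n
col_prop = obs.sum(axis=0, keepdims=True) / n
adjusted = (obs - expected) / np.sqrt(expected * (1 - row_prop) * (1 - col_prop))
print(np.round(adjusted, 2))
```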
Affiliation(s)
- Neil Jain
- Department of Orthopedic Surgery, Texas Tech University Health Sciences Center Lubbock, 3601 4th St, Lubbock, TX, 79430, USA
- Caleb Gottlich
- Department of Orthopedic Surgery, Texas Tech University Health Sciences Center Lubbock, 3601 4th St, Lubbock, TX, 79430, USA
- John Fisher
- Department of Orthopedic Surgery, Texas Tech University Health Sciences Center Lubbock, 3601 4th St, Lubbock, TX, 79430, USA
- Dominic Campano
- Department of Orthopedic Surgery, Texas Tech University Health Sciences Center Lubbock, 3601 4th St, Lubbock, TX, 79430, USA
- Travis Winston
- Department of Orthopedic Surgery, Texas Tech University Health Sciences Center Lubbock, 3601 4th St, Lubbock, TX, 79430, USA

34
Morales-Ramirez P, Mishek H, Dasgupta A. The Genie Is Out of the Bottle: What ChatGPT Can and Cannot Do for Medical Professionals. Obstet Gynecol 2024; 143:e1-e6. [PMID: 37944140 DOI: 10.1097/aog.0000000000005446]
Abstract
ChatGPT is a cutting-edge artificial intelligence technology that was released for public use in November 2022. Its rapid adoption has raised questions about capabilities, limitations, and risks. This article presents an overview of ChatGPT, and it highlights the current state of this technology for the medical field. The article seeks to provide a balanced perspective on what the model can and cannot do in three specific domains: clinical practice, research, and medical education. It also provides suggestions on how to optimize the use of this tool.
35
Mannstadt I, Mehta B. Large language models and the future of rheumatology: assessing impact and emerging opportunities. Curr Opin Rheumatol 2024; 36:46-51. [PMID: 37729050 DOI: 10.1097/bor.0000000000000981]
Abstract
PURPOSE OF REVIEW Large language models (LLMs) have grown rapidly in size and capabilities as more training data and compute power have become available. Since the release of ChatGPT in late 2022, there has been growing interest in and exploration of potential applications of LLM technology. Numerous examples and pilot studies demonstrating the capabilities of these tools have emerged across several domains. For rheumatology professionals and patients, LLMs have the potential to transform current practices in medicine. RECENT FINDINGS Recent studies have begun exploring capabilities of LLMs that can assist rheumatologists in clinical practice, research, and medical education, though applications are still emerging. In clinical settings, LLMs have shown promise in assisting healthcare professionals, enabling more personalized medicine or generating routine documentation such as notes and letters. Challenges remain around integrating LLMs into clinical workflows, ensuring their accuracy, and protecting patient data confidentiality. In research, early experiments demonstrate that LLMs can offer analysis of datasets, with quality control as a critical piece. Lastly, LLMs could supplement medical education by providing personalized learning experiences and integrating into established curricula. SUMMARY As these powerful tools continue to evolve at a rapid pace, rheumatology professionals should stay informed about how they may impact the field.
Affiliation(s)
- Bella Mehta
- Weill Cornell Medicine
- Hospital for Special Surgery, New York, New York, USA

36
Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study. J Med Internet Res 2023; 25:e51580. [PMID: 38009003 PMCID: PMC10784979 DOI: 10.2196/51580]
Abstract
BACKGROUND The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy. OBJECTIVE This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry. METHODS The LLMs were queried with 20 open-type, clinical dentistry-related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. Using a rubric, 2 experienced faculty members graded the LLMs' answers from 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, as if they were examination answers given by students. The scores were statistically compared to identify the best-performing model, using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs' answers. RESULTS Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate. CONCLUSIONS This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if they are not used judiciously. Therefore, these tools should not replace the dentist's critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as imprudent use could impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies.
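Both tests named above are available in SciPy; the sketch below shows the analysis pattern on invented per-question rubric scores for the 4 models.

```python
# Minimal sketch: Friedman test across 4 models' per-question rubric scores,
# followed by one pairwise Wilcoxon signed-rank test, mirroring the analysis
# above. All scores are invented (20 questions, 0-10 scale).
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
chatgpt4 = rng.integers(6, 11, 20)   # hypothetical scores per question
chatgpt35 = rng.integers(4, 10, 20)
bing_chat = rng.integers(4, 10, 20)
bard = rng.integers(3, 9, 20)

stat, p = friedmanchisquare(chatgpt4, chatgpt35, bing_chat, bard)
print(f"Friedman: chi2 = {stat:.2f}, p = {p:.4f}")

w, p_pair = wilcoxon(chatgpt4, chatgpt35)  # post hoc pairwise comparison
print(f"Wilcoxon ChatGPT-4 vs ChatGPT-3.5: W = {w:.1f}, p = {p_pair:.4f}")
```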
Affiliation(s)
- Argyro Kavadella
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- Anas Aaqel Salim
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- Vassilis Stamatopoulos
- Information Management Systems Institute, ATHENA Research and Innovation Center, Athens, Greece
- Eleftherios G Kaklamanos
- School of Dentistry, European University Cyprus, Nicosia, Cyprus
- School of Dentistry, Aristotle University of Thessaloniki, Thessaloniki, Greece
- Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, United Arab Emirates

37
Koranteng E, Rao A, Flores E, Lev M, Landman A, Dreyer K, Succi M. Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care. JMIR Medical Education 2023; 9:e51199. [PMID: 38153778 PMCID: PMC10884892 DOI: 10.2196/51199]
Abstract
The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment.
Affiliation(s)
- Arya Rao
- Harvard Medical School, Boston, MA, United States
- Efren Flores
- Harvard Medical School, Boston, MA, United States
- Michael Lev
- Harvard Medical School, Boston, MA, United States
- Adam Landman
- Harvard Medical School, Boston, MA, United States
- Keith Dreyer
- Harvard Medical School, Boston, MA, United States
- Marc Succi
- Massachusetts General Hospital, Boston, United States

38
Ćirković A, Katz T. Exploring the Potential of ChatGPT-4 in Predicting Refractive Surgery Categorizations: Comparative Study. JMIR Form Res 2023; 7:e51798. [PMID: 38153777 PMCID: PMC10784977 DOI: 10.2196/51798]
Abstract
BACKGROUND Refractive surgery research aims to optimally precategorize patients by their suitability for the various types of surgery. Recent advances have led to the development of artificial intelligence-powered algorithms, including machine learning approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 (OpenAI LP) have emerged as potential general artificial intelligence tools that can assist across various disciplines, possibly including refractive surgery decision-making. However, their actual capabilities in precategorizing refractive surgery patients based on real-world parameters remain unexplored. OBJECTIVE This exploratory study aimed to validate ChatGPT-4's capabilities in precategorizing refractive surgery patients based on commonly used clinical parameters. The goal was to assess whether ChatGPT-4's performance when categorizing batch inputs is comparable to that of a refractive surgeon. A simple binary set of categories (patient suitable for laser refractive surgery or not) as well as a more detailed set were compared. METHODS Data from 100 consecutive patients from a refractive clinic were anonymized and analyzed. Parameters included age, sex, manifest refraction, visual acuity, and various corneal measurements and indices from Scheimpflug imaging. The study compared ChatGPT-4's performance with a clinician's categorizations using the Cohen κ coefficient, a chi-square test, a confusion matrix, accuracy, precision, recall, the F1-score, and the area under the receiver operating characteristic curve. RESULTS A statistically significant, non-coincidental agreement was found between ChatGPT-4's and the clinician's categorizations, with a Cohen κ coefficient of 0.399 for 6 categories (95% CI 0.256-0.537) and 0.610 for binary categorization (95% CI 0.372-0.792). However, the model showed temporal instability and response variability. The chi-square test on the 6 categories indicated an association between the 2 raters' distributions (χ²5=94.7, P<.001). Here, the accuracy was 0.68, precision 0.75, recall 0.68, and F1-score 0.70. For the 2 categories, the accuracy was 0.88, precision 0.88, recall 0.88, F1-score 0.88, and area under the curve 0.79. CONCLUSIONS This study revealed that ChatGPT-4 exhibits potential as a precategorization tool in refractive surgery, showing promising agreement with clinician categorizations. Its main limitations include, among others, dependence on a single human rater, the small sample size, the instability and variability of ChatGPT's (OpenAI LP) output between iterations, and the non-transparency of the underlying models. The results encourage further exploration of the application of LLMs like ChatGPT-4 in health care, particularly in decision-making processes that require understanding of vast clinical data. Future research should focus on defining the model's accuracy with prompt and vignette standardization, detecting confounding factors, and comparing it with other versions of ChatGPT-4 and other LLMs, to pave the way for larger-scale validation and real-world implementation.
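All of the agreement statistics listed above map onto standard scikit-learn calls; the sketch below demonstrates them on invented binary suitability labels (1 = suitable for laser refractive surgery).

```python
# Minimal sketch: Cohen kappa, confusion matrix, accuracy, precision, recall,
# F1, and ROC AUC with scikit-learn. The labels are invented for illustration,
# with the clinician's categorization treated as the reference.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score, roc_auc_score)

clinician = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1]  # reference
chatgpt4  = [1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1]  # model output

print("kappa:    ", cohen_kappa_score(clinician, chatgpt4))
print("confusion:\n", confusion_matrix(clinician, chatgpt4))
print("accuracy: ", accuracy_score(clinician, chatgpt4))
print("precision:", precision_score(clinician, chatgpt4))
print("recall:   ", recall_score(clinician, chatgpt4))
print("F1:       ", f1_score(clinician, chatgpt4))
print("ROC AUC:  ", roc_auc_score(clinician, chatgpt4))
```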
Affiliation(s)
- Toam Katz
- Department of Ophthalmology, University Medical Center Hamburg-Eppendorf, Hamburg, Germany

39
Lukac S, Dayan D, Fink V, Leinert E, Hartkopf A, Veselinovic K, Janni W, Rack B, Pfister K, Heitmeir B, Ebner F. Evaluating ChatGPT as an adjunct for the multidisciplinary tumor board decision-making in primary breast cancer cases. Arch Gynecol Obstet 2023; 308:1831-1844. [PMID: 37458761 PMCID: PMC10579162 DOI: 10.1007/s00404-023-07130-5] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Accepted: 06/27/2023] [Indexed: 10/17/2023]
Abstract
BACKGROUND As the available information about breast cancer grows daily, the decision-making process for therapy becomes more complex. ChatGPT, a transformer-based language model, can write scientific articles and pass medical exams. But can it support the multidisciplinary tumor board (MDT) in planning the therapy of patients with breast cancer? MATERIAL AND METHODS We performed a pilot study on 10 consecutive cases of breast cancer patients discussed in the MDT at our department in January 2023. Included were patients with a primary diagnosis of early breast cancer. The MDT's recommendation for each patient was compared with ChatGPT's recommendation, and a clinical agreement score was calculated. RESULTS ChatGPT provided mostly general answers regarding breast surgery, radiation therapy, chemotherapy, and antibody therapy. It was able to identify risk factors for hereditary breast cancer and to flag the elderly patient with an indication for chemotherapy for an evaluation of the cost/benefit effect. ChatGPT wrongly identified patients with HER2 1+ and 2+ (FISH-negative) status as needing antibody therapy and referred to endocrine therapy as "hormonal treatment". CONCLUSIONS In a time of rapidly expanding information, artificial intelligence support for finding individualized, personalized therapy for our patients is still seeking its place in clinical routine. ChatGPT has the potential to find its spot in clinical medicine, but the current version is not able to provide specific recommendations for the therapy of patients with primary breast cancer.
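The clinical agreement score described here reduces to a concordance proportion over the 10 cases. A minimal sketch, with hypothetical per-case agreement flags (not the study's data) and a Wilson 95% confidence interval:

```python
# Sketch: overall concordance between MDT and ChatGPT recommendations.
# The per-case agreement flags below are hypothetical.
import math

agree = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]   # 1 = ChatGPT matched the MDT recommendation
n, k = len(agree), sum(agree)
p_hat = k / n

# Wilson 95% confidence interval for a proportion
z = 1.96
denom = 1 + z**2 / n
centre = (p_hat + z**2 / (2 * n)) / denom
half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
print(f"concordance = {p_hat:.0%}, 95% CI [{centre - half:.0%}, {centre + half:.0%}]")
```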
Affiliation(s)
- Stefan Lukac
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Davut Dayan
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Visnja Fink
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Elena Leinert
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Andreas Hartkopf
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Kristina Veselinovic
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Wolfgang Janni
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Brigitte Rack
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Kerstin Pfister
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Benedikt Heitmeir
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Florian Ebner
- Department of Gynecology and Obstetrics, University Hospital Ulm, Prittwitzstr. 43, 89075, Ulm, Germany
- Gynäkologische Gemeinschaftspraxis Freising & Moosburg, Munich, Germany

40
Parker JL, Becker K, Carroca C. ChatGPT for Automated Writing Evaluation in Scholarly Writing Instruction. J Nurs Educ 2023; 62:721-727. [PMID: 38049299 DOI: 10.3928/01484834-20231006-02] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/06/2023]
Abstract
BACKGROUND Effective strategies for developing scholarly writing skills in postsecondary nursing students are needed. Generative artificial intelligence (GAI) tools, such as ChatGPT, used for automated writing evaluation (AWE) hold promise for mitigating challenges associated with scholarly writing instruction in nursing education. This article explores the suitability of ChatGPT for AWE in writing instruction. METHOD ChatGPT feedback on 42 nursing student texts from the Michigan Corpus of Upper-Level Student Papers was assessed. Assessment criteria were derived from recent AWE research. RESULTS ChatGPT demonstrated utility as an AWE tool. It graded more strictly than human raters, related its feedback to macro-level writing features, and supported multiple submissions and learner autonomy. CONCLUSION Despite concerns surrounding GAI in academia, educators can accelerate the feedback process without increasing their workload, and students can receive individualized feedback, by incorporating AWE provided by ChatGPT into the writing process. [J Nurs Educ. 2023;62(12):721-727.].
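The "stricter grading" finding is, at bottom, a paired comparison of AI and human scores on the same texts. A minimal sketch with hypothetical paired scores (the study does not publish its raw scores):

```python
# Sketch: paired comparison of human and ChatGPT scores on the same texts,
# to probe a "stricter grading" claim. All scores below are hypothetical.
from scipy.stats import wilcoxon

human   = [85, 78, 90, 72, 88, 81, 76, 93, 69, 84]
chatgpt = [80, 74, 85, 70, 82, 79, 71, 88, 66, 80]

stat, p = wilcoxon(human, chatgpt)   # paired, non-parametric signed-rank test
diff = sum(h - c for h, c in zip(human, chatgpt)) / len(human)
print(f"mean difference (human - AI) = {diff:.1f} points, Wilcoxon p = {p:.3f}")
```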
41
Liu H, Azam M, Bin Naeem S, Faiola A. An overview of the capabilities of ChatGPT for medical writing and its implications for academic integrity. Health Info Libr J 2023; 40:440-446. [PMID: 37806782 DOI: 10.1111/hir.12509] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/21/2023] [Accepted: 09/25/2023] [Indexed: 10/10/2023]
Abstract
The artificial intelligence (AI) tool ChatGPT, which is based on a large language model (LLM), is gaining popularity in academic institutions, notably in the medical field. This article provides a brief overview of ChatGPT's capabilities for medical writing and their implications for academic integrity. It lists AI generative tools, common uses of such tools in medical writing, and AI-generated-text detection tools, and it offers recommendations to policymakers, information professionals, and medical faculty on the constructive use of AI generative tools and related technology. It also highlights the role of health sciences librarians and educators in discouraging students from submitting ChatGPT-generated text as their own academic work.
Affiliation(s)
- Huihui Liu
- Shanxi University, Xiaodian District, Taiyuan, People's Republic of China
- Mehreen Azam
- Department of Information Management, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Salman Bin Naeem
- Department of Information Management, The Islamia University of Bahawalpur, Bahawalpur, Pakistan
- Anthony Faiola
- Department of Health and Clinical Sciences, College of Health Sciences, University of Kentucky, Lexington, Kentucky, USA

42
Bagde H, Dhopte A, Alam MK, Basri R. A systematic review and meta-analysis on ChatGPT and its utilization in medical and dental research. Heliyon 2023; 9:e23050. [PMID: 38144348 PMCID: PMC10746423 DOI: 10.1016/j.heliyon.2023.e23050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2023] [Revised: 10/24/2023] [Accepted: 11/24/2023] [Indexed: 12/26/2023] Open
Abstract
Since its release, ChatGPT has taken the world by storm, finding uses in many fields. This review's main goal was to offer a thorough, fact-based evaluation of ChatGPT's potential as a tool for medical and dental research, to direct subsequent research and inform clinical practice. METHODS Several online databases were searched for relevant articles in accordance with the study objectives. A team of reviewers devised a methodological framework for the inclusion of articles and the meta-analysis. RESULTS Eleven descriptive studies were included that evaluated the accuracy of ChatGPT in answering medical queries across domains such as systematic reviews, cancer, liver diseases, diagnostic imaging, education, and COVID-19 vaccination. The studies reported accuracy ranging from 18.3% to 100% across datasets and specialties. The meta-analysis showed an odds ratio (OR) of 2.25 and a relative risk (RR) of 1.47 (with 95% confidence intervals), indicating that the accuracy of ChatGPT in providing correct responses was significantly higher relative to the total responses for queries. However, significant heterogeneity was present among the studies, suggesting considerable variability in effect sizes across the included studies. CONCLUSION The observations indicate that ChatGPT can provide appropriate answers to questions in the medical and dental areas, but researchers and clinicians should assess its responses cautiously because they may not always be dependable. Overall, the importance of this study rests in shedding light on ChatGPT's accuracy in the medical and dental fields and emphasizing the need for additional investigation to enhance its performance.
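A pooled OR like the one reported is typically obtained by inverse-variance pooling of per-study log odds ratios. A minimal fixed-effect sketch on hypothetical 2×2 counts (correct vs. incorrect responses per study; not the review's data):

```python
# Sketch: inverse-variance (fixed-effect) pooling of log odds ratios,
# the kind of computation behind a pooled OR. Counts are hypothetical.
import math

# (correct_A, incorrect_A, correct_B, incorrect_B) per study
studies = [(45, 5, 38, 12), (30, 10, 22, 18), (50, 14, 40, 24)]

num = den = 0.0
for a, b, c, d in studies:
    log_or = math.log((a * d) / (b * c))
    var = 1/a + 1/b + 1/c + 1/d          # variance of the log OR
    w = 1 / var                          # inverse-variance weight
    num += w * log_or
    den += w

pooled = math.exp(num / den)
se = math.sqrt(1 / den)
lo, hi = math.exp(num / den - 1.96 * se), math.exp(num / den + 1.96 * se)
print(f"pooled OR = {pooled:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

Given the significant heterogeneity the authors report, a random-effects model (e.g., DerSimonian-Laird) would be the more defensible choice in practice; the fixed-effect version is shown only because it is the simplest to follow.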
Affiliation(s)
- Hiroj Bagde
- Department of Periodontology, Chhattisgarh Dental College and Research Institute, Rajnandgaon, Chhattisgarh, India
- Ashwini Dhopte
- Department of Oral Medicine and Radiology, Chhattisgarh Dental College and Research Institute, Rajnandgaon, Chhattisgarh, India
- Mohammad Khursheed Alam
- Preventive Dentistry Department, College of Dentistry, Jouf University, Sakaka, 72345, Saudi Arabia
- Department of Dental Research Cell, Saveetha Dental College and Hospitals, Saveetha Institute of Medical and Technical Sciences, Chennai, India
- Department of Public Health, Faculty of Allied Health Sciences, Daffodil International University, Dhaka, Bangladesh
- Rehana Basri
- Department of Internal Medicine, College of Medicine, Jouf University, Sakaka, 72345, Saudi Arabia

43
Morita P, Abhari S, Kaur J. Do ChatGPT and Other Artificial Intelligence Bots Have Applications in Health Policy-Making? Opportunities and Threats. Int J Health Policy Manag 2023; 12:8131. [PMID: 38618768 PMCID: PMC10843407 DOI: 10.34172/ijhpm.2023.8131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2023] [Accepted: 10/16/2023] [Indexed: 04/16/2024] Open
Affiliation(s)
- Plinio Morita
- School of Public Health Sciences, University of Waterloo, Waterloo, ON, Canada
- Department of Systems Design Engineering, University of Waterloo, Waterloo, ON, Canada
- Research Institute for Aging, University of Waterloo, Waterloo, ON, Canada
- Centre for Digital Therapeutics, Techna Institute, University Health Network, Toronto, ON, Canada
- Institute of Health Policy, Management, and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Shahabeddin Abhari
- School of Public Health Sciences, University of Waterloo, Waterloo, ON, Canada
- Jasleen Kaur
- School of Public Health Sciences, University of Waterloo, Waterloo, ON, Canada

44
Song H, Xia Y, Luo Z, Liu H, Song Y, Zeng X, Li T, Zhong G, Li J, Chen M, Zhang G, Xiao B. Evaluating the Performance of Different Large Language Models on Health Consultation and Patient Education in Urolithiasis. J Med Syst 2023; 47:125. [PMID: 37999899 DOI: 10.1007/s10916-023-02021-3] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2023] [Accepted: 11/14/2023] [Indexed: 11/25/2023]
Abstract
OBJECTIVES To evaluate the effectiveness of four large language models (LLMs) with large user bases and significant public attention (Claude, Bard, ChatGPT4, and New Bing) in the context of medical consultation and patient education in urolithiasis. MATERIALS AND METHODS We developed a questionnaire consisting of 21 questions and 2 clinical scenarios related to urolithiasis. Clinical consultations were then simulated with each of the four models to assess their responses. Urolithiasis experts evaluated the responses in terms of accuracy, comprehensiveness, ease of understanding, human care, and clinical case analysis ability on a predesigned 5-point Likert scale. Visualization and statistical analyses were employed to compare the four models and evaluate their performance. RESULTS All models yielded satisfactory performance except Bard, which failed to provide a valid response to Question 13. Claude consistently scored highest in all dimensions compared with the other three models. ChatGPT4 ranked second in accuracy, with relatively stable output across multiple tests, but showed shortcomings in empathy and human care. Bard exhibited the lowest accuracy and overall performance. Claude and ChatGPT4 both showed a high capacity to analyze clinical cases of urolithiasis. Overall, Claude emerged as the best performer in urolithiasis consultations and education. CONCLUSION Claude demonstrated superior performance to the other three models in urolithiasis consultation and education. This study highlights the remarkable potential of LLMs in medical health consultations and patient education, although professional review, further evaluation, and modifications are still required.
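Comparing several models on expert Likert ratings is a standard multi-group comparison. A minimal sketch with hypothetical 5-point ratings (the column name and values are illustrative, not the study's data):

```python
# Sketch: comparing expert Likert ratings across four LLMs with a
# Kruskal-Wallis test. All ratings below are hypothetical.
import pandas as pd
from scipy.stats import kruskal

ratings = pd.DataFrame({
    "model": ["Claude"] * 5 + ["ChatGPT4"] * 5 + ["Bard"] * 5 + ["NewBing"] * 5,
    "accuracy": [5, 5, 4, 5, 4,   4, 5, 4, 4, 4,   3, 2, 3, 3, 2,   4, 3, 4, 3, 4],
})

print(ratings.groupby("model")["accuracy"].mean())       # mean rating per model
groups = [g["accuracy"].values for _, g in ratings.groupby("model")]
stat, p = kruskal(*groups)                                # non-parametric omnibus test
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.3f}")
```

Ordinal Likert data is why a rank-based test is sketched here rather than ANOVA; the study itself only states that "statistical analyses were employed".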
Affiliation(s)
- Haifeng Song
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Yi Xia
- Department of Urology, Zhongda Hospital, Southeast University, 87 Dingjiaqiao, Nanjing, 210009, China
- School of Medicine, Southeast University, Nanjing, 210009, China
- Zhichao Luo
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Hui Liu
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Yan Song
- Department of Urology, Sheng Jing Hospital of China Medical University, Shenyang, 110000, China
- Xue Zeng
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Tianjie Li
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Guangxin Zhong
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Jianxing Li
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China
- Ming Chen
- Department of Urology, Zhongda Hospital, Southeast University, 87 Dingjiaqiao, Nanjing, 210009, China
- Guangyuan Zhang
- Department of Urology, Zhongda Hospital, Southeast University, 87 Dingjiaqiao, Nanjing, 210009, China
- Bo Xiao
- Department of Urology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, 168 Litang Rd, Beijing, 102218, China
- Institute of Urology, School of Clinical Medicine, Tsinghua University, Beijing, 102218, China

45
Rosoł M, Gąsior JS, Łaba J, Korzeniewski K, Młyńczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish Medical Final Examination. Sci Rep 2023; 13:20512. [PMID: 37993519 PMCID: PMC10665355 DOI: 10.1038/s41598-023-46995-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Accepted: 11/07/2023] [Indexed: 11/24/2023] Open
Abstract
The study aimed to evaluate the performance of two large language models (LLMs), ChatGPT (based on GPT-3.5) and GPT-4, with two temperature parameter values on the Polish Medical Final Examination (MFE). The models were tested on three editions of the MFE (Spring 2022, Autumn 2022, and Spring 2023) in two language versions, English and Polish. The accuracies of the two models were compared, and the relationship between answer correctness and question metrics was investigated. GPT-4 outperformed GPT-3.5 in all three examinations regardless of the language used, achieving mean accuracies of 79.7% for both the Polish and English versions and passing all MFE versions. GPT-3.5 had mean accuracies of 54.8% for Polish and 60.3% for English; it passed none of the three Polish versions at temperature 0 and two of three at temperature 1, while passing all English versions regardless of the temperature value. GPT-4's scores were nonetheless mostly lower than the average score of medical students. There was a statistically significant correlation between the correctness of the answers and the index of difficulty for both models. The overall accuracy of both models was still suboptimal and worse than the average for medical students, which emphasizes the need for further improvement of LLMs before they can be reliably deployed in medical settings. Even so, these findings suggest a growing potential for the use of LLMs in medical education.
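Two parts of this analysis lend themselves to a short sketch: scoring accuracy per temperature setting and correlating per-question correctness with an item difficulty index. All identifiers and data below are hypothetical:

```python
# Sketch: accuracy per temperature setting and point-biserial correlation
# of correctness with an item difficulty index. Data are hypothetical.
from scipy.stats import pointbiserialr

# (question_id, temperature, model_answer, correct_answer, difficulty_index)
runs = [
    (1, 0.0, "B", "B", 0.82), (2, 0.0, "D", "C", 0.41), (3, 0.0, "A", "A", 0.66),
    (1, 1.0, "B", "B", 0.82), (2, 1.0, "C", "C", 0.41), (3, 1.0, "D", "A", 0.66),
]

for temp in (0.0, 1.0):
    subset = [r for r in runs if r[1] == temp]
    acc = sum(r[2] == r[3] for r in subset) / len(subset)
    print(f"temperature={temp}: accuracy={acc:.0%}")

correct = [int(r[2] == r[3]) for r in runs]    # binary correctness per run
difficulty = [r[4] for r in runs]              # index of difficulty per item
r, p = pointbiserialr(correct, difficulty)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```

The point-biserial coefficient is the natural choice for a binary variable (correct/incorrect) against a continuous one (difficulty), though the paper does not state which correlation measure it used.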
Affiliation(s)
- Maciej Rosoł
- Faculty of Mechatronics, Institute of Metrology and Biomedical Engineering, Warsaw University of Technology, Boboli 8 Street, 02-525, Warsaw, Poland
- Jakub S Gąsior
- Department of Pediatric Cardiology and General Pediatrics, Medical University of Warsaw, Warsaw, Poland
- Jonasz Łaba
- Faculty of Mechatronics, Institute of Metrology and Biomedical Engineering, Warsaw University of Technology, Boboli 8 Street, 02-525, Warsaw, Poland
- Kacper Korzeniewski
- Faculty of Mechatronics, Institute of Metrology and Biomedical Engineering, Warsaw University of Technology, Boboli 8 Street, 02-525, Warsaw, Poland
- Marcel Młyńczak
- Faculty of Mechatronics, Institute of Metrology and Biomedical Engineering, Warsaw University of Technology, Boboli 8 Street, 02-525, Warsaw, Poland

46
Gödde D, Nöhl S, Wolf C, Rupert Y, Rimkus L, Ehlers J, Breuckmann F, Sellmann T. A SWOT (Strengths, Weaknesses, Opportunities, and Threats) Analysis of ChatGPT in the Medical Literature: Concise Review. J Med Internet Res 2023; 25:e49368. [PMID: 37865883 PMCID: PMC10690535 DOI: 10.2196/49368] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Revised: 09/26/2023] [Accepted: 09/27/2023] [Indexed: 10/23/2023] Open
Abstract
BACKGROUND ChatGPT is a 175-billion-parameter natural language processing model that is already involved in scientific content and publications. Its influence ranges from providing quick access to information on medical topics and assisting in generating medical and scientific articles and papers to performing medical data analyses and even interpreting complex data sets. OBJECTIVE The future role of ChatGPT was uncertain and a matter of debate from shortly after its release. This review aimed to analyze the role of ChatGPT in the medical literature during the first 3 months after its release. METHODS We performed a concise review of literature published in PubMed from December 1, 2022, to March 31, 2023. To find all publications related to or considering ChatGPT, the search term was kept simple ("ChatGPT" in AllFields). All publications available as full text in German or English were included. All accessible publications were evaluated according to specifications by the author team (eg, impact factor, publication mode, article type, publication speed, and type of ChatGPT integration or content). The conclusions of the articles were used for a subsequent SWOT (strengths, weaknesses, opportunities, and threats) analysis. All data were analyzed on a descriptive basis. RESULTS Of 178 studies in total, 160 met the inclusion criteria and were evaluated. The average impact factor was 4.423 (range 0-96.216), and the average publication speed was 16 days (range 0-83 days). Among the articles were 77 editorials (48.1%), 43 essays (26.9%), 21 studies (13.1%), 6 reviews (3.8%), 6 case reports (3.8%), 6 news items (3.8%), and 1 meta-analysis (0.6%). Of those, 54.4% (n=87) were published open access, with 5% (n=8) provided on preprint servers. Over 400 quotes with information on strengths, weaknesses, opportunities, and threats were detected; by far the largest share (n=142, 34.8%) related to weaknesses. ChatGPT excels in its ability to express ideas clearly and formulate general contexts comprehensibly. It performs so well that even experts in the field have difficulty identifying abstracts generated by ChatGPT. However, the time-limited scope of its knowledge and the need for corrections by experts were cited as weaknesses and threats. Opportunities include assistance in formulating medical issues for nonnative English speakers, as well as the possibility of timely participation in the development of such artificial intelligence tools, since they are in their early stages and can therefore still be influenced. CONCLUSIONS Artificial intelligence tools such as ChatGPT are already part of the medical publishing landscape. Despite their apparent opportunities, policies and guidelines must be implemented to ensure benefits in education, clinical practice, and research and to protect against threats such as scientific misconduct, plagiarism, and inaccuracy.
Affiliation(s)
- Daniel Gödde
- Department of Pathology and Molecularpathology, Helios University Hospital Wuppertal, Witten/Herdecke University, Witten, Germany
- Sophia Nöhl
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Carina Wolf
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Yannick Rupert
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Lukas Rimkus
- Faculty of Health, Witten/Herdecke University, Witten, Germany
- Jan Ehlers
- Department of Didactics and Education Research in the Health Sector, Faculty of Health, Witten/Herdecke University, Witten, Germany
- Frank Breuckmann
- Department of Cardiology and Vascular Medicine, West German Heart and Vascular Center Essen, University Duisburg-Essen, Essen, Germany
- Department of Cardiology, Pneumology, Neurology and Intensive Care Medicine, Klinik Kitzinger Land, Kitzingen, Germany
- Timur Sellmann
- Department of Anaesthesiology I, Witten/Herdecke University, Witten, Germany
- Department of Anaesthesiology and Intensive Care Medicine, Evangelisches Krankenhaus BETHESDA zu Duisburg, Duisburg, Germany

47
Chen TC, Multala E, Kearns P, Delashaw J, Dumont A, Maraganore D, Wang A. Assessment of ChatGPT's performance on neurology written board examination questions. BMJ Neurol Open 2023; 5:e000530. [PMID: 37936648 PMCID: PMC10626870 DOI: 10.1136/bmjno-2023-000530] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2023] [Accepted: 10/19/2023] [Indexed: 11/09/2023] Open
Abstract
Background and objectives ChatGPT has shown promise in healthcare. To assess the utility of this novel tool in healthcare education, we evaluated ChatGPT's performance in answering neurology board exam questions. Methods Neurology board-style examination questions were accessed from BoardVitals, a commercial neurology question bank. ChatGPT was provided the full question prompt and multiple answer choices and was given up to three attempts to select the correct answer. A total of 560 questions (14 blocks of 40 questions) were used, although image-based questions were disregarded because of ChatGPT's inability to process visual input. The artificial intelligence (AI) answers were then compared with human user data provided by the question bank to gauge performance. Results Of 509 eligible questions over 14 question blocks, ChatGPT correctly answered 335 (65.8%) on the first attempt and 383 (75.3%) within three attempts, scoring at approximately the 26th and 50th percentiles, respectively. The highest-performing subjects were pain (100%), epilepsy & seizures (85%), and genetics (82%); the lowest-performing were imaging/diagnostic studies (27%), critical care (41%), and cranial nerves (48%). Discussion This study found that ChatGPT performed similarly to its human counterparts. The accuracy of the AI increased with multiple attempts, and performance fell within the expected range of neurology resident learners. This study demonstrates ChatGPT's potential in processing specialised medical information. Future studies should better define the extent to which AI can be integrated into medical decision-making.
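The first-attempt versus within-three-attempts scoring reduces to a small grading loop over attempt logs. A sketch with hypothetical stand-in data (the study's per-question logs are not published):

```python
# Sketch: first-attempt vs. best-of-three accuracy from attempt logs.
# The logs below are hypothetical stand-ins for the question-bank data.
logs = {
    # question_id: (correct_answer, [attempt1, attempt2, attempt3])
    "q1": ("C", ["C"]),
    "q2": ("A", ["B", "A"]),
    "q3": ("D", ["B", "C", "A"]),
    "q4": ("B", ["B"]),
}

first = sum(attempts[0] == ans for ans, attempts in logs.values())
within3 = sum(ans in attempts[:3] for ans, attempts in logs.values())
n = len(logs)
print(f"first attempt: {first}/{n} ({first / n:.0%})")
print(f"within three attempts: {within3}/{n} ({within3 / n:.0%})")
```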
Affiliation(s)
- Tse Chian Chen
- Neurology, Tulane University School of Medicine, New Orleans, Louisiana, USA
- Evan Multala
- Tulane University School of Medicine, New Orleans, Louisiana, USA
- Patrick Kearns
- Tulane University School of Medicine, New Orleans, Louisiana, USA
- Johnny Delashaw
- Neurosurgery, Tulane University School of Medicine, New Orleans, Louisiana, USA
- Aaron Dumont
- Neurosurgery, Tulane University School of Medicine, New Orleans, Louisiana, USA
- Arthur Wang
- Neurosurgery, Tulane University School of Medicine, New Orleans, Louisiana, USA

48
Abu-Farha R, Fino L, Al-Ashwal FY, Zawiah M, Gharaibeh L, Harahsheh MM, Darwish Elhajji F. Evaluation of community pharmacists' perceptions and willingness to integrate ChatGPT into their pharmacy practice: A study from Jordan. J Am Pharm Assoc (2003) 2023; 63:1761-1767.e2. [PMID: 37648157 DOI: 10.1016/j.japh.2023.08.020] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2023] [Revised: 08/10/2023] [Accepted: 08/22/2023] [Indexed: 09/01/2023]
Abstract
OBJECTIVES This study aimed to examine the extent of community pharmacists' awareness of Chat Generative Pretrained Transformer (ChatGPT), their willingness to embrace this new development in artificial intelligence (AI), and the barriers to incorporating this nonconventional source of information into pharmacy practice. METHODS A cross-sectional study was conducted among community pharmacists in Jordanian cities between April 26, 2023, and May 10, 2023. Convenience and snowball sampling techniques were used to select study participants owing to resource and time constraints. The questionnaire was distributed by research assistants through popular social media platforms. Logistic regression analysis was used to assess predictors of willingness to use this service in the future. RESULTS A total of 221 community pharmacists participated (a response rate was not calculated because opt-in recruitment strategies were used). Nearly half of the pharmacists (n = 107, 48.4%) indicated a willingness to incorporate ChatGPT into their pharmacy practice. A similar proportion (n = 105, 47.5%) demonstrated a high perceived-benefit score for ChatGPT, whereas approximately 37% (n = 81) expressed a high concern score. More than 70% of pharmacists (n = 168) believed that ChatGPT lacked the ability to use human judgment and make complicated ethical judgments in its responses. Finally, logistic regression analysis showed that pharmacists with previous experience using ChatGPT were more willing to integrate it into their pharmacy practice than those without (odds ratio 2.312, P = 0.035). CONCLUSION Although pharmacists show a willingness to incorporate ChatGPT into their practice, especially those with previous experience, there are major concerns. These mainly revolve around the tool's ability to make human-like judgments and ethical decisions. These findings are crucial for the future development and integration of AI tools in pharmacy practice.
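The reported odds ratio is the exponentiated coefficient of a logistic regression on a binary predictor. A minimal sketch with hypothetical survey rows (not the study's data):

```python
# Sketch: odds ratio for prior ChatGPT experience predicting willingness
# to adopt it, via logistic regression. All data below are hypothetical.
import numpy as np
import statsmodels.api as sm

prior_experience = np.array([1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1])
willing          = np.array([1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1])

X = sm.add_constant(prior_experience)        # intercept + binary predictor
model = sm.Logit(willing, X).fit(disp=0)
odds_ratio = np.exp(model.params[1])         # exponentiated slope coefficient
print(f"OR for prior experience = {odds_ratio:.3f}, p = {model.pvalues[1]:.3f}")
```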
49
Daher M, Koa J, Boufadel P, Singh J, Fares MY, Abboud JA. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int 2023; 7:2534-2541. [PMID: 37969495 PMCID: PMC10638599 DOI: 10.1016/j.jseint.2023.07.018] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2023] Open
Abstract
Background ChatGPT is an artificial intelligence (AI) language processing model that uses deep learning to generate human-like responses to natural language inputs. Its potential use in health care has raised questions, and several studies have assessed its effectiveness in writing articles, clinical reasoning, and solving complex questions. This study investigates ChatGPT's capabilities and limitations in diagnosing and managing patients with new shoulder and elbow complaints in a private clinical setting, to provide insight into its potential use as a diagnostic tool for patients and a first-consultation resource for primary physicians. Methods In a private clinical setting, patients were assessed by ChatGPT after being seen by a shoulder and elbow specialist for shoulder and elbow symptoms. For each patient, a research fellow filled out a standardized form (including age, gender, major comorbidities, the symptoms with their localization, natural history, and duration, any associated symptoms or movement deficits, aggravating/relieving factors, and the x-ray/imaging report if present). This form was submitted through the ChatGPT portal, and the AI model was asked for a diagnosis and the best management modality. Results A total of 29 patients (15 male, 14 female) were included. The AI model chose the correct diagnosis in 93% (27/29) of patients and a correct management option in 83% (24/29). However, of the 24 patients managed correctly, ChatGPT did not specify the appropriate management in 6 and chose only one of two applicable, patient-preference-dependent options in 5. Counting these 11 alongside the 5 incorrectly managed patients, 16 of 29 (55%) of ChatGPT's management responses were judged poor. Conclusion ChatGPT made a worthy opponent; however, in its current form it cannot replace a shoulder and elbow specialist in diagnosing and treating patients, for reasons including misdiagnosis, poor management, lack of empathy and interaction with patients, dependence on magnetic resonance imaging reports, and lack of up-to-date knowledge.
Affiliation(s)
- Jonathan Koa
- Rothman Institute/Thomas Jefferson Medical Center, Philadelphia, PA, USA
- Peter Boufadel
- Rothman Institute/Thomas Jefferson Medical Center, Philadelphia, PA, USA
- Jaspal Singh
- Rothman Institute/Thomas Jefferson Medical Center, Philadelphia, PA, USA
- Mohamad Y. Fares
- Rothman Institute/Thomas Jefferson Medical Center, Philadelphia, PA, USA
- Joseph A. Abboud
- Rothman Institute/Thomas Jefferson Medical Center, Philadelphia, PA, USA

50
Yu P, Xu H, Hu X, Deng C. Leveraging Generative AI and Large Language Models: A Comprehensive Roadmap for Healthcare Integration. Healthcare (Basel) 2023; 11:2776. [PMID: 37893850 PMCID: PMC10606429 DOI: 10.3390/healthcare11202776] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 10/13/2023] [Accepted: 10/17/2023] [Indexed: 10/29/2023] Open
Abstract
Generative artificial intelligence (AI) and large language models (LLMs), exemplified by ChatGPT, promise to revolutionize data and information management in healthcare and medicine. However, there is scant literature guiding their integration for non-AI professionals. This study conducts a scoping literature review to address the critical need for guidance on integrating generative AI and LLMs into healthcare and medical practice. It elucidates the distinct mechanisms underpinning these technologies, such as reinforcement learning from human feedback (RLHF), few-shot learning, and chain-of-thought reasoning, which differentiate them from traditional, rule-based AI systems. Realizing their benefits requires an inclusive, collaborative co-design process that engages all pertinent stakeholders, including clinicians and consumers. Although global research is examining both opportunities and challenges, including ethical and legal dimensions, LLMs offer promising advancements in healthcare by enhancing data management, information retrieval, and decision-making processes. Continued innovation in data acquisition, model fine-tuning, prompt strategy development, evaluation, and system implementation is imperative for realizing the full potential of these technologies. Organizations should proactively engage with these technologies to improve healthcare quality, safety, and efficiency, adhering to ethical and legal guidelines for responsible application.
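The few-shot and chain-of-thought prompting this review highlights can be illustrated with a plain prompt template; the wording, example items, and function name below are illustrative only, not a validated clinical tool:

```python
# Sketch: a few-shot, chain-of-thought style prompt for a clinical
# information question. Examples and phrasing are illustrative only.
FEW_SHOT_COT_PROMPT = """\
You are assisting with clinical information retrieval. Think step by step.

Q: A patient asks whether ibuprofen can be taken with warfarin.
Reasoning: Ibuprofen inhibits platelet function and can raise bleeding
risk with anticoagulants, so the combination needs clinician review.
A: This combination can increase bleeding risk; advise checking with
the prescribing clinician before use.

Q: {question}
Reasoning:"""

def build_prompt(question: str) -> str:
    """Fill the few-shot template with a new question."""
    return FEW_SHOT_COT_PROMPT.format(question=question)

print(build_prompt("Can metformin be continued before contrast CT imaging?"))
```

The worked example in the prompt is what makes it "few-shot"; asking the model to produce the "Reasoning:" step before the answer is the chain-of-thought element.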
Affiliation(s)
- Ping Yu
- School of Computing and Information Technology, University of Wollongong, Wollongong, NSW 2522, Australia
- Hua Xu
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, 100 College Street, Fl 9, New Haven, CT 06510, USA
- Xia Hu
- Department of Computer Science, Rice University, P.O. Box 1892, Houston, TX 77251-1892, USA
- Chao Deng
- School of Medical, Indigenous and Health Sciences, University of Wollongong, Wollongong, NSW 2522, Australia