1
Busch F, Hoffmann L, Rueger C, van Dijk EH, Kader R, Ortiz-Prado E, Makowski MR, Saba L, Hadamitzky M, Kather JN, Truhn D, Cuocolo R, Adams LC, Bressem KK. Current applications and challenges in large language models for patient care: a systematic review. Commun Med 2025; 5:26. [PMID: 39838160 PMCID: PMC11751060 DOI: 10.1038/s43856-024-00717-2]
Abstract
BACKGROUND The introduction of large language models (LLMs) into clinical practice promises to improve patient education and empowerment, thereby personalizing medical care and broadening access to medical knowledge. Despite the popularity of LLMs, there is a significant gap in systematized information on their use in patient care. Therefore, this systematic review aims to synthesize current applications and limitations of LLMs in patient care. METHODS We systematically searched 5 databases for qualitative, quantitative, and mixed methods articles on LLMs in patient care published between 2022 and 2023. From 4349 initial records, 89 studies across 29 medical specialties were included. Quality assessment was performed using the Mixed Methods Appraisal Tool 2018. A data-driven convergent synthesis approach was applied for thematic syntheses of LLM applications and limitations using free line-by-line coding in Dedoose. RESULTS We show that most studies investigate Generative Pre-trained Transformers (GPT)-3.5 (53.2%, n = 66 of 124 different LLMs examined) and GPT-4 (26.6%, n = 33/124) in answering medical questions, followed by patient information generation, including medical text summarization or translation, and clinical documentation. Our analysis delineates two primary domains of LLM limitations: design and output. Design limitations include 6 second-order and 12 third-order codes, such as lack of medical domain optimization, data transparency, and accessibility issues, while output limitations include 9 second-order and 32 third-order codes, for example, non-reproducibility, non-comprehensiveness, incorrectness, unsafety, and bias. CONCLUSIONS This review systematically maps LLM applications and limitations in patient care, providing a foundational framework and taxonomy for their implementation and evaluation in healthcare settings.
Affiliation(s)
- Felix Busch
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany.
- Lena Hoffmann
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Christopher Rueger
- Department of Neuroradiology, Charité - Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Elon Hc van Dijk
- Department of Ophthalmology, Leiden University Medical Center, Leiden, The Netherlands
- Department of Ophthalmology, Sir Charles Gairdner Hospital, Perth, Australia
- Rawen Kader
- Division of Surgery and Interventional Sciences, University College London, London, United Kingdom
- Esteban Ortiz-Prado
- One Health Research Group, Faculty of Health Science, Universidad de Las Américas, Quito, Ecuador
- Marcus R Makowski
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- Luca Saba
- Department of Radiology, Azienda Ospedaliero Universitaria (A.O.U.), Cagliari, Italy
- Martin Hadamitzky
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
- Jakob Nikolas Kather
- Department of Medical Oncology, National Center for Tumor Diseases (NCT), Heidelberg University Hospital, Heidelberg, Germany
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Technical University Dresden, Dresden, Germany
- Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital Aachen, Aachen, Germany
- Renato Cuocolo
- Department of Medicine, Surgery and Dentistry, University of Salerno, Baronissi, Italy
- Lisa C Adams
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- Keno K Bressem
- School of Medicine and Health, Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, TUM University Hospital, Technical University of Munich, Munich, Germany
- School of Medicine and Health, Institute for Cardiovascular Radiology and Nuclear Medicine, German Heart Center Munich, TUM University Hospital, Technical University of Munich, Munich, Germany
2
Fisher AD, Fisher G. Danger, Danger, Gaston Labat! Does zero-shot artificial intelligence correlate with anticoagulation guidelines recommendations for neuraxial anesthesia? Reg Anesth Pain Med 2025; 50:73-74. [PMID: 38418408 DOI: 10.1136/rapm-2024-105405]
Affiliation(s)
- Andrew D Fisher
- Department of Anesthesia & Perioperative Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
- Gabrielle Fisher
- Department of Anesthesia & Perioperative Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
3
Zhou S, Luo X, Chen C, Jiang H, Yang C, Ran G, Yu J, Yin C. The performance of large language model-powered chatbots compared to oncology physicians on colorectal cancer queries. Int J Surg 2024; 110:6509-6517. [PMID: 38935100 PMCID: PMC11487020 DOI: 10.1097/js9.0000000000001850]
Abstract
BACKGROUND Large language model (LLM)-powered chatbots have become increasingly prevalent in healthcare, but their capacity in oncology remains largely unknown. This study aimed to evaluate the performance of LLM-powered chatbots compared to oncology physicians in addressing colorectal cancer queries. METHODS This study was conducted between August 13, 2023, and January 5, 2024. A total of 150 questions were designed, and each question was submitted three times to eight chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT-4 Turbo, Doctor GPT, Llama-2-70B, Mixtral-8x7B, Bard, and Claude 2.1. No feedback was provided to these chatbots. The questions were also answered by nine oncology physicians, including three residents, three fellows, and three attendings. Each answer was scored based on its consistency with guidelines, with a score of 1 for consistent answers and 0 for inconsistent answers. The total score for each question was based on the number of correct answers, ranging from 0 to 3. The accuracy and scores of the chatbots were compared to those of the physicians. RESULTS Claude 2.1 demonstrated the highest accuracy, with an average accuracy of 82.67%, followed by Doctor GPT at 80.45%, ChatGPT-4 Turbo at 78.44%, ChatGPT-4 at 78%, Mixtral-8x7B at 73.33%, Bard at 70%, ChatGPT-3.5 at 64.89%, and Llama-2-70B at 61.78%. Claude 2.1 outperformed residents, fellows, and attendings. Doctor GPT outperformed residents and fellows. Additionally, Mixtral-8x7B outperformed residents. In terms of scores, Claude 2.1 outperformed residents and fellows. Doctor GPT, ChatGPT-4 Turbo, and ChatGPT-4 outperformed residents. CONCLUSIONS This study shows that LLM-powered chatbots can provide more accurate medical information compared to oncology physicians.
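For readers who want to reproduce this kind of guideline-consistency scoring, the minimal Python sketch below (using invented ratings, not study data) illustrates the arithmetic described in the methods: each question is submitted three times, each answer is marked 1 if guideline-consistent and 0 otherwise, per-question scores range from 0 to 3, and accuracy is the proportion of correct answers.

```python
# Illustrative sketch (not the authors' code): per-question scores and overall
# accuracy under the scoring scheme described in the abstract. All ratings below
# are made up for demonstration.
from statistics import mean

# ratings[question_id] = [run1, run2, run3] for one hypothetical chatbot,
# where 1 = guideline-consistent answer and 0 = inconsistent answer
ratings = {
    "Q001": [1, 1, 1],
    "Q002": [1, 0, 1],
    "Q003": [0, 0, 0],
}

# Per-question score: number of correct answers across the three submissions (0-3)
scores = {qid: sum(runs) for qid, runs in ratings.items()}

# Accuracy: proportion of all individual answers that were guideline-consistent
all_answers = [r for runs in ratings.values() for r in runs]
accuracy = mean(all_answers)

print(scores)                        # {'Q001': 3, 'Q002': 2, 'Q003': 0}
print(f"accuracy = {accuracy:.1%}")  # 55.6% for this toy example
```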
Affiliation(s)
- Shan Zhou
- Florida Research and Innovation Center, Cleveland Clinic, Port St. Lucie, FL, USA
- Xiao Luo
- Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, Shenzhen, China
- Chan Chen
- Department of Clinical Laboratory, Shenzhen Baoan Hospital, The Second Affiliated Hospital of Shenzhen University, Shenzhen
- Hong Jiang
- Statistical Office, Zhuhai People’s Hospital, Zhuhai Clinical Medical College of Jinan University, Zhuhai
- Faculty of Medicine, Macau University of Science and Technology, Macau, China
- Chun Yang
- Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, Shenzhen, China
- Guanghui Ran
- Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, Shenzhen, China
- Juan Yu
- Department of Radiology, The First Affiliated Hospital of Shenzhen University, Health Science Center, Shenzhen Second People’s Hospital, Shenzhen, China
- Chengliang Yin
- Faculty of Medicine, Macau University of Science and Technology, Macau, China
4
Kuo FH, Fierstein JL, Tudor BH, Gray GM, Ahumada LM, Watkins SC, Rehman MA. Comparing ChatGPT and a Single Anesthesiologist's Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists. J Med Syst 2024; 48:77. [PMID: 39172169 DOI: 10.1007/s10916-024-02100-z]
Abstract
Increased patient access to electronic medical records and resources has resulted in higher volumes of health-related questions posed to clinical staff, while physicians' rising clinical workloads have resulted in less time for comprehensive, thoughtful responses to patient questions. Artificial intelligence chatbots powered by large language models (LLMs) such as ChatGPT could help anesthesiologists efficiently respond to electronic patient inquiries, but their ability to do so is unclear. A cross-sectional exploratory survey-based study comprising 100 anesthesia-related patient question/response sets based on two fictitious simple clinical scenarios was performed. Each question was answered by an independent board-certified anesthesiologist and ChatGPT (GPT-3.5 model, August 3, 2023 version). The responses were randomized and evaluated via survey by three blinded board-certified anesthesiologists for various quality and empathy measures. On a 5-point Likert scale, ChatGPT received similar overall quality ratings (4.2 vs. 4.1, p = .81) and significantly higher overall empathy ratings (3.7 vs. 3.4, p < .01) compared to the anesthesiologist. ChatGPT underperformed the anesthesiologist in the rate of responses in agreement with scientific consensus (96.6% vs. 99.3%, p = .02) and the possibility of harm (4.7% vs. 1.7%, p = .04), but performed similarly in other measures (percentage of responses with inappropriate/incorrect information (5.7% vs. 2.7%, p = .07) and missing information (10.0% vs. 7.0%, p = .19)). In conclusion, LLMs show great potential in healthcare, but additional improvement is needed to decrease the risk of patient harm and reduce the need for close physician oversight. Further research with more complex clinical scenarios, clinicians, and live patients is necessary to validate their role in healthcare.
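As a rough illustration of how such 5-point Likert ratings might be compared between two response sources, the sketch below runs a Mann-Whitney U test on invented scores; the study's actual analysis, which involved repeated ratings from multiple blinded reviewers, may well have used a different approach.

```python
# Minimal sketch, not the study's analysis code: comparing two sets of 1-5
# Likert quality ratings (chatbot vs. physician responses). The ratings are
# randomly generated stand-ins, not data from the publication.
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
chatbot_ratings = rng.integers(3, 6, size=100)    # hypothetical ratings, values 3-5
physician_ratings = rng.integers(3, 6, size=100)

stat, p = mannwhitneyu(chatbot_ratings, physician_ratings, alternative="two-sided")
print(f"mean chatbot = {chatbot_ratings.mean():.2f}, "
      f"mean physician = {physician_ratings.mean():.2f}, p = {p:.3f}")
```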
Affiliation(s)
- Frederick H Kuo
- Department of Anesthesia and Pain Medicine, Johns Hopkins All Children's Hospital, 601 5th St South, Suite C725, St Petersburg, FL, 33701, USA.
- Jamie L Fierstein
- Epidemiology and Biostatistics Shared Resource, Institute for Clinical and Translational Research, Johns Hopkins All Children's Hospital, St Petersburg, FL, USA
- Brant H Tudor
- Center for Pediatric Data Science and Analytics Methodology, Johns Hopkins All Children's Hospital, St Petersburg, FL, USA
- Geoffrey M Gray
- Center for Pediatric Data Science and Analytics Methodology, Johns Hopkins All Children's Hospital, St Petersburg, FL, USA
- Luis M Ahumada
- Center for Pediatric Data Science and Analytics Methodology, Johns Hopkins All Children's Hospital, St Petersburg, FL, USA
- Scott C Watkins
- Department of Anesthesia and Pain Medicine, Johns Hopkins All Children's Hospital, 601 5th St South, Suite C725, St Petersburg, FL, 33701, USA
- Mohamed A Rehman
- Department of Anesthesia and Pain Medicine, Johns Hopkins All Children's Hospital, 601 5th St South, Suite C725, St Petersburg, FL, 33701, USA
5
Pardo E, Le Cam E, Verdonk F. Artificial intelligence and nonoperating room anesthesia. Curr Opin Anaesthesiol 2024; 37:413-420. [PMID: 38934202 DOI: 10.1097/aco.0000000000001388]
Abstract
PURPOSE OF REVIEW The integration of artificial intelligence (AI) in nonoperating room anesthesia (NORA) represents a timely and significant advancement. As the demand for NORA services expands, the application of AI is poised to improve patient selection, perioperative care, and anesthesia delivery. This review examines AI's growing impact on NORA and how it can optimize our clinical practice in the near future. RECENT FINDINGS AI has already improved various aspects of anesthesia, including preoperative assessment, intraoperative management, and postoperative care. Studies highlight AI's role in patient risk stratification, real-time decision support, and predictive modeling for patient outcomes. Notably, AI applications can be used to target patients at risk of complications, alert clinicians to the upcoming occurrence of an intraoperative adverse event such as hypotension or hypoxemia, or predict their tolerance of anesthesia after the procedure. Despite these advances, challenges persist, including ethical considerations, algorithmic bias, data security, and the need for transparent decision-making processes within AI systems. SUMMARY The findings underscore the substantial benefits of AI in NORA, which include improved safety, efficiency, and personalized care. AI's predictive capabilities in assessing hypoxemia risk and other perioperative events have demonstrated potential to exceed human prognostic accuracy. The implications of these findings advocate for a careful yet progressive adoption of AI in clinical practice, encouraging the development of robust ethical guidelines, continual professional training, and comprehensive data management strategies. Furthermore, AI's role in anesthesia underscores the need for multidisciplinary research to address the limitations and fully leverage AI's capabilities for patient-centered anesthesia care.
Affiliation(s)
- Emmanuel Pardo
- Sorbonne University, GRC 29, AP-HP, DMU DREAM, Department of Anesthesiology and Critical Care, Saint-Antoine Hospital, Paris, France
6
Kim HJ, Yang JH, Chang DG, Lenke LG, Pizones J, Castelein R, Watanabe K, Trobisch PD, Mundis GM, Suh SW, Suk SI. Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis. J Med Internet Res 2024; 26:e52001. [PMID: 38924787 PMCID: PMC11237793 DOI: 10.2196/52001]
Abstract
BACKGROUND Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as "Gemini"; Google AI) were developed using distinct approaches, but little has been studied about differences in their ability to generate abstracts. The use of AI to write scientific abstracts in the field of spine surgery is at the center of much debate and controversy. OBJECTIVE The objective of this study is to assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery. METHODS In total, 60 abstracts dealing with spine sections were randomly selected from 7 reputable journals and used as ChatGPT and Bard input statements to generate abstracts based on supplied paper titles. A total of 174 abstracts, divided into human-written abstracts, ChatGPT-generated abstracts, and Bard-generated abstracts, were evaluated for compliance with the structured format of journal guidelines and consistency of content. The likelihood of plagiarism and AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or human authors. RESULTS The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT abstracts (34/60, 56.6%) compared with those generated by Bard (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) compared with Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), achieving a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 55.9% (62/128) of AI-generated abstracts were recognized as human-written and AI-generated by human reviewers, respectively. CONCLUSIONS Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard.
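The reported area under the curve describes how well an AI-detection score separates AI-generated from human-written abstracts. The sketch below shows one way such an AUC could be computed; the detector scores are fabricated, with only the group sizes (60 human-written, 114 AI-generated) mirroring the counts in the abstract.

```python
# Illustrative sketch (fabricated scores, not study data): evaluating an
# AI-detection score as a classifier of AI-generated vs. human-written abstracts
# via the area under the ROC curve.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# 0 = human-written abstract, 1 = AI-generated abstract
labels = np.array([0] * 60 + [1] * 114)

# Hypothetical detector outputs (e.g., an "AI likelihood" percentage),
# drawn so that AI-generated texts tend to score higher
scores = np.concatenate([
    rng.normal(30, 15, 60),    # human-written
    rng.normal(70, 20, 114),   # AI-generated
]).clip(0, 100)

print(f"AUC = {roc_auc_score(labels, scores):.3f}")
```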
Affiliation(s)
- Hong Jin Kim
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
- Jae Hyuk Yang
- Department of Orthopedic Surgery, Korea University Anam Hospital, College of Medicine, Korea University, Seoul, Republic of Korea
- Dong-Gune Chang
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
- Lawrence G Lenke
- Department of Orthopedic Surgery, The Daniel and Jane Och Spine Hospital, Columbia University, New York, NY, United States
- Javier Pizones
- Department of Orthopedic Surgery, Hospital Universitario La Paz, Madrid, Spain
- René Castelein
- Department of Orthopedic Surgery, University Medical Centre Utrecht, Utrecht, Netherlands
- Kota Watanabe
- Department of Orthopedic Surgery, Keio University School of Medicine, Tokyo, Japan
- Per D Trobisch
- Department of Spine Surgery, Eifelklinik St. Brigida, Simmerath, Germany
- Gregory M Mundis
- Department of Orthopaedic Surgery, Scripps Clinic, La Jolla, CA, United States
- Seung Woo Suh
- Department of Orthopedic Surgery, Korea University Guro Hospital, College of Medicine, Korea University, Seoul, Republic of Korea
- Se-Il Suk
- Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, Seoul, Republic of Korea
7
Balasanjeevi G, Surapaneni KM. Comparison of ChatGPT version 3.5 & 4 for utility in respiratory medicine education using clinical case scenarios. Respir Med Res 2024; 85:101091. [PMID: 38657295 DOI: 10.1016/j.resmer.2024.101091]
Abstract
Integration of ChatGPT in respiratory medicine presents a promising avenue for enhancing clinical practice and pedagogical approaches. This study compares the performance of ChatGPT versions 3.5 and 4 in respiratory medicine using clinical cases, emphasizing their potential in clinical decision support and medical education. Results indicate moderate performance, highlighting limitations in handling complex case scenarios. Compared with ChatGPT 3.5, version 4 showed greater promise as a pedagogical tool, providing interactive learning experiences. Although ChatGPT may serve as a preliminary clinical decision support tool, caution is advised and ongoing validation is needed. Future research should refine its clinical capabilities for optimal integration into medical education and practice.
Affiliation(s)
- Gayathri Balasanjeevi
- Department of Tuberculosis & Respiratory Diseases, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai 600 123, Tamil Nadu, India
- Krishna Mohan Surapaneni
- Department of Biochemistry, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India; Department of Medical Education, Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600 123 Tamil Nadu, India.
8
Nguyen TP, Carvalho B, Sukhdeo H, Joudi K, Guo N, Chen M, Wolpaw JT, Kiefer JJ, Byrne M, Jamroz T, Mootz AA, Reale SC, Zou J, Sultan P. Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia. BJA Open 2024; 10:100280. [PMID: 38764485 PMCID: PMC11099318 DOI: 10.1016/j.bjao.2024.100280]
Abstract
Background Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries. Methods Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were scored between 1 and 5 (1 representing worst, 5 representing best). Results ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (score of ≥4 by 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (score ≥4 by 42%, 30%, and 12% of experts for ChatGPT4, Bard, and Bing Chat, respectively), and safety (score ≥4 by 50%, 40%, and 28% of experts for ChatGPT4, Bard, and Bing Chat, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed statistically in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; and Bard 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well with no statistical difference for understandability (P=0.24), empathy (P=0.032), and ethics (P=0.465). Conclusions In answering anaesthesia patient frequently asked questions, the chatbots perform well on communication metrics but are suboptimal for medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, both outperforming Bing Chat.
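As an illustration of the kind of median [inter-quartile range] summary and non-parametric comparison reported above, the following sketch uses invented 1-5 expert scores for three hypothetical chatbots and a Kruskal-Wallis test; the study's exact statistical workflow may have differed.

```python
# Toy sketch (not the authors' code): summarising 1-5 expert scores per chatbot
# as median [IQR] and testing for an overall difference across chatbots.
# All scores below are randomly generated placeholders.
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(1)
scores = {
    "ChatGPT4": rng.integers(3, 6, size=50),   # hypothetical 1-5 expert scores
    "Bard": rng.integers(3, 6, size=50),
    "Bing Chat": rng.integers(2, 5, size=50),
}

for name, s in scores.items():
    q1, med, q3 = np.percentile(s, [25, 50, 75])
    print(f"{name}: median {med:.0f} [IQR {q1:.0f}-{q3:.0f}]")

stat, p = kruskal(*scores.values())
print(f"Kruskal-Wallis p = {p:.4f}")
```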
Affiliation(s)
- Teresa P. Nguyen
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Brendan Carvalho
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Hannah Sukhdeo
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Kareem Joudi
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Nan Guo
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Marianne Chen
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA
- Jed T. Wolpaw
- Department of Anesthesiology and Critical Care Medicine, Johns Hopkins University School of Medicine, Baltimore, MD, USA
- Jesse J. Kiefer
- Department of Anesthesiology and Critical Care Medicine, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
- Melissa Byrne
- Department of Anesthesiology, Perioperative and Pain Medicine, University of Michigan Ann Arbor School of Medicine, Ann Arbor, MI, USA
- Tatiana Jamroz
- Department of Anesthesiology, Perioperative and Pain Medicine, Cleveland Clinic Foundation and Hospitals, Cleveland, OH, USA
- Allison A. Mootz
- Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Harvard School of Medicine, Boston, MA, USA
- Sharon C. Reale
- Department of Anesthesiology, Perioperative and Pain Medicine, Brigham and Women's Hospital, Harvard School of Medicine, Boston, MA, USA
- James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Pervez Sultan
- Department of Anesthesiology, Perioperative and Pain Medicine, Stanford School of Medicine, Stanford, CA, USA