1. Gonzalez Fiol A, Mootz AA, He Z, Delgado C, Ortiz V, Reale SC. Accuracy of Spanish and English-generated ChatGPT responses to commonly asked patient questions about labor epidurals: a survey-based study among bilingual obstetric anesthesia experts. Int J Obstet Anesth 2025; 61:104290. PMID: 39579604. DOI: 10.1016/j.ijoa.2024.104290.
Abstract
BACKGROUND: Large language models (LLMs), of which ChatGPT is the best known, are now available to patients seeking medical advice in various languages. However, the accuracy of the information used to train these models remains unknown.
METHODS: Ten commonly asked questions about labor epidurals were translated from English to Spanish, and all 20 questions were entered into ChatGPT version 3.5. The answers were transcribed, and a survey was sent to 10 bilingual, fellowship-trained obstetric anesthesiologists to rate the accuracy of each answer on a 5-point Likert scale.
RESULTS: Overall, accuracy scores for the ChatGPT-generated answers were lower in Spanish than in English, with median scores of 34 (IQR 33-36.5) versus 40.5 (IQR 39-44.3), respectively (P = 0.02). Spanish answers to two questions scored significantly lower: "Do epidurals prolong labor?" (2 (IQR 2-2.5) versus 4 (IQR 4-4.5); P = 0.03) and "Do epidurals increase the risk of needing cesarean delivery?" (3 (IQR 2-4) versus 4 (IQR 4-5); P = 0.03). There was strong agreement that answers to "Do epidurals cause autism?" were accurate in both Spanish and English.
CONCLUSION: ChatGPT-generated answers in Spanish to ten questions about labor epidurals scored lower for accuracy than answers generated in English, particularly regarding the effect of labor epidurals on labor course and mode of delivery. This disparity in ChatGPT-generated information may extend already-known health inequities among non-English-speaking patients and perpetuate misinformation.
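For readers who want to reproduce this kind of comparison, the sketch below illustrates one plausible analysis of the per-rater totals (ten 5-point Likert items summed per language, so each of the 10 raters contributes a paired Spanish and English score). The abstract does not name the statistical test; the Wilcoxon signed-rank test on paired per-rater totals is an assumption, and the score vectors are invented placeholders chosen only to resemble the reported medians, not the study's data.

```python
# Minimal sketch of a paired language comparison, assuming a Wilcoxon
# signed-rank test on per-rater total scores. All scores below are
# hypothetical placeholders, not the study's data.
import numpy as np
from scipy.stats import wilcoxon

english = np.array([40, 41, 39, 44, 45, 40, 39, 42, 44, 38])  # placeholder
spanish = np.array([34, 33, 36, 35, 33, 37, 34, 32, 36, 35])  # placeholder

def median_iqr(x: np.ndarray) -> str:
    """Summarize a score vector as median (IQR), matching the abstract's style."""
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return f"{med:g} (IQR {q1:g}-{q3:g})"

print("English:", median_iqr(english))
print("Spanish:", median_iqr(spanish))
# Paired test: the same ten raters scored both answer sets.
stat, p = wilcoxon(english, spanish)
print(f"Wilcoxon signed-rank P = {p:.3f}")
```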
Affiliation(s)
- Antonio Gonzalez Fiol: Department of Anesthesiology, Yale School of Medicine, New Haven, CT, United States.
- Allison A Mootz: Department of Anesthesiology, University of Texas Southwestern Medical Center & Parkland Memorial Hospital, Dallas, TX, United States.
- Zili He: Yale School of Public Health, United States.
- Carlos Delgado: Department of Anesthesiology and Pain Medicine, University of Washington, Seattle, WA, United States.
- Sharon C Reale: Department of Anesthesiology, Perioperative and Pain Medicine, Harvard Medical School, Brigham and Women's Hospital, Boston, MA, United States.
2. Khan AA, Yunus R, Sohail M, Rehman TA, Saeed S, Bu Y, Jackson CD, Sharkey A, Mahmood F, Matyal R. Artificial Intelligence for Anesthesiology Board-Style Examination Questions: Role of Large Language Models. J Cardiothorac Vasc Anesth 2024; 38:1251-1259. PMID: 38423884. DOI: 10.1053/j.jvca.2024.01.032.
Abstract
New artificial intelligence tools with implications for medical use have been developed, but large language models (LLMs), such as OpenAI's widely used ChatGPT, have not been explored in the context of anesthesiology education. Understanding the reliability of publicly available LLMs in medical specialties could offer insight into their grasp of the physiology, pharmacology, and practical applications of anesthesiology. An exploratory prospective review was conducted using three commercially available LLMs (OpenAI's ChatGPT GPT-3.5, OpenAI's ChatGPT GPT-4, and Google's Bard) on questions from a widely used anesthesia board examination review book. Of the 884 eligible questions, the overall correct-answer rates were 47.9% for GPT-3.5, 69.4% for GPT-4, and 45.2% for Bard. GPT-4 performed significantly better than both GPT-3.5 and Bard (p = 0.001 and p < 0.001, respectively), but none of the LLMs met the approximate 70% passing score required to secure American Board of Anesthesiology certification, and GPT-4 lacked consistency in providing explanations aligned with scientific and medical consensus. Although GPT-4 shows promise, current LLMs are not sufficiently advanced to answer anesthesiology board examination questions with passing success. Further iterations and domain-specific training may enhance their utility in medical education.
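As a rough illustration of this kind of head-to-head evaluation, the sketch below compares two models answering the same question set. The abstract does not specify the statistical test; an exact McNemar test on paired per-question outcomes is an assumption (all models saw the same 884 questions), and the right/wrong vectors are randomly generated placeholders matched only to the reported overall rates, not the study's item-level results.

```python
# Minimal sketch of a pairwise model comparison, assuming an exact McNemar
# test on paired per-question outcomes. The outcome vectors are fabricated
# placeholders, not the study's item-level results.
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)
n_items = 884
# Placeholder right/wrong vectors drawn to match the reported overall rates.
gpt4 = rng.random(n_items) < 0.694
gpt35 = rng.random(n_items) < 0.479

def mcnemar_exact(a: np.ndarray, b: np.ndarray) -> float:
    """Exact McNemar p-value from the discordant pairs of two paired
    binary outcome vectors (True = answered correctly)."""
    a_only = int(np.sum(a & ~b))   # a correct where b was wrong
    b_only = int(np.sum(~a & b))   # b correct where a was wrong
    n_discordant = a_only + b_only
    if n_discordant == 0:
        return 1.0
    return binomtest(a_only, n_discordant, 0.5).pvalue

print(f"GPT-4 accuracy:   {gpt4.mean():.1%}")
print(f"GPT-3.5 accuracy: {gpt35.mean():.1%}")
print(f"McNemar p (GPT-4 vs GPT-3.5): {mcnemar_exact(gpt4, gpt35):.4g}")
print("Meets ~70% passing approximation:", gpt4.mean() >= 0.70)
```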
Affiliation(s)
- Adnan A Khan, Rayaan Yunus, Mahad Sohail, Taha A Rehman, Shirin Saeed, Yifan Bu, Cullen D Jackson, Aidan Sharkey, Feroze Mahmood, Robina Matyal: Department of Anesthesia, Critical Care, and Pain Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA.
3. Patnaik SS, Hoffmann U. Quantitative evaluation of ChatGPT versus Bard responses to anaesthesia-related queries. Br J Anaesth 2024; 132:169-171. PMID: 37945414. PMCID: PMC11837762. DOI: 10.1016/j.bja.2023.09.030.
Affiliation(s)
- Sourav S Patnaik: Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA.
- Ulrike Hoffmann: Department of Anesthesiology and Pain Management, The University of Texas Southwestern Medical Center, Dallas, TX, USA.