1. Al Barajraji M, Barrit S, Ben-Hamouda N, Harel E, Torcida N, Pizzarotti B, Massager N, Lechien JR. AI-Driven Information for Relatives of Patients with Malignant Middle Cerebral Artery Infarction: A Preliminary Validation Study Using GPT-4o. Brain Sci 2025; 15:391. PMID: 40309831; PMCID: PMC12026103; DOI: 10.3390/brainsci15040391.
Abstract
Purpose: This study examines GPT-4o's ability to communicate effectively with relatives of patients undergoing decompressive hemicraniectomy (DHC) after malignant middle cerebral artery infarction (MMCAI). Methods: GPT-4o was asked 25 common questions from patients' relatives about DHC for MMCAI, twice over a 7-day interval. Responses were rated for accuracy, clarity, relevance, completeness, sourcing, and usefulness by a board-certified intensivist, a neurologist, and two neurosurgeons using the Quality Analysis of Medical AI (QAMAI) tool. Interrater reliability and stability were measured using the intraclass correlation coefficient (ICC) and Pearson's correlation. Results: The total QAMAI scores were 22.32 ± 3.08 for the intensivist, 24.68 ± 2.8 for the neurologist, and 23.36 ± 2.86 and 26.32 ± 2.91 for the two neurosurgeons, representing moderate-to-high accuracy. The evaluators showed moderate interrater reliability (ICC 0.631, 95% CI: 0.321-0.821). The highest subscores were for accuracy, clarity, and relevance, while the poorest were for completeness, usefulness, and sourcing; GPT-4o did not systematically provide references for its responses. The stability analysis indicated moderate-to-high stability across the two query sessions. The readability assessment revealed a Flesch Reading Ease (FRE) score of 7.23, a Flesch-Kincaid Grade (FKG) level of 15.87, and a Gunning Fog (GF) index of 18.15. Conclusions: GPT-4o provides moderate-to-high-quality information related to DHC for MMCAI, with strengths in accuracy, clarity, and relevance. However, limitations in completeness, sourcing, and readability may reduce its effectiveness in educating patients and their relatives.
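For orientation, the readability indices cited in this abstract are conventionally computed from sentence, word, and syllable counts; the standard formulas are shown below (these are the usual definitions, not equations reproduced from the study). An FRE of 7.23 with an FKG of 15.87 places the text at roughly a university reading level.

```latex
% Standard readability formulas (conventional definitions)
\mathrm{FRE} = 206.835 - 1.015\,\frac{\text{total words}}{\text{total sentences}} - 84.6\,\frac{\text{total syllables}}{\text{total words}}
\mathrm{FKG} = 0.39\,\frac{\text{total words}}{\text{total sentences}} + 11.8\,\frac{\text{total syllables}}{\text{total words}} - 15.59
\mathrm{GF}  = 0.4\left(\frac{\text{total words}}{\text{total sentences}} + 100\,\frac{\text{complex words}}{\text{total words}}\right)
```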
Affiliation(s)
- Mejdeddine Al Barajraji
- Department of Neurosurgery, University Hospital of Lausanne and University of Lausanne, 1005 Lausanne, Switzerland;
- Sami Barrit
- Department of Neurosurgery, CHU Tivoli, 7110 La Louvière, Belgium; (S.B.); (N.M.)
- Nawfel Ben-Hamouda
- Department of Adult Intensive Care, University Hospital of Lausanne (CHUV), University of Lausanne, 1005 Lausanne, Switzerland;
- Ethan Harel
- Department of Neurosurgery, University Hospital of Lausanne and University of Lausanne, 1005 Lausanne, Switzerland;
- Nathan Torcida
- Department of Neurology, Hôpital Universitaire de Bruxelles (HUB), 1070 Brussels, Belgium;
- Beatrice Pizzarotti
- Department of Neurology, University Hospital of Lausanne (CHUV), University of Lausanne, 1011 Lausanne, Switzerland;
- Nicolas Massager
- Department of Neurosurgery, CHU Tivoli, 7110 La Louvière, Belgium; (S.B.); (N.M.)
- Jerome R. Lechien
- Department of Surgery, UMONS Research Institute for Health Sciences and Technology, University of Mons (UMons), 7000 Mons, Belgium;
- Department of Otolaryngology, Elsan Polyclinic of Poitiers, 86000 Poitiers, France
- Department of Otolaryngology-Head Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), 78035 Paris, France
2. Isch EL, Lee J, Self DM, Sambangi A, Habarth-Morales TE, Vaile J, Caterson EJ. Artificial Intelligence in Surgical Coding: Evaluating Large Language Models for Current Procedural Terminology Accuracy in Hand Surgery. Journal of Hand Surgery Global Online 2025; 7:181-185. PMID: 40182863; PMCID: PMC11963066; DOI: 10.1016/j.jhsg.2024.11.013.
Abstract
Purpose The advent of large language models (LLMs) like ChatGPT has introduced notable advancements in various surgical disciplines. These developments have led to an increased interest in the use of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. Methods This observational study evaluated the effectiveness of five publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for hand surgery procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. Results In the evaluation of artificial intelligence (AI) model performance on simple procedures, Perplexity.AI achieved the highest number of correct outcomes (15), followed by Bard and Bing AI (14 each). ChatGPT 4 and ChatGPT 3.5 yielded 8 and 7 correct outcomes, respectively. For complex procedures, Perplexity.AI and Bard each had three correct outcomes, whereas the ChatGPT models had none. Bing AI had the highest number of partially correct outcomes (5). There were significant associations between AI models and performance outcomes for both simple and complex procedures. Conclusions This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for hand surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care. Type of study/level of evidence Observational, IIIb.
Affiliation(s)
- Emily L. Isch
- Department of General Surgery, Thomas Jefferson University, Philadelphia, PA
- Jamie Lee
- Drexel University College of Medicine, Philadelphia, PA
- D. Mitchell Self
- Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- Abhijeet Sambangi
- Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- John Vaile
- Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- EJ Caterson
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
3. Huo B, Boyle A, Marfo N, Tangamornsuksan W, Steen JP, McKechnie T, Lee Y, Mayol J, Antoniou SA, Thirunavukarasu AJ, Sanger S, Ramji K, Guyatt G. Large Language Models for Chatbot Health Advice Studies: A Systematic Review. JAMA Netw Open 2025; 8:e2457879. PMID: 39903463; PMCID: PMC11795331; DOI: 10.1001/jamanetworkopen.2024.57879.
Abstract
Importance There is much interest in the clinical integration of large language models (LLMs) in health care. Many studies have assessed the ability of LLMs to provide health advice, but the quality of their reporting is uncertain. Objective To perform a systematic review to examine the reporting variability among peer-reviewed studies evaluating the performance of generative artificial intelligence (AI)-driven chatbots for summarizing evidence and providing health advice, in order to inform the development of the Chatbot Assessment Reporting Tool (CHART). Evidence Review A search of MEDLINE via Ovid, Embase via Elsevier, and Web of Science from inception to October 27, 2023, was conducted with the help of a health sciences librarian, yielding 7752 articles. Two reviewers screened articles by title and abstract, followed by full-text review, to identify primary studies evaluating the clinical accuracy of generative AI-driven chatbots in providing health advice (chatbot health advice studies). Two reviewers then performed data extraction for 137 eligible studies. Findings A total of 137 studies were included. Studies examined topics in surgery (55 [40.1%]), medicine (51 [37.2%]), and primary care (13 [9.5%]). Many studies focused on treatment (91 [66.4%]), diagnosis (60 [43.8%]), or disease prevention (29 [21.2%]). Most studies (136 [99.3%]) evaluated inaccessible, closed-source LLMs and did not provide enough information to identify the version of the LLM under evaluation. All studies lacked a sufficient description of LLM characteristics, including temperature, token length, fine-tuning availability, layers, and other details. Most studies (136 [99.3%]) did not describe a prompt engineering phase. The date of LLM querying was reported in 54 (39.4%) studies. Most studies (89 [65.0%]) used subjective means to define the successful performance of the chatbot, while fewer than one-third addressed the ethical, regulatory, and patient safety implications of the clinical integration of LLMs. Conclusions and Relevance In this systematic review of 137 chatbot health advice studies, the reporting quality was heterogeneous and may inform the development of the CHART reporting standards. Ethical, regulatory, and patient safety considerations are crucial as interest grows in the clinical integration of LLMs.
Affiliation(s)
- Bright Huo
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Amy Boyle
- Michael G. DeGroote School of Medicine, McMaster University, Hamilton, Ontario, Canada
- Nana Marfo
- H. Ross University School of Medicine, Miramar, Florida
- Wimonchat Tangamornsuksan
- Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada
- Jeremy P. Steen
- Institute of Health Policy, Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Tyler McKechnie
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Yung Lee
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Julio Mayol
- Hospital Clinico San Carlos, IdISSC, Universidad Complutense de Madrid, Madrid, Spain
- Stephanie Sanger
- Health Science Library, Faculty of Health Sciences, McMaster University, Hamilton, Ontario, Canada
- Karim Ramji
- Division of General Surgery, Department of Surgery, McMaster University, Hamilton, Ontario, Canada
- Gordon Guyatt
- Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada
4. Guirguis PG, Youssef MP, Punreddy A, Botros M, Raiford M, McDowell S. Is Information About Musculoskeletal Malignancies From Large Language Models or Web Resources at a Suitable Reading Level for Patients? Clin Orthop Relat Res 2025; 483:306-315. PMID: 39330944; PMCID: PMC11753740; DOI: 10.1097/corr.0000000000003263.
Abstract
BACKGROUND Patients and caregivers may experience immense distress when receiving the diagnosis of a primary musculoskeletal malignancy and subsequently turn to internet resources for more information. It is not clear whether these resources, including Google and ChatGPT, offer patients information that is readable, a measure of how easy text is to understand. Since many patients turn to Google and artificial intelligence resources for healthcare information, we thought it was important to ascertain whether the information they find is readable and easy to understand. The objective of this study was to compare readability of Google search results and ChatGPT answers to frequently asked questions and assess whether these sources meet NIH recommendations for readability. QUESTIONS/PURPOSES (1) What is the readability of ChatGPT-3.5 as a source of patient information for the three most common primary bone malignancies compared with top online resources from Google search? (2) Do ChatGPT-3.5 responses and online resources meet NIH readability guidelines for patient education materials? METHODS This was a cross-sectional analysis of the 12 most common online questions about osteosarcoma, chondrosarcoma, and Ewing sarcoma. To be consistent with other studies of similar design that utilized national society frequently asked questions lists, questions were selected from the American Cancer Society and categorized based on content, including diagnosis, treatment, and recovery and prognosis. Google was queried using all 36 questions, and top responses were recorded. Author types, such as hospital systems, national health organizations, or independent researchers, were recorded. ChatGPT-3.5 was provided each question in independent queries without further prompting. Responses were assessed with validated reading indices to determine readability by grade level. An independent t-test was performed with significance set at p < 0.05. RESULTS Google (n = 36) and ChatGPT-3.5 (n = 36) answers were recorded, 12 for each of the three cancer types. Reading grade levels based on mean readability scores were 11.0 ± 2.9 and 16.1 ± 3.6, respectively. This corresponds to the eleventh grade reading level for Google and a fourth-year undergraduate student level for ChatGPT-3.5. Google answers were more readable across all individual indices, without differences in word count. No difference in readability was present across author type, question category, or cancer type. Of 72 total responses across both search modalities, none met NIH readability criteria at the sixth-grade level. CONCLUSION Google material was presented at a high school reading level, whereas ChatGPT-3.5 was at an undergraduate reading level. The readability of both resources was inadequate based on NIH recommendations. Improving readability is crucial for better patient understanding during cancer treatment. Physicians should assess patients' needs, offer them tailored materials, and guide them to reliable resources to prevent reliance on online information that is hard to understand. LEVEL OF EVIDENCE Level III, prognostic study.
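A minimal sketch of how such responses can be scored against the NIH readability recommendation using the open-source textstat package (the package choice, sample text, and threshold handling are illustrative assumptions, not the study's published code):

```python
# pip install textstat
import textstat

NIH_TARGET_GRADE = 6  # patient education materials are recommended at or below a sixth-grade level

def readability_report(text: str) -> dict:
    """Score one response with standard readability indices."""
    grade = textstat.flesch_kincaid_grade(text)
    return {
        "flesch_kincaid_grade": grade,
        "flesch_reading_ease": textstat.flesch_reading_ease(text),
        "meets_nih_target": grade <= NIH_TARGET_GRADE,
    }

sample = ("Osteosarcoma is a bone cancer that most often occurs in teenagers. "
          "It is usually treated with chemotherapy and surgery.")
print(readability_report(sample))
```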
Affiliation(s)
- Paul G. Guirguis
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Ankit Punreddy
- University of Rochester School of Medicine and Dentistry, Rochester, NY, USA
- Mina Botros
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Mattie Raiford
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
- Susan McDowell
- Department of Orthopaedics and Physical Performance, University of Rochester Medical Center, Rochester, NY, USA
5. Almekkawi AK, Caruso JP, Anand S, Hawkins AM, Rauf R, Al-Shaikhli M, Aoun SG, Bagley CA. Comparative Analysis of Large Language Models and Spine Surgeons in Surgical Decision-Making and Radiological Assessment for Spine Pathologies. World Neurosurg 2025; 194:123531. PMID: 39622288; DOI: 10.1016/j.wneu.2024.11.114.
Abstract
OBJECTIVE This study aimed to investigate the accuracy of large language models (LLMs), specifically ChatGPT and Claude, in surgical decision-making and radiological assessment for spine pathologies compared to experienced spine surgeons. METHODS The study employed a comparative analysis between the LLMs and a panel of attending spine surgeons. Five written clinical scenarios encompassing various spine pathologies were presented to the LLMs and surgeons, who provided recommended surgical treatment plans. Additionally, magnetic resonance imaging images depicting spine pathologies were analyzed by the LLMs and surgeons to assess their radiological interpretation abilities. Spino-pelvic parameters were estimated from a scoliosis radiograph by the LLMs. RESULTS Qualitative content analysis revealed limitations in the LLMs' consideration of patient-specific factors and the breadth of treatment options. Both ChatGPT and Claude provided detailed descriptions of magnetic resonance imaging findings but differed from the surgeons in terms of specific levels and severity of pathologies. The LLMs acknowledged the limitations of accurately measuring spino-pelvic parameters without specialized tools. The accuracy of surgical decision-making for the LLMs (20%) was lower than that of the attending surgeons (100%). Statistical analysis showed no significant differences in accuracy between the groups. CONCLUSIONS The study highlights the potential of LLMs in assisting with radiological interpretation and surgical decision-making in spine surgery. However, the current limitations, such as the lack of consideration for patient-specific factors and inaccuracies in treatment recommendations, emphasize the need for further refinement and validation of these artificial intelligence (AI) models. Continued collaboration between AI researchers and clinical experts is crucial to address these challenges and realize the full potential of AI in spine surgery.
Affiliation(s)
- Ahmad K Almekkawi
- Saint Luke's Marion Bloch Neuroscience Institute Department of Neurosurgery, Kansas City, Missouri, USA.
- James P Caruso
- The University of Texas Southwestern Department of Neurosurgery, Dallas, Texas, USA
- Soummitra Anand
- The University of Texas Southwestern Department of Neurosurgery, Dallas, Texas, USA
- Angela M Hawkins
- Saint Luke's Marion Bloch Neuroscience Institute Department of Neurosurgery, Kansas City, Missouri, USA
- Rayaan Rauf
- The University of Missouri-Kansas City, School of Medicine, Kansas City, Missouri, USA
- Mayar Al-Shaikhli
- The University of Missouri-Kansas City, School of Medicine, Kansas City, Missouri, USA
- Salah G Aoun
- The University of Texas Southwestern Department of Neurosurgery, Dallas, Texas, USA
- Carlos A Bagley
- Saint Luke's Marion Bloch Neuroscience Institute Department of Neurosurgery, Kansas City, Missouri, USA
6. Khan MM, Scalia G, Shah N, Umana GE, Chavda V, Chaurasia B. Ethical Concerns of AI in Neurosurgery: A Systematic Review. Brain Behav 2025; 15:e70333. PMID: 39935215; PMCID: PMC11814476; DOI: 10.1002/brb3.70333.
Abstract
BACKGROUND The relentless integration of Artificial Intelligence (AI) into neurosurgery necessitates a meticulous exploration of the associated ethical concerns. This systematic review focuses on synthesizing empirical studies, reviews, and opinion pieces from the past decade, offering a nuanced understanding of the evolving intersection between AI and neurosurgical ethics. MATERIALS AND METHODS Following PRISMA guidelines, a systematic review was conducted to identify studies addressing AI in neurosurgery, emphasizing ethical dimensions. The search strategy employed keywords related to AI, neurosurgery, and ethics. Inclusion criteria encompassed empirical studies, reviews, and ethical analyses published in the last decade, with English language restriction. Quality assessment using Joanna Briggs Institute tools ensured methodological rigor. RESULTS Eight key studies were identified, each contributing unique insights to the ethical considerations associated with AI in neurosurgery. Findings highlighted limitations of AI technologies, challenges in data bias, transparency, and legal responsibilities. The studies emphasized the need for responsible AI systems, regulatory oversight, and transparent decision-making in neurosurgical practices. CONCLUSIONS The synthesis of findings underscores the complexity of ethical considerations in the integration of AI in neurosurgery. Transparent and responsible AI use, regulatory oversight, and mitigation of biases emerged as recurring themes. The review calls for the establishment of comprehensive ethical guidelines to ensure safe and equitable AI integration into neurosurgical practices. Ongoing research, educational initiatives, and a culture of responsible innovation are crucial for navigating the evolving landscape of AI-driven advancements in neurosurgery.
Affiliation(s)
- Muhammad Mohsin Khan
- Department of Neurosurgery, Hamad General Hospital, Doha, Qatar
- Department of Clinical Research, Dresden International University, Dresden, Germany
- Gianluca Scalia
- Neurosurgery Unit, Department of Head and Neck Surgery, Garibaldi Hospital, Catania, Italy
- Noman Shah
- Department of Neurosurgery, Hamad General Hospital, Doha, Qatar
- Vishal Chavda
- Department of Medicine, Multispeciality, Trauma and ICCU Centre, Sardar Hospital, Ahmedabad, Gujarat, India
7. Abrahams JM. The Basics of Artificial Intelligence with Applications in Healthcare and Neurosurgery. World Neurosurg 2025; 193:171-175. PMID: 39489333; DOI: 10.1016/j.wneu.2024.10.105.
Affiliation(s)
- John M Abrahams
- Department of Neurosurgery, New York Brain & Spine Surgery, West Harrison, New York, USA.
8. Patil A, Serrato P, Chisvo N, Arnaout O, See PA, Huang KT. Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir (Wien) 2024; 166:475. PMID: 39579215; DOI: 10.1007/s00701-024-06372-9.
Abstract
BACKGROUND Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature. METHODS We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures. RESULTS Fifty-one articles met inclusion criteria, and were categorized into six application areas, with the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning. CONCLUSIONS Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research typically addresses basic applications and overlooks enhancing LLM performance, facing reproducibility issues. Standardizing detailed reporting, considering LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.
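The PubMed arm of the search strategy above can be reproduced programmatically; a sketch using Biopython's Entrez wrapper is shown below (the e-mail address and retmax value are placeholders, and result counts will differ as the literature grows):

```python
from Bio import Entrez

Entrez.email = "researcher@example.org"  # placeholder; NCBI asks for a contact address

QUERY = (
    '("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" '
    'OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery"'
)

# Retrieve matching PubMed IDs for title/abstract screening
handle = Entrez.esearch(db="pubmed", term=QUERY, retmax=500)
record = Entrez.read(handle)
handle.close()
print(record["Count"], "records found; first IDs:", record["IdList"][:5])
```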
Affiliation(s)
- Advait Patil
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA.
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA.
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA.
- Department of Neurosurgery, Boston Children's Hospital, Boston, MA 02115, USA.
- Paul Serrato
- Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA 02115, USA
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Nathan Chisvo
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Omar Arnaout
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
- Pokmeng Alfred See
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Boston Children's Hospital, Boston, MA 02115, USA
- Kevin T Huang
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
9. Chen A, Qilleri A, Foster T, Rao AS, Gopalakrishnan S, Niezgoda J, Oropallo A. Generative Artificial Intelligence: Applications in Scientific Writing and Data Analysis in Wound Healing Research. Adv Skin Wound Care 2024; 37:601-607. PMID: 39792511; DOI: 10.1097/asw.0000000000000226.
Abstract
ABSTRACT Generative artificial intelligence (AI) models are a new technological development with vast research use cases among medical subspecialties. These powerful large language models offer a wide range of possibilities in wound care, from personalized patient support to optimized treatment plans and improved scientific writing. They can also assist in efficiently navigating the literature and selecting and summarizing articles, enabling researchers to focus on impactful studies relevant to wound care management and enhancing response quality through prompt-learning iterations. For nonnative English-speaking medical practitioners and authors, generative AI may aid in grammar and vocabulary selection. Although reports have suggested limitations of the conversational agent on medical translation pertaining to the precise interpretation of medical context, when used with verified resources, this language model can breach language barriers and promote practice-changing advancements in global wound care. Further, AI-powered chatbots can enable continuous monitoring of wound healing progress and real-time insights into treatment responses through frequent, readily available remote patient follow-ups. However, implementing AI in wound care research requires careful consideration of potential limitations, especially in accurately translating complex medical terms and workflows. Ethical considerations are vital to ensure reliable and credible wound care research when using AI technologies. Although ChatGPT shows promise for transforming wound care management, the authors warn against overreliance on the technology. Considering the potential limitations and risks, proper validation and oversight are essential to unlock its true potential while ensuring patient safety and the effectiveness of wound care treatments.
Affiliation(s)
- Adrian Chen
- At the Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, United States, Adrian Chen, BS, Aleksandra Qilleri, BS, and Timothy Foster, BS, are Medical Students. Amit S. Rao, MD, is Project Manager, Department of Surgery, Wound Care Division, Northwell Wound Healing Center and Hyperbarics, Northwell Health, Hempstead. Sandeep Gopalakrishnan, PhD, MAPWCA, is Associate Professor and Director, Wound Healing and Tissue Repair Analytics Laboratory, School of Nursing, College of Health Professions, University of Wisconsin-Milwaukee. Jeffrey Niezgoda, MD, MAPWCA, is Founder and President Emeritus, AZH Wound Care and Hyperbaric Oxygen Therapy Center, Milwaukee, and President and Chief Medical Officer, WebCME, Greendale, Wisconsin. Alisha Oropallo, MD, is Professor of Surgery, Donald and Barbara Zucker School of Medicine and The Feinstein Institutes for Medical Research, Manhasset, New York; Director, Comprehensive Wound Healing Center, Northwell Health; and Program Director, Wound and Burn Fellowship Program, Northwell Health.
10. Mastrokostas PG, Mastrokostas LE, Emara AK, Wellington IJ, Ginalis E, Houten JK, Khalsa AS, Saleh A, Razi AE, Ng MK. GPT-4 as a Source of Patient Information for Anterior Cervical Discectomy and Fusion: A Comparative Analysis Against Google Web Search. Global Spine J 2024; 14:2389-2398. PMID: 38513636; PMCID: PMC11529100; DOI: 10.1177/21925682241241241.
Abstract
STUDY DESIGN Comparative study. OBJECTIVES This study aims to compare Google and GPT-4 in terms of (1) question types, (2) response readability, (3) source quality, and (4) numerical response accuracy for the top 10 most frequently asked questions (FAQs) about anterior cervical discectomy and fusion (ACDF). METHODS "Anterior cervical discectomy and fusion" was searched on Google and GPT-4 on December 18, 2023. Top 10 FAQs were classified according to the Rothwell system. Source quality was evaluated using JAMA benchmark criteria and readability was assessed using Flesch Reading Ease and Flesch-Kincaid grade level. Differences in JAMA scores, Flesch-Kincaid grade level, Flesch Reading Ease, and word count between platforms were analyzed using Student's t-tests. Statistical significance was set at the .05 level. RESULTS Frequently asked questions from Google were varied, while GPT-4 focused on technical details and indications/management. GPT-4 showed a higher Flesch-Kincaid grade level (12.96 vs 9.28, P = .003), lower Flesch Reading Ease score (37.07 vs 54.85, P = .005), and higher JAMA scores for source quality (3.333 vs 1.800, P = .016). Numerically, 6 out of 10 responses varied between platforms, with GPT-4 providing broader recovery timelines for ACDF. CONCLUSIONS This study demonstrates GPT-4's ability to elevate patient education by providing high-quality, diverse information tailored to those with advanced literacy levels. As AI technology evolves, refining these tools for accuracy and user-friendliness remains crucial, catering to patients' varying literacy levels and information needs in spine surgery.
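The between-platform comparison described here is an independent-samples t-test on per-question scores; a sketch with made-up placeholder values (not the study's data) is shown below:

```python
from scipy import stats

# Hypothetical per-question Flesch-Kincaid grade levels (placeholders, not the study's data)
google_fkgl = [9.1, 8.7, 10.2, 9.5, 8.9, 9.8, 9.0, 9.6, 8.4, 9.3]
gpt4_fkgl = [12.8, 13.5, 12.1, 13.0, 12.6, 13.9, 12.4, 13.2, 12.9, 13.3]

# Two-sided Student's t-test with significance set at the .05 level, as in the study
t_stat, p_value = stats.ttest_ind(google_fkgl, gpt4_fkgl)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```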
Affiliation(s)
- Paul G. Mastrokostas
- College of Medicine, State University of New York (SUNY) Downstate, Brooklyn, NY, USA
- Ahmed K. Emara
- Department of Orthopaedic Surgery, Cleveland Clinic, Cleveland, OH, USA
- Ian J. Wellington
- Department of Orthopaedic Surgery, University of Connecticut, Hartford, CT, USA
- John K. Houten
- Department of Neurosurgery, Mount Sinai School of Medicine, New York, NY, USA
- Amrit S. Khalsa
- Department of Orthopaedic Surgery, University of Pennsylvania, Philadelphia, PA, USA
- Ahmed Saleh
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Afshin E. Razi
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
- Mitchell K. Ng
- Department of Orthopaedic Surgery, Maimonides Medical Center, Brooklyn, NY, USA
11. Isch EL, Sarikonda A, Sambangi A, Carreras A, Sircar A, Self DM, Habarth-Morales TE, Caterson EJ, Aycart M. Evaluating the Efficacy of Large Language Models in CPT Coding for Craniofacial Surgery: A Comparative Analysis. J Craniofac Surg 2024:00001665-990000000-01868. PMID: 39221924; DOI: 10.1097/scs.0000000000010575.
Abstract
BACKGROUND The advent of Large Language Models (LLMs) like ChatGPT has introduced significant advancements in various surgical disciplines. These developments have led to an increased interest in the utilization of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. METHODS This observational study evaluated the effectiveness of 5 publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for craniofacial procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. RESULTS The results indicate that while there is no overall significant association between the type of AI model and the correctness of CPT code identification, there are notable differences in performance for simple and complex CPT codes among the models. Specifically, ChatGPT 4.0 showed higher accuracy for complex codes, whereas Perplexity.AI and Bard were more consistent with simple codes. DISCUSSION The use of AI chatbots for CPT coding in craniofacial surgery presents a promising avenue for reducing the administrative burden and associated costs of manual coding. Despite the lower accuracy rates compared with specialized, trained algorithms, the accessibility and minimal training requirements of the AI chatbots make them attractive alternatives. The study also suggests that priming AI models with operative notes may enhance their accuracy, offering a resource-efficient strategy for improving CPT coding in clinical practice. CONCLUSIONS This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for craniofacial surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care.
Affiliation(s)
- Emily L Isch
- Department of General Surgery, Thomas Jefferson University
- Adrija Sircar
- Sidney Kimmel Medical College at Thomas Jefferson University
- D Mitchell Self
- Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- E J Caterson
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
- Mario Aycart
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
12. Ward M, Unadkat P, Toscano D, Kashanian A, Lynch DG, Horn AC, D'Amico RS, Mittler M, Baum GR. A Quantitative Assessment of ChatGPT as a Neurosurgical Triaging Tool. Neurosurgery 2024; 95:487-495. PMID: 38353523; DOI: 10.1227/neu.0000000000002867.
Abstract
BACKGROUND AND OBJECTIVES ChatGPT is a natural language processing chatbot with increasing applicability to the medical workflow. Although ChatGPT has been shown to be capable of passing the American Board of Neurological Surgery board examination, there has never been an evaluation of the chatbot in triaging and diagnosing novel neurosurgical scenarios without defined answer choices. In this study, we assess ChatGPT's capability to determine the emergent nature of neurosurgical scenarios and make diagnoses based on the information one would find in a neurosurgical consult. METHODS Thirty clinical scenarios were given to 3 attendings, 4 residents, 2 physician assistants, and 2 subinterns. Participants were asked to determine whether each scenario constituted an urgent neurosurgical consultation and what the most likely diagnosis was. Attending responses provided a consensus to use as the answer key. Generative pretrained transformer (GPT) 3.5 and GPT 4 were given the same questions, and their responses were compared with those of the other participants. RESULTS GPT 4 was 100% accurate in both diagnosis and triage of the scenarios. In triaging each situation, GPT 3.5 had an accuracy of 92.59%, slightly below that of a PGY1 (96.3%), with 88.24% sensitivity, 100% specificity, 100% positive predictive value, and 83.3% negative predictive value. When making a diagnosis, GPT 3.5 had an accuracy of 92.59%, which was higher than the subinterns and similar to resident responders. CONCLUSION GPT 4 is able to diagnose and triage neurosurgical scenarios at the level of a senior neurosurgical resident, and there has been a clear improvement from GPT 3.5 to GPT 4. Recent updates giving ChatGPT internet access and more directed functionality are likely to further improve its utility in neurosurgical triage.
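For readers less familiar with the screening metrics quoted in the results, they derive from a standard 2x2 confusion matrix; a minimal sketch with made-up counts (not the study's data) is shown below:

```python
def triage_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard screening metrics from a 2x2 confusion matrix of triage calls."""
    return {
        "sensitivity": tp / (tp + fn),               # true positive rate
        "specificity": tn / (tn + fp),               # true negative rate
        "ppv": tp / (tp + fp),                       # positive predictive value
        "npv": tn / (tn + fn),                       # negative predictive value
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical counts for urgent-versus-nonurgent consultation calls (placeholders only)
print(triage_metrics(tp=15, fp=0, tn=10, fn=2))
```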
Affiliation(s)
- Max Ward
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Prashin Unadkat
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Elmezzi Graduate School of Molecular Medicine, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, New York, USA
- Daniel Toscano
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Alon Kashanian
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Daniel G Lynch
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Alexander C Horn
- Department of Neurological Surgery, Wake Forest School of Medicine, Winston-Salem, North Carolina, USA
- Randy S D'Amico
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Neurological Surgery, Lenox Hill Hospital, New York, New York, USA
- Mark Mittler
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Pediatric Neurosurgery, Cohen Children's Medical Center, Queens, New York, USA
- Griffin R Baum
- Department of Neurological Surgery, Zucker School of Medicine at Hofstra/Northwell, Hempstead, New York, USA
- Department of Neurological Surgery, Lenox Hill Hospital, New York, New York, USA
13. Tuttle JJ, Moshirfar M, Garcia J, Altaf AW, Omidvarnia S, Hoopes PC. Learning the Randleman Criteria in Refractive Surgery: Utilizing ChatGPT-3.5 Versus Internet Search Engine. Cureus 2024; 16:e64768. PMID: 39156271; PMCID: PMC11329333; DOI: 10.7759/cureus.64768.
Abstract
Introduction Large language models such as OpenAI's (San Francisco, CA) ChatGPT-3.5 hold immense potential to augment self-directed learning in medicine, but concerns have arisen regarding their accuracy in specialized fields. This study compares ChatGPT-3.5 with an internet search engine in their ability to define the Randleman criteria and its five parameters within a self-directed learning environment. Methods Twenty-three medical students gathered information on the Randleman criteria. Each student was allocated 10 minutes to interact with ChatGPT-3.5, followed by 10 minutes to search the internet independently. Each ChatGPT-3.5 conversation, student summary, and internet reference was subsequently analyzed for accuracy, efficiency, and reliability. Results ChatGPT-3.5 provided the correct definition for 26.1% of students (6/23, 95% CI: 12.3% to 46.8%), while an independent internet search resulted in sources containing the correct definition for 100% of students (23/23, 95% CI: 87.5% to 100%, p = 0.0001). ChatGPT-3.5 incorrectly identified the Randleman criteria as a corneal ectasia staging system for 17.4% of students (4/23), fabricated a "Randleman syndrome" for 4.3% of students (1/23), and gave no definition for 52.2% of students (12/23). When a definition was given (47.8%, 11/23), a median of two of the five correct parameters was provided, along with a median of two additional falsified parameters. Conclusion The internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Learners should exercise discernment when using ChatGPT-3.5. Future initiatives should evaluate the implementation of prompt engineering and updated large language models.
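A sketch of how the accuracy proportions and 95% confidence intervals reported above can be computed (the interval method is an assumption, since the paper does not state which one was used):

```python
from statsmodels.stats.proportion import proportion_confint

correct, total = 6, 23  # ChatGPT-3.5 provided the correct definition for 6 of 23 students
rate = correct / total
low, high = proportion_confint(count=correct, nobs=total, alpha=0.05, method="wilson")
print(f"accuracy = {rate:.1%}, 95% CI {low:.1%} to {high:.1%}")
```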
Affiliation(s)
- Jared J Tuttle
- Ophthalmology, University of Texas Health Science Center at San Antonio, San Antonio, USA
- Majid Moshirfar
- Hoopes Vision Research Center, Hoopes Vision, Draper, USA
- John A. Moran Eye Center, University of Utah School of Medicine, Salt Lake City, USA
- Eye Banking and Corneal Transplantation, Utah Lions Eye Bank, Murray, USA
- James Garcia
- Ophthalmology, University of Texas Health Science Center at San Antonio, San Antonio, USA
- Amal W Altaf
- Medicine, University of Arizona College of Medicine - Phoenix, Phoenix, USA
14. Şahin Ş, Tekin MS, Yigit YE, Erkmen B, Duymaz YK, Bahşi İ. Evaluating the Success of ChatGPT in Addressing Patient Questions Concerning Thyroid Surgery. J Craniofac Surg 2024:00001665-990000000-01698. PMID: 38861337; DOI: 10.1097/scs.0000000000010395.
Abstract
OBJECTIVE This study aimed to evaluate the utility and efficacy of ChatGPT in addressing questions related to thyroid surgery, taking into account accuracy, readability, and relevance. METHODS A simulated physician-patient consultation on thyroidectomy surgery was conducted by posing 21 hypothetical questions to ChatGPT. Responses were evaluated using the DISCERN score by 3 independent ear, nose and throat specialists. Readability measures including Flesch Reading Ease, Flesch-Kincaid Grade Level, Gunning Fog Index, Simple Measure of Gobbledygook, Coleman-Liau Index, and Automated Readability Index were also applied. RESULTS The majority of ChatGPT responses were rated fair or above using the DISCERN system, with an average score of 45.44 ± 11.24. However, the readability scores were consistently higher than the recommended grade 6 level, indicating the information may not be easily comprehensible to the general public. CONCLUSION While ChatGPT exhibits potential in answering patient queries related to thyroid surgery, its current formulation is not yet optimally tailored for patient comprehension. Further refinements are necessary for its efficient application in the medical domain.
Affiliation(s)
- Şamil Şahin
- Ear Nose and Throat Specialist, Private Practice
- Yesim Esen Yigit
- Department of Otolaryngology, Umraniye Training and Research Hospital, University of Health Sciences, Istanbul
- Burak Erkmen
- Ear Nose and Throat Specialist, Private Practice
- Yasar Kemal Duymaz
- Department of Otolaryngology, Umraniye Training and Research Hospital, University of Health Sciences, Istanbul
- İlhan Bahşi
- Department of Anatomy, Faculty of Medicine, Gaziantep University, Gaziantep, Turkey
15. Valerio JE, Ramirez-Velandia F, Fernandez-Gomez MP, Rea NS, Alvarez-Pinzon AM. Bridging the Global Technology Gap in Neurosurgery: Disparities in Access to Advanced Tools for Brain Tumor Resection. Neurosurgery Practice 2024; 5:e00090. PMID: 39958239; PMCID: PMC11783611; DOI: 10.1227/neuprac.0000000000000090.
Abstract
BACKGROUND AND OBJECTIVES The advent of advanced technologies has brought unprecedented precision and efficacy to neurosurgical procedures for brain tumor resection. Despite the remarkable progress, disparities in technology access across different nations persist, creating significant challenges in providing equitable neurosurgical care. The purpose of the following work was to comprehensively analyze the existing disparities in access to innovative neurosurgical technologies and the impact of such disparities on patient outcomes and research. We seek to shed light on the extent of the problem, the underlying causes, and propose strategies for mitigating these disparities. METHODS A systematic review of published articles, including clinical studies, reports, and healthcare infrastructure assessments, was conducted to gather data on the availability and utilization of advanced neurosurgical technologies in various countries. RESULTS Disparities in technology access in neurosurgery are evident, with high-income countries benefiting from widespread implementation, while low- and middle-income countries face significant challenges in technology adoption. These disparities contribute to variations in surgical outcomes and patient experiences. The root causes of these disparities encompass financial constraints, inadequate infrastructure, and insufficient training and expertise. CONCLUSION Disparities in access to advanced neurosurgical technology remain a critical concern in global neurosurgery. Bridging this gap is essential to ensure that all patients, regardless of their geographic location, can benefit from the advancements in neurosurgical care. A concerted effort involving governments, healthcare institutions, and the international community is required to achieve this goal, advancing the quality of care for patients with brain tumors worldwide.
Affiliation(s)
- Jose E. Valerio
- Department of Neurological Surgery, Palmetto General Hospital, Miami, Florida, USA
- Neurosurgery Oncology Center of Excellence, Department of Neurosurgery, Miami Neuroscience Center at Larkin, South Miami, Florida, USA
- GW School of Business, The George Washington University, Washington, District of Columbia, USA
- Noe S. Rea
- Clinical Research Associate, Latino America Valerio Foundation, Weston, Florida, USA
- Andres M. Alvarez-Pinzon
- The Institute of Neuroscience of Castilla y León (INCYL), Cancer Neuroscience, University of Salamanca (USAL), Salamanca, Spain
- Stanford LEAD Program, Graduate School of Business, Stanford University, Palo Alto, California, USA
- Institute for Human Health and Disease Intervention (I-HEALTH), Florida Atlantic University, Jupiter, Florida, USA
16. Huang KT, Mehta NH, Gupta S, See AP, Arnaout O. Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 Large Language Model in neurosurgery. J Clin Neurosci 2024; 123:151-156. PMID: 38574687; DOI: 10.1016/j.jocn.2024.03.021.
Abstract
BACKGROUND Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery. METHODS A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references. RESULTS GPT-4.0 responses were consistent with current medical guidelines and accounted for recent advances in the field 92.8% and 78.6% of the time, respectively. Neurosurgeons reported GPT-4.0 responses providing unrealistic information or potentially risky information 14.3% and 7.1% of the time, respectively. Assessed on 5-point scales, responses suggested that GPT-4.0 was clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation. CONCLUSION Current general LLM technology can offer generally accurate, safe, and helpful neurosurgical information, but may not fully evaluate medical literature or recent field advances. Citation generation and usage remain unreliable. As this technology becomes more ubiquitous, clinicians will need to exercise caution when dealing with it in practice.
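A minimal sketch of how a panel question might be posed to the model with a fixed, prespecified prompt via the OpenAI Python client (the system prompt, model identifier, and parameters below are illustrative placeholders; the study's actual prompt is not reproduced here):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (  # placeholder standardized prompt, not the study's wording
    "You are assisting an attending neurosurgeon. Answer the clinical question concisely, "
    "flag red-flag symptoms, and cite supporting guidelines or literature where possible."
)

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",   # model identifier is an assumption
        temperature=0,   # fixed to reduce run-to-run variability
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(ask("When is decompressive hemicraniectomy indicated after malignant MCA infarction?"))
```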
Affiliation(s)
- Kevin T Huang
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States.
- Neel H Mehta
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Saksham Gupta
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Alfred P See
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States; Boston Children's Hospital, Department of Neurosurgery, 300 Longwood Avenue, Boston, MA 02115, United States
- Omar Arnaout
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
17. Parikh AO, Oca MC, Conger JR, McCoy A, Chang J, Zhang-Nunes S. Accuracy and Bias in Artificial Intelligence Chatbot Recommendations for Oculoplastic Surgeons. Cureus 2024; 16:e57611. PMID: 38707042; PMCID: PMC11069401; DOI: 10.7759/cureus.57611.
Abstract
Purpose The purpose of this study is to assess the accuracy of and bias in recommendations for oculoplastic surgeons from three artificial intelligence (AI) chatbot systems. Methods ChatGPT, Microsoft Bing Balanced, and Google Bard were asked for recommendations for oculoplastic surgeons practicing in the 20 most populous cities in the United States. Three prompts were used: "can you help me find (an oculoplastic surgeon)/(a doctor who does eyelid lifts)/(an oculofacial plastic surgeon) in (city)." Results A total of 672 suggestions were made across the three prompts; 19.8% of suggestions were excluded, leaving 539 suggested physicians. Of these, 64.1% were oculoplastics specialists (of which 70.1% were American Society of Ophthalmic Plastic and Reconstructive Surgery (ASOPRS) members); 16.1% were general plastic surgery trained, 9.0% were ENT trained, 8.8% were ophthalmology but not oculoplastics trained, and 1.9% were trained in another specialty. Across all AI systems, 27.7% of recommended surgeons were female. Conclusions Among the chatbot systems tested, there were high rates of inaccuracy: up to 38% of recommended surgeons were nonexistent or not practicing in the city requested, and 35.9% of those recommended as oculoplastic/oculofacial plastic surgeons were not oculoplastics specialists. The choice of prompt affected the results, with requests for "a doctor who does eyelid lifts" resulting in more plastic surgeons and ENTs and fewer oculoplastic surgeons. It is important to identify inaccuracies and biases in recommendations provided by AI systems as more patients may start using them to choose a surgeon.
Affiliation(s)
- Alomi O Parikh
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Michael C Oca
- Ophthalmology, University of California San Diego School of Medicine, La Jolla, USA
- Jordan R Conger
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Allison McCoy
- Oculofacial Plastic Surgery, Del Mar Plastic Surgery, San Diego, USA
- Jessica Chang
- Oculofacial Plastic Surgery, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
- Sandy Zhang-Nunes
- Ophthalmology, USC Roski Eye Institute, Keck School of Medicine, University of Southern California, Los Angeles, USA
18. Lee KH, Lee RW. ChatGPT's Accuracy on Magnetic Resonance Imaging Basics: Characteristics and Limitations Depending on the Question Type. Diagnostics (Basel) 2024; 14:171. PMID: 38248048; PMCID: PMC10814518; DOI: 10.3390/diagnostics14020171.
Abstract
Our study aimed to assess the accuracy and limitations of ChatGPT in the domain of MRI, focused on evaluating ChatGPT's performance in answering simple knowledge questions and specialized multiple-choice questions related to MRI. A two-step approach was used to evaluate ChatGPT. In the first step, 50 simple MRI-related questions were asked, and ChatGPT's answers were categorized as correct, partially correct, or incorrect by independent researchers. In the second step, 75 multiple-choice questions covering various MRI topics were posed, and the answers were similarly categorized. The study utilized Cohen's kappa coefficient for assessing interobserver agreement. ChatGPT demonstrated high accuracy in answering straightforward MRI questions, with over 85% classified as correct. However, its performance varied significantly across multiple-choice questions, with accuracy rates ranging from 40% to 66.7%, depending on the topic. This indicated a notable gap in its ability to handle more complex, specialized questions requiring deeper understanding and context. In conclusion, this study critically evaluates the accuracy of ChatGPT in addressing questions related to Magnetic Resonance Imaging (MRI), highlighting its potential and limitations in the healthcare sector, particularly in radiology. Our findings demonstrate that ChatGPT, while proficient in responding to straightforward MRI-related questions, exhibits variability in its ability to accurately answer complex multiple-choice questions that require more profound, specialized knowledge of MRI. This discrepancy underscores the nuanced role AI can play in medical education and healthcare decision-making, necessitating a balanced approach to its application.
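The interobserver agreement statistic mentioned above is Cohen's kappa, which in its standard form is:

```latex
% Cohen's kappa: agreement between two raters, corrected for chance
\kappa = \frac{p_o - p_e}{1 - p_e}
% p_o = observed proportion of agreement, p_e = proportion of agreement expected by chance
```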
Affiliation(s)
- Ro-Woon Lee
- Department of Radiology, Inha University College of Medicine, Incheon 22212, Republic of Korea;
19. Singh A, Das S, Mishra RK, Agrawal A. Artificial intelligence and machine learning in healthcare: Scope and opportunities to use ChatGPT. J Neurosci Rural Pract 2023; 14:391-392. PMID: 37692807; PMCID: PMC10483215; DOI: 10.25259/jnrp_391_2023.
Affiliation(s)
- Ajai Singh
- Executive Director and CEO, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India
- Saikat Das
- Department of Radiation Oncology, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India
- Rakesh Kumar Mishra
- Department of Neurosurgery, Institute of Medical Sciences, Banaras Hindu University, Varanasi, Uttar Pradesh, India
- Amit Agrawal
- Department of Neurosurgery, All India Institute of Medical Sciences, Bhopal, Madhya Pradesh, India