1. Patil A, Serrato P, Cleaver G, Limbania D, See AP, Huang KT. Employing large language models safely and effectively as a practicing neurosurgeon. Acta Neurochir (Wien) 2025; 167:101. [PMID: 40202682 PMCID: PMC11982131 DOI: 10.1007/s00701-025-06515-6]
Abstract
BACKGROUND Large Language Models (LLMs) have demonstrated significant capabilities to date in working with a neurosurgical knowledge base and have the potential to enhance neurosurgical practice and education. However, their role in the clinical workspace is still being actively explored. As many neurosurgeons seek to incorporate this technology into their local practice environments, we explore pertinent questions about how to deploy these systems in a safe and efficacious manner. METHODS The authors performed a literature search of LLM studies in neurosurgery in the PubMed database ("LLM" and "neurosurgery"). Papers were reviewed for LLM use cases, considerations taken for selection of specific LLMs, and challenges encountered, including processing of private health information. RESULTS The authors provide a review of core principles underpinning model selection, including technical considerations such as model access, context windows, multimodality, retrieval-augmented generation, and benchmark performance, as well as relative advantages of current LLMs. Additionally, the authors discuss safety considerations and paths for institutional support in safe LLM inference on private health data. The resulting discussion forms a framework for key dimensions neurosurgeons employing LLMs should consider. CONCLUSIONS LLMs present promising opportunities to advance neurosurgical practice, but their clinical adoption necessitates careful consideration of technical, ethical, and regulatory hurdles. By thoughtfully evaluating model selection, deployment approaches, and compliance requirements, neurosurgeons can leverage the benefits of LLMs while minimizing potential risks.
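The retrieve-then-prompt pattern behind the retrieval-augmented generation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming TF-IDF retrieval as a stand-in for a production embedding model and placeholder documents; it is not the workflow described in the paper.

```python
# Minimal retrieval-augmented generation (RAG) sketch: retrieve the most
# relevant local documents, then build a grounded prompt for an LLM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder, de-identified institutional documents (illustrative only).
notes = [
    "Institutional protocol for postoperative care after craniotomy ...",
    "Guideline summary for management of chronic subdural hematoma ...",
    "Departmental policy on perioperative antiplatelet management ...",
]
query = "When should antiplatelet agents be resumed after spine surgery?"

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(notes)   # vectorize the document store
query_vec = vectorizer.transform([query])      # vectorize the query
scores = cosine_similarity(query_vec, doc_matrix).ravel()
top_k = scores.argsort()[::-1][:2]             # indices of the best matches

context = "\n\n".join(notes[i] for i in top_k)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then go to an institutionally approved LLM endpoint, so that
# private health information never leaves the approved environment.
```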
Affiliation(s)
- Advait Patil
- Department of Neurosurgery, Harvard Medical School, Boston, MA, 02467, USA
- Department of Neurosurgery, Mass General Brigham, Boston, MA, 02467, USA
- Paul Serrato
- Department of Neurosurgery, Mass General Brigham, Boston, MA, 02467, USA
- Department of Neurosurgery, Yale School of Medicine, New Haven, CT, 06510, USA
- Gracie Cleaver
- Department of Neurosurgery, Harvard Medical School, Boston, MA, 02467, USA
- Department of Neurosurgery, Mass General Brigham, Boston, MA, 02467, USA
- Daniela Limbania
- Department of Neurosurgery, Harvard Medical School, Boston, MA, 02467, USA
- Department of Neurosurgery, Mass General Brigham, Boston, MA, 02467, USA
- Alfred Pokmeng See
- Department of Neurosurgery, Harvard Medical School, Boston, MA, 02467, USA
- Department of Neurosurgery, Boston Children's Hospital, Boston, MA, 02467, USA
- Kevin T Huang
- Department of Neurosurgery, Harvard Medical School, Boston, MA, 02467, USA
- Department of Neurosurgery, Mass General Brigham, Boston, MA, 02467, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
2. Isch EL, Lee J, Self DM, Sambangi A, Habarth-Morales TE, Vaile J, Caterson EJ. Artificial Intelligence in Surgical Coding: Evaluating Large Language Models for Current Procedural Terminology Accuracy in Hand Surgery. Journal of Hand Surgery Global Online 2025; 7:181-185. [PMID: 40182863 PMCID: PMC11963066 DOI: 10.1016/j.jhsg.2024.11.013]
Abstract
Purpose The advent of large language models (LLMs) like ChatGPT has introduced notable advancements in various surgical disciplines. These developments have led to an increased interest in the use of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. Methods This observational study evaluated the effectiveness of five publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for hand surgery procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. Results In the evaluation of artificial intelligence (AI) model performance on simple procedures, Perplexity.AI achieved the highest number of correct outcomes (15), followed by Bard and Bing AI (14 each). ChatGPT 4 and ChatGPT 3.5 yielded 8 and 7 correct outcomes, respectively. For complex procedures, Perplexity.AI and Bard each had three correct outcomes, whereas the ChatGPT models had none. Bing AI had the highest number of partially correct outcomes (5). There were significant associations between AI models and performance outcomes for both simple and complex procedures. Conclusions This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for hand surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care. Type of study/level of evidence Observational, IIIb.
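The association the abstract reports can be illustrated with a chi-square test of independence. The sketch below uses the stated correct counts for simple procedures and assumes a hypothetical common denominator of 20 cases per model, since the abstract does not give the true totals.

```python
# Hedged sketch of a model-vs-outcome association test (chi-square).
import numpy as np
from scipy.stats import chi2_contingency

# Correct counts for simple procedures, per the abstract:
# Perplexity.AI, Bard, Bing AI, ChatGPT 4, ChatGPT 3.5
correct = np.array([15, 14, 14, 8, 7])
total = 20  # hypothetical denominator, not reported in the abstract
table = np.vstack([correct, total - correct])  # rows: correct / not correct
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```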
Affiliation(s)
- Emily L. Isch
- Department of General Surgery, Thomas Jefferson University, Philadelphia, PA
- Jamie Lee
- Drexel University College of Medicine, Philadelphia, PA
- D. Mitchell Self
- Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- Abhijeet Sambangi
- Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- John Vaile
- Sidney Kimmel Medical College at Thomas Jefferson University, Philadelphia, PA
- EJ Caterson
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
3. Aster A, Laupichler MC, Rockwell-Kollmann T, Masala G, Bala E, Raupach T. ChatGPT and Other Large Language Models in Medical Education - Scoping Literature Review. Medical Science Educator 2025; 35:555-567. [PMID: 40144083 PMCID: PMC11933646 DOI: 10.1007/s40670-024-02206-6]
Abstract
This review aims to provide a summary of all scientific publications on the use of large language models (LLMs) in medical education over the first year of their availability. A scoping literature review was conducted in accordance with the PRISMA recommendations for scoping reviews. Five scientific literature databases were searched using predefined search terms. The search yielded 1509 initial results, of which 145 studies were ultimately included. Most studies assessed LLMs' capabilities in passing medical exams. Some studies discussed advantages, disadvantages, and potential use cases of LLMs. Very few studies conducted empirical research. Many published studies lack methodological rigor. We therefore propose a research agenda to improve the quality of studies on LLMs.
Affiliation(s)
- Alexandra Aster
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Matthias Carl Laupichler
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tamina Rockwell-Kollmann
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Gilda Masala
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Ebru Bala
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
- Tobias Raupach
- Institute of Medical Education, University Hospital Bonn, Venusberg-Campus 1, 53127 Bonn, Germany
4. Almekkawi AK, Caruso JP, Anand S, Hawkins AM, Rauf R, Al-Shaikhli M, Aoun SG, Bagley CA. Comparative Analysis of Large Language Models and Spine Surgeons in Surgical Decision-Making and Radiological Assessment for Spine Pathologies. World Neurosurg 2025; 194:123531. [PMID: 39622288 DOI: 10.1016/j.wneu.2024.11.114]
Abstract
OBJECTIVE This study aimed to investigate the accuracy of large language models (LLMs), specifically ChatGPT and Claude, in surgical decision-making and radiological assessment for spine pathologies compared to experienced spine surgeons. METHODS The study employed a comparative analysis between the LLMs and a panel of attending spine surgeons. Five written clinical scenarios encompassing various spine pathologies were presented to the LLMs and surgeons, who provided recommended surgical treatment plans. Additionally, magnetic resonance imaging images depicting spine pathologies were analyzed by the LLMs and surgeons to assess their radiological interpretation abilities. Spino-pelvic parameters were estimated from a scoliosis radiograph by the LLMs. RESULTS Qualitative content analysis revealed limitations in the LLMs' consideration of patient-specific factors and the breadth of treatment options. Both ChatGPT and Claude provided detailed descriptions of magnetic resonance imaging findings but differed from the surgeons in terms of specific levels and severity of pathologies. The LLMs acknowledged the limitations of accurately measuring spino-pelvic parameters without specialized tools. The accuracy of surgical decision-making for the LLMs (20%) was lower than that of the attending surgeons (100%). Statistical analysis showed no significant differences in accuracy between the groups. CONCLUSIONS The study highlights the potential of LLMs in assisting with radiological interpretation and surgical decision-making in spine surgery. However, the current limitations, such as the lack of consideration for patient-specific factors and inaccuracies in treatment recommendations, emphasize the need for further refinement and validation of these artificial intelligence (AI) models. Continued collaboration between AI researchers and clinical experts is crucial to address these challenges and realize the full potential of AI in spine surgery.
Affiliation(s)
- Ahmad K Almekkawi
- Saint Luke's Marion Bloch Neuroscience Institute, Department of Neurosurgery, Kansas City, Missouri, USA
- James P Caruso
- The University of Texas Southwestern, Department of Neurosurgery, Dallas, Texas, USA
- Soummitra Anand
- The University of Texas Southwestern, Department of Neurosurgery, Dallas, Texas, USA
- Angela M Hawkins
- Saint Luke's Marion Bloch Neuroscience Institute, Department of Neurosurgery, Kansas City, Missouri, USA
- Rayaan Rauf
- The University of Missouri-Kansas City, School of Medicine, Kansas City, Missouri, USA
- Mayar Al-Shaikhli
- The University of Missouri-Kansas City, School of Medicine, Kansas City, Missouri, USA
- Salah G Aoun
- The University of Texas Southwestern, Department of Neurosurgery, Dallas, Texas, USA
- Carlos A Bagley
- Saint Luke's Marion Bloch Neuroscience Institute, Department of Neurosurgery, Kansas City, Missouri, USA
5. Koga S. Advancing large language models in nephrology: bridging the gap in image interpretation. Clin Exp Nephrol 2025; 29:128-129. [PMID: 39465433 DOI: 10.1007/s10157-024-02581-9]
Affiliation(s)
- Shunsuke Koga
- Department of Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania, 3400 Spruce Street, Philadelphia, PA, 19104, USA
6. Patil A, Serrato P, Chisvo N, Arnaout O, See PA, Huang KT. Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir (Wien) 2024; 166:475. [PMID: 39579215 DOI: 10.1007/s00701-024-06372-9]
Abstract
BACKGROUND Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature. METHODS We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures. RESULTS Fifty-one articles met inclusion criteria, and were categorized into six application areas, with the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning. CONCLUSIONS Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. However, research typically addresses basic applications and overlooks enhancing LLM performance, facing reproducibility issues. Standardizing detailed reporting, considering LLM stochasticity, and using advanced methods beyond basic validation are essential for progress.
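The PubMed arm of this search can be reproduced programmatically. Below is a sketch using Biopython's Entrez interface, with the boolean string copied from the abstract; the email address is a placeholder required by NCBI.

```python
# Sketch: run the review's PubMed query via NCBI E-utilities (Biopython).
from Bio import Entrez

Entrez.email = "researcher@example.org"  # placeholder; NCBI requires an email
query = ('("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" '
         'OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" '
         'OR "MISTRAL" OR "BARD") AND "neurosurgery"')
handle = Entrez.esearch(db="pubmed", term=query, retmax=500)
record = Entrez.read(handle)
handle.close()
print(record["Count"], record["IdList"][:5])  # hit count and first PMIDs
```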
Affiliation(s)
- Advait Patil
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
- Department of Neurosurgery, Boston Children's Hospital, Boston, MA, 02115, USA
- Paul Serrato
- Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
- Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Nathan Chisvo
- Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Omar Arnaout
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
- Pokmeng Alfred See
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Boston Children's Hospital, Boston, MA, 02115, USA
- Kevin T Huang
- Harvard Medical School, Harvard University, Boston, MA, 02115, USA
- Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
7. Waldock WJ, Zhang J, Guni A, Nabeel A, Darzi A, Ashrafian H. The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis. J Med Internet Res 2024; 26:e56532. [PMID: 39499913 PMCID: PMC11576595 DOI: 10.2196/56532]
Abstract
BACKGROUND Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations. OBJECTIVE We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards. METHODS We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers published in English-language journals up to September 10, 2023, that reported clear LLM accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, there was no LLM, there was no evaluation of comparable success accuracy, and the literature was not original research. The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: "artificial intelligence," "ChatGPT," "GPT," "LLM," "large language model," "machine learning," "neural network," "Generative Pre-trained Transformer," "Generative Transformer," "Generative Language Model," "Generative Model," "medical exam," "healthcare exam," and "clinical exam." Sensitivity, accuracy, and precision data were extracted, including relevant CIs. RESULTS The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened by title and abstract, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67). CONCLUSIONS LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign, i.e., Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations. TRIAL REGISTRATION OSF Registries osf.io/xqzkw; https://osf.io/xqzkw.
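To show how a pooled accuracy such as 0.61 (CI 0.58-0.64) is formed, here is a deliberately naive sketch that pools hypothetical per-study counts into one proportion with a Wilson interval; a meta-analysis of this kind would typically use a random-effects model instead, to absorb between-study heterogeneity.

```python
# Naive pooled-proportion sketch with a Wilson 95% CI (illustrative data).
from statsmodels.stats.proportion import proportion_confint

correct = [412, 185, 96]  # hypothetical correct answers per study
totals = [640, 310, 170]  # hypothetical question counts per study
k, n = sum(correct), sum(totals)
low, high = proportion_confint(k, n, alpha=0.05, method="wilson")
print(f"pooled accuracy {k / n:.2f} (95% CI {low:.2f}-{high:.2f})")
```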
Affiliation(s)
- Joe Zhang
- Imperial College London, London, United Kingdom
- Ahmad Guni
- Imperial College London, London, United Kingdom
- Ahmad Nabeel
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
- Ara Darzi
- Imperial College London, London, United Kingdom
- Hutan Ashrafian
- Institute of Global Health Innovation, Imperial College London, London, United Kingdom
8. Isch EL, Sarikonda A, Sambangi A, Carreras A, Sircar A, Self DM, Habarth-Morales TE, Caterson EJ, Aycart M. Evaluating the Efficacy of Large Language Models in CPT Coding for Craniofacial Surgery: A Comparative Analysis. J Craniofac Surg 2024:00001665-990000000-01868. [PMID: 39221924 DOI: 10.1097/scs.0000000000010575]
Abstract
BACKGROUND The advent of Large Language Models (LLMs) like ChatGPT has introduced significant advancements in various surgical disciplines. These developments have led to an increased interest in the utilization of LLMs for Current Procedural Terminology (CPT) coding in surgery. With CPT coding being a complex and time-consuming process, often exacerbated by the scarcity of professional coders, there is a pressing need for innovative solutions to enhance coding efficiency and accuracy. METHODS This observational study evaluated the effectiveness of 5 publicly available large language models (Perplexity.AI, Bard, BingAI, ChatGPT 3.5, and ChatGPT 4.0) in accurately identifying CPT codes for craniofacial procedures. A consistent query format was employed to test each model, ensuring the inclusion of detailed procedure components where necessary. The responses were classified as correct, partially correct, or incorrect based on their alignment with established CPT coding for the specified procedures. RESULTS The results indicate that while there is no overall significant association between the type of AI model and the correctness of CPT code identification, there are notable differences in performance for simple and complex CPT codes among the models. Specifically, ChatGPT 4.0 showed higher accuracy for complex codes, whereas Perplexity.AI and Bard were more consistent with simple codes. DISCUSSION The use of AI chatbots for CPT coding in craniofacial surgery presents a promising avenue for reducing the administrative burden and associated costs of manual coding. Despite the lower accuracy rates compared with specialized, trained algorithms, the accessibility and minimal training requirements of the AI chatbots make them attractive alternatives. The study also suggests that priming AI models with operative notes may enhance their accuracy, offering a resource-efficient strategy for improving CPT coding in clinical practice. CONCLUSIONS This study highlights the feasibility and potential benefits of integrating LLMs into the CPT coding process for craniofacial surgery. The findings advocate for further refinement and training of AI models to improve their accuracy and practicality, suggesting a future where AI-assisted coding could become a standard component of surgical workflows, aligning with the ongoing digital transformation in health care.
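The priming idea raised in the discussion can be sketched as a query template. The note text and procedure below are placeholders, not study data, and the grading step mirrors the classification scheme described in the methods.

```python
# Hedged sketch of a CPT-coding query primed with an operative note.
operative_note = "Cranioplasty with custom implant; dura intact; ..."  # placeholder
procedure = "cranioplasty"  # placeholder

prompt = (
    "You are assisting with CPT coding.\n"
    f"Operative note:\n{operative_note}\n\n"
    f"List the CPT code(s) for this {procedure} procedure, "
    "reporting each component code separately."
)
# The prompt would be submitted to each chatbot, and each response graded as
# correct, partially correct, or incorrect against established CPT codes.
print(prompt)
```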
Affiliation(s)
- Emily L Isch
- Department of General Surgery, Thomas Jefferson University
- Adrija Sircar
- Sidney Kimmel Medical College at Thomas Jefferson University
- D Mitchell Self
- Department of Neurosurgery, Thomas Jefferson University and Jefferson Hospital for Neuroscience, Philadelphia, PA
- E J Caterson
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
- Mario Aycart
- Department of Surgery, Division of Plastic Surgery, Nemours Children's Hospital, Wilmington, DE
9. Aamir A, Hafsa H. "Incorporating large language models into academic neurosurgery: embracing the new era". Neurosurg Rev 2024; 47:211. [PMID: 38724772 DOI: 10.1007/s10143-024-02452-7]
Abstract
This correspondence examines the effect of LLMs, such as ChatGPT, on academic neurosurgery. It emphasises the potential of LLMs in enhancing clinical decision-making, medical education, and surgical practice by providing real-time access to extensive medical literature and data analysis. Although this correspondence acknowledges the opportunities that come with the incorporation of LLMs, it also discusses challenges such as data privacy, ethical considerations, and regulatory compliance. Additionally, recent studies have assessed the effectiveness of LLMs in perioperative patient communication and medical education, and stressed the need for cooperation between neurosurgeons, data scientists, and AI experts to address these challenges and fully exploit the potential of LLMs in improving patient care and outcomes in neurosurgery.
Affiliation(s)
- Ali Aamir
- Department of Medicine, Dow University of Health Sciences, Karachi, Pakistan
- Hafiza Hafsa
- Department of Medicine, Dow University of Health Sciences, Karachi, Pakistan
10. Huang KT, Mehta NH, Gupta S, See AP, Arnaout O. Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 Large Language Model in neurosurgery. J Clin Neurosci 2024; 123:151-156. [PMID: 38574687 DOI: 10.1016/j.jocn.2024.03.021]
Abstract
BACKGROUND Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery. METHODS A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references. RESULTS GPT-4.0 responses were consistent with current medical guidelines and accounted for recent advances in the field 92.8% and 78.6% of the time, respectively. Neurosurgeons reported GPT-4.0 responses providing unrealistic or potentially risky information 14.3% and 7.1% of the time, respectively. Assessed on 5-point scales, responses suggested that GPT-4.0 was clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation. CONCLUSION Current general LLM technology can offer generally accurate, safe, and helpful neurosurgical information, but may not fully evaluate medical literature or recent field advances. Citation generation and usage remains unreliable. As this technology becomes more ubiquitous, clinicians will need to exercise caution when dealing with it in practice.
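The abstract describes two attending raters scoring each response on standardized scales but does not report agreement between them. One way such agreement could be quantified is a weighted Cohen's kappa; the sketch below uses simulated 5-point scores for the 35 questions, not the study's data.

```python
# Illustrative inter-rater agreement check (weighted Cohen's kappa).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)
rater1 = rng.integers(3, 6, size=35)  # simulated 5-point scores (values 3-5)
rater2 = np.clip(rater1 + rng.integers(-1, 2, size=35), 1, 5)  # noisy second rater
kappa = cohen_kappa_score(rater1, rater2, weights="quadratic")
print(f"weighted kappa = {kappa:.2f}")
```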
Affiliation(s)
- Kevin T Huang
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Neel H Mehta
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Saksham Gupta
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Alfred P See
- Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States; Boston Children's Hospital, Department of Neurosurgery, 300 Longwood Avenue, Boston, MA 02115, United States
- Omar Arnaout
- Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States; Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
11. Dabbas WF, Odeibat YM, Alhazaimeh M, Hiasat MY, Alomari AA, Marji A, Samara QA, Ibrahim B, Al Arabiyat RM, Momani G. Accuracy of ChatGPT in Neurolocalization. Cureus 2024; 16:e59143. [PMID: 38803743 PMCID: PMC11129669 DOI: 10.7759/cureus.59143]
Abstract
Introduction ChatGPT (OpenAI Incorporated, Mission District, San Francisco, United States) is an artificial intelligence (AI) chatbot with advanced communication skills and a massive knowledge database. However, its application in medicine, specifically in neurolocalization, necessitates clinical reasoning in addition to deep neuroanatomical knowledge. This article examines ChatGPT's capabilities in neurolocalization. Methods Forty-six text-based neurolocalization case scenarios were presented to ChatGPT-3.5 from November 6, 2023, to November 16, 2023. Seven neurosurgeons scored the accuracy of ChatGPT's responses to these cases using a 5-point scoring system recommended by ChatGPT. Results ChatGPT-3.5 achieved an accuracy score of 84.8% in generating "completely correct" and "mostly correct" responses. ANOVA suggested a consistent scoring approach across evaluators. The mean length of the case text was 69.8 tokens (SD 20.8). Conclusion While this accuracy score is promising, it is not yet reliable for routine patient care. We recommend keeping interactions with ChatGPT concise, precise, and simple to improve response accuracy. As AI continues to evolve, it promises significant and innovative breakthroughs in medicine.
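The between-evaluator consistency check the abstract mentions can be sketched as a one-way ANOVA across the seven raters. The scores below are simulated 5-point values for the 46 cases, not the study's data.

```python
# Sketch of a one-way ANOVA across seven evaluators (simulated scores).
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
evaluators = [rng.integers(3, 6, size=46) for _ in range(7)]  # 5-point scores (3-5)
f_stat, p = f_oneway(*evaluators)
print(f"F = {f_stat:.2f}, p = {p:.3f}")  # a large p is consistent with similar rater means
```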
Affiliation(s)
- Waleed F Dabbas
- Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Mohammad Alhazaimeh
- Division of Neurosurgery, Department of Clinical Sciences, Faculty of Medicine, Yarmouk University, Irbid, JOR
- Amer A Alomari
- Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA
- Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Mutah University, Al-Karak, JOR
- Ala Marji
- Department of Neurosurgery, King Hussein Cancer Center, Amman, JOR
- Department of Neurosurgery, San Filippo Neri Hospital/Azienda Sanitaria Locale (ASL) Roma 1, Rome, ITA
- Qais A Samara
- Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Bilal Ibrahim
- Division of Neurosurgery, Department of Special Surgery, Faculty of Medicine, Al-Balqa Applied University, Al-Salt, JOR
- Rashed M Al Arabiyat
- Department of General Practice, Al-Hussein Salt New Hospital, Ministry of Health, Al-Salt, JOR
- Ghena Momani
- Faculty of Medicine, The Hashemite University, Zarqa, JOR
12. Powers AY, McCandless MG, Taussky P, Vega RA, Shutran MS, Moses ZB. Educational Limitations of ChatGPT in Neurosurgery Board Preparation. Cureus 2024; 16:e58639. [PMID: 38770467 PMCID: PMC11104278 DOI: 10.7759/cureus.58639]
Abstract
Objective This study evaluated the potential of Chat Generative Pre-trained Transformer (ChatGPT) as an educational tool for neurosurgery residents preparing for the American Board of Neurological Surgery (ABNS) primary examination. Methods Non-imaging questions from the Congress of Neurological Surgeons (CNS) Self-Assessment in Neurological Surgery (SANS) online question bank were input into ChatGPT. Accuracy was evaluated and compared to human performance across subcategories. To quantify ChatGPT's educational potential, the concordance and insight of explanations were assessed by multiple neurosurgical faculty. Associations among these metrics as well as question length were evaluated. Results ChatGPT had an accuracy of 50.4% (1,068/2,120), with the highest and lowest accuracies in the pharmacology (81.2%, 13/16) and vascular (32.9%, 91/277) subcategories, respectively. ChatGPT performed worse than humans overall, as well as in the functional, other, peripheral, radiology, spine, trauma, tumor, and vascular subcategories. There were no subjects in which ChatGPT performed better than humans, and its accuracy was below that required to pass the exam. The mean concordance was 93.4% (198/212) and the mean insight score was 2.7. Accuracy was negatively associated with question length (R²=0.29, p=0.03) but positively associated with both concordance (p<0.001, q<0.001) and insight (p<0.001, q<0.001). Conclusions The current study provides the largest and most comprehensive assessment of the accuracy and explanatory quality of ChatGPT in answering ABNS primary exam questions. The findings demonstrate shortcomings regarding ChatGPT's ability to pass, let alone teach, the neurosurgical boards.
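The reported accuracy-length association (R²=0.29, p=0.03) is the kind of result a simple linear regression yields. The sketch below uses illustrative binned data, not the study's question bank.

```python
# Sketch of a linear regression of accuracy on question length (illustrative).
import numpy as np
from scipy.stats import linregress

length_tokens = np.array([20, 40, 60, 80, 100, 120])       # hypothetical bins
accuracy = np.array([0.62, 0.58, 0.52, 0.50, 0.44, 0.41])  # hypothetical per-bin accuracy
res = linregress(length_tokens, accuracy)
print(f"slope = {res.slope:.4f}, R^2 = {res.rvalue ** 2:.2f}, p = {res.pvalue:.3f}")
```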
Affiliation(s)
- Andrew Y Powers
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
- Philipp Taussky
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
- Rafael A Vega
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
- Max S Shutran
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA
- Ziev B Moses
- Neurosurgery, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, USA