1. Hadjiathanasiou A, Goelz L, Muhn F, Heinz R, Kreißl L, Sparenberg P, Lemcke J, Schmehl I, Mutze S, Schuss P. Artificial intelligence in neurovascular decision-making: a comparative analysis of ChatGPT-4 and multidisciplinary expert recommendations for unruptured intracranial aneurysms. Neurosurg Rev 2025; 48:261. PMID: 39982556. DOI: 10.1007/s10143-025-03341-3. Received 10/17/2024; revised 01/05/2025; accepted 02/01/2025.
Abstract
In the multidisciplinary treatment of cerebrovascular diseases, specialists from different disciplines strive to develop patient-specific treatment recommendations. ChatGPT is a natural language processing chatbot with increasing applicability in medical practice. This study evaluates ChatGPT's ability to provide treatment recommendations for patients with unruptured intracranial aneurysms (UIA). Anonymized patient data and radiological reports of 20 patients with UIAs were provided to GPT-4 in a standardized format and used to generate a treatment recommendation for different clinical scenarios. GPT-4 responses were evaluated by a multidisciplinary panel of specialists using a Likert scale and subsequently benchmarked against the Unruptured Intracranial Aneurysm Treatment Score (UIATS) as well as the actual treatment decision made by the multidisciplinary institutional neurovascular board (INVB). Agreement between expert raters was measured using the linear weighted Fleiss kappa coefficient. GPT-4 analyzed individual pathological features of the radiological reports and formulated a corresponding assessment for each aspect. None of the generated recommendations showed evidence of factual hallucination, although in 25% of the cases no specific recommendation could be derived from the GPT-4 responses. The expert panel rated the overall quality of the GPT-4 recommendations with a median of 3.4 out of 5 points. The GPT-4 recommendations were congruent with those of the INVB in 65% of cases. Interrater reliability among experts showed low to moderate agreement in the assessment of AI-assisted decision-making. GPT-4 appears able to process clinical information about UIAs and generate treatment recommendations. However, the level of ambiguity and the use of scientific evidence in the recommendations are not yet patient- and case-specific enough to substitute for the decision-making of a multidisciplinary neurovascular board.
A prospective evaluation of GPT-4's competence as a companion in decision-making panels is deemed necessary.
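For readers unfamiliar with weighted agreement statistics, the linear weighted kappa mentioned above can be illustrated for the two-rater case. This is a minimal Python sketch of the weighting idea, not the authors' code (the study used Fleiss' multi-rater extension, and the rating scale shown here is assumed to be an ordinal 1..k scale such as a Likert scale):

```python
from collections import Counter

def linear_weighted_kappa(rater1, rater2, n_categories):
    """Linearly weighted kappa for two raters on an ordinal 1..n_categories scale.

    Assumes n_categories >= 2 and equal-length rating lists.
    """
    n = len(rater1)
    # Linear agreement weights: 1 for identical ratings, decreasing with distance.
    w = [[1 - abs(i - j) / (n_categories - 1) for j in range(n_categories)]
         for i in range(n_categories)]
    # Observed weighted agreement across all rated cases.
    po = sum(w[a - 1][b - 1] for a, b in zip(rater1, rater2)) / n
    # Expected weighted agreement under independent rater marginals.
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(w[i][j] * (c1[i + 1] / n) * (c2[j + 1] / n)
             for i in range(n_categories) for j in range(n_categories))
    return (po - pe) / (1 - pe)
```

With linear weights, near-misses on the scale (e.g. a 4 versus a 5) count as partial agreement, which is why weighted kappa is preferred over plain kappa for Likert-style ratings.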
Affiliation(s)
- Leonie Goelz
  - Department of Radiology and Neuroradiology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
  - Institute for Diagnostic Radiology and Neuroradiology, Universitätsmedizin Greifswald, Greifswald, Germany
- Florian Muhn
  - Department of Neurology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Rebecca Heinz
  - Department of Neurosurgery, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Lutz Kreißl
  - Department of Radiology and Neuroradiology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Paul Sparenberg
  - Department of Neurology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Johannes Lemcke
  - Department of Neurosurgery, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Ingo Schmehl
  - Department of Neurology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
- Sven Mutze
  - Department of Radiology and Neuroradiology, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
  - Institute for Diagnostic Radiology and Neuroradiology, Universitätsmedizin Greifswald, Greifswald, Germany
- Patrick Schuss
  - Department of Neurosurgery, BG Klinikum Unfallkrankenhaus Berlin, Berlin, Germany
2. Patil A, Serrato P, Chisvo N, Arnaout O, See PA, Huang KT. Large language models in neurosurgery: a systematic review and meta-analysis. Acta Neurochir (Wien) 2024; 166:475. PMID: 39579215. DOI: 10.1007/s00701-024-06372-9. Received 09/22/2024; accepted 11/18/2024.
Abstract
BACKGROUND Large Language Models (LLMs) have garnered increasing attention in neurosurgery and possess significant potential to improve the field. However, the breadth and performance of LLMs across diverse neurosurgical tasks have not been systematically examined, and LLMs come with their own challenges and unique terminology. We seek to identify key models, establish reporting guidelines for replicability, and highlight progress in key application areas of LLM use in the neurosurgical literature. METHODS We searched PubMed and Google Scholar using terms related to LLMs and neurosurgery ("large language model" OR "LLM" OR "ChatGPT" OR "GPT-3" OR "GPT3" OR "GPT-3.5" OR "GPT3.5" OR "GPT-4" OR "GPT4" OR "LLAMA" OR "MISTRAL" OR "BARD") AND "neurosurgery". The final set of articles was reviewed for publication year, application area, specific LLM(s) used, control/comparison groups used to evaluate LLM performance, whether the article reported specific LLM prompts, prompting strategy types used, whether the LLM query could be reproduced in its entirety (including both the prompt used and any adjoining data), measures of hallucination, and reported performance measures. RESULTS Fifty-one articles met inclusion criteria, and were categorized into six application areas, with the most common being Generation of Text for Direct Clinical Use (n = 14, 27.5%), Answering Standardized Exam Questions (n = 12, 23.5%), and Clinical Judgement and Decision-Making Support (n = 11, 21.6%). The most frequently used LLMs were GPT-3.5 (n = 30, 58.8%), GPT-4 (n = 20, 39.2%), Bard (n = 9, 17.6%), and Bing (n = 6, 11.8%). Most studies (n = 43, 84.3%) used LLMs directly out-of-the-box, while 8 studies (15.7%) conducted advanced pre-training or fine-tuning. CONCLUSIONS Large language models show advanced capabilities in complex tasks and hold potential to transform neurosurgery. 
However, most research addresses only basic applications, overlooks strategies for enhancing LLM performance, and suffers from reproducibility issues. Standardized detailed reporting, consideration of LLM stochasticity, and advanced methods beyond basic validation are essential for progress.
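The boolean search strategy quoted in the methods can be reproduced programmatically against PubMed via NCBI's E-utilities. The sketch below only builds the query term and the ESearch URL (it does not perform the request); the `retmax` value is an arbitrary illustration, not a parameter from the review:

```python
from urllib.parse import urlencode

# Model/tool terms quoted in the review's search strategy.
MODEL_TERMS = ["large language model", "LLM", "ChatGPT", "GPT-3", "GPT3",
               "GPT-3.5", "GPT3.5", "GPT-4", "GPT4", "LLAMA", "MISTRAL", "BARD"]

def build_search_term():
    # ("term1" OR "term2" OR ...) AND "neurosurgery"
    ors = " OR ".join(f'"{t}"' for t in MODEL_TERMS)
    return f'({ors}) AND "neurosurgery"'

def esearch_url(term, retmax=100):
    # NCBI E-utilities ESearch endpoint for the PubMed database.
    base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    return base + "?" + urlencode({"db": "pubmed", "term": term,
                                   "retmode": "json", "retmax": retmax})
```

Fetching the resulting URL (e.g. with `urllib.request`) returns a JSON list of matching PMIDs, which is the usual starting point for a screening pipeline like the one described.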
Affiliation(s)
- Advait Patil
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
  - Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
  - Department of Neurosurgery, Boston Children's Hospital, Boston, MA, 02115, USA
- Paul Serrato
  - Yale School of Medicine, Yale University, New Haven, CT, 06510, USA
  - Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, 02115, USA
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Nathan Chisvo
  - Computational Neuroscience Outcomes Center, Department of Neurosurgery, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Omar Arnaout
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
- Pokmeng Alfred See
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Department of Neurosurgery, Boston Children's Hospital, Boston, MA, 02115, USA
- Kevin T Huang
  - Harvard Medical School, Harvard University, Boston, MA, 02115, USA
  - Department of Neurosurgery, Brigham and Women's Hospital, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA, 02115, USA
3. Brown EDL, Ward M, Maity A, Mittler MA, Larry Lo SF, D'Amico RS. Enhancing Diagnostic Support for Chiari Malformation and Syringomyelia: A Comparative Study of Contextualized ChatGPT Models. World Neurosurg 2024; 189:e86-e107. PMID: 38830507. DOI: 10.1016/j.wneu.2024.05.172. Received 04/23/2024; accepted 05/28/2024.
Abstract
OBJECTIVES The rapidly increasing adoption of large language models in medicine has drawn attention to potential applications within the field of neurosurgery. This study evaluates the effects of various contextualization methods on ChatGPT's ability to provide expert-consensus-aligned recommendations on the diagnosis and management of Chiari Malformation and Syringomyelia. METHODS Native GPT4 and GPT4 models contextualized using various strategies were asked questions revised from the 2022 Chiari and Syringomyelia Consortium International Consensus Document. ChatGPT-provided responses were then compared to consensus statements using reviewer assessments of 1) responding to the prompt, 2) agreement of the ChatGPT response with consensus statements, 3) recommendation to consult with a medical professional, and 4) presence of supplementary information. Flesch-Kincaid, SMOG, word count, and Gunning-Fog readability scores were calculated for each model using the quanteda package in R. RESULTS Relative to GPT4, all contextualized GPTs demonstrated increased agreement with consensus statements. The PDF+Prompting and Prompting models achieved the highest agreement scores, 19 of 24 and 23 of 24, respectively, versus 9 of 24 for GPT4 (p=.021, p=.001). A trend toward improved readability was observed when comparing contextualized models overall to ChatGPT4, with significant decreases in average word count (180.7 vs 382.3, p<.001) and Flesch-Kincaid Reading Ease score (11.7 vs 17.2, p=.033). CONCLUSIONS The enhanced performance observed in response to ChatGPT4 contextualization suggests broader applications of large language models in neurosurgery than the current literature indicates. This study provides proof of concept for the use of contextualized GPT models in neurosurgical contexts and showcases the easy accessibility of improved model performance.
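The study computed its readability metrics with the quanteda package in R. As a rough illustration of what a Flesch-Kincaid grade computation involves, here is a minimal Python sketch using the standard grade-level formula with a crude vowel-group syllable heuristic; its numbers will not exactly match quanteda's tokenization, so treat it as illustrative only:

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels (min. one per word).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Standard formula: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59
```

Lower scores indicate text readable at a lower school grade level, which is why the decrease reported above is interpreted as improved readability.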
Affiliation(s)
- Ethan D L Brown
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
- Max Ward
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
- Apratim Maity
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
- Mark A Mittler
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
- Sheng-Fu Larry Lo
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
- Randy S D'Amico
  - Department of Neurologic Surgery, Donald and Barbara Zucker School of Medicine at Hofstra/Northwell, New York, USA
4. Mohamed AA, Lucke-Wold B. Apple Intelligence in neurosurgery. Neurosurg Rev 2024; 47:327. PMID: 39004685. DOI: 10.1007/s10143-024-02568-w. Received 06/26/2024; revised 06/26/2024; accepted 07/07/2024.
Abstract
With the current artificial intelligence (AI) boom, new innovative and accessible applications requiring minimal computer science expertise have been developed for discipline-specific and mainstream purposes. Apple Intelligence, a new AI model developed by Apple, aims to enhance user experiences with new functionalities across many of its product offerings. Although designed for everyday users, many of these features have potential applications in neurosurgery. These include functionalities for writing, image generation, and upgraded integrations to the voice command assistant Siri. Future integrations may also include other Apple products such as the Vision Pro for preoperative and intraoperative applications. Considering the popularity of Apple products, particularly the iPhone, it is important to appraise this new technology and how it can be leveraged to enhance patient care, improve neurosurgical education, and facilitate more efficiency for the neurosurgeon.
Affiliation(s)
- Ali A Mohamed
  - Charles E. Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL, USA
  - College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Brandon Lucke-Wold
  - Lillian S. Wells Department of Neurosurgery, University of Florida, Gainesville, FL, USA
5. Mohamed AA, Lucke-Wold B. Text-to-video generative artificial intelligence: Sora in neurosurgery. Neurosurg Rev 2024; 47:272. PMID: 38867134. DOI: 10.1007/s10143-024-02514-w. Received 05/16/2024; revised 06/01/2024; accepted 06/09/2024.
Abstract
Artificial intelligence (AI) has increased in popularity in neurosurgery, with recent interest in generative AI algorithms such as the Large Language Model (LLM) ChatGPT. Sora, an innovation in generative AI, leverages natural language processing, deep learning, and computer vision to generate impressive videos from text prompts. This new tool has many potential applications in neurosurgery, including patient education, public health, surgical training and planning, and research dissemination. However, the current model has considerable limitations, such as physically implausible motion generation, spontaneous generation of subjects, unnatural object morphing, inaccurate physical interactions, and abnormal behavior when many subjects are generated. Other typical concerns involve patient privacy, bias, and ethics. Further investigation is required to determine how effective generated videos are compared with their non-generated counterparts, irrespective of these limitations. Despite these challenges, Sora and other iterations of its text-to-video generative application may offer many benefits to the neurosurgical community.
Affiliation(s)
- Ali A Mohamed
  - Charles E. Schmidt College of Medicine, Florida Atlantic University, Boca Raton, FL, USA
  - College of Engineering and Computer Science, Florida Atlantic University, Boca Raton, FL, USA
- Brandon Lucke-Wold
  - Lillian S. Wells Department of Neurosurgery, University of Florida, Gainesville, FL, USA
6. Huang KT, Mehta NH, Gupta S, See AP, Arnaout O. Evaluation of the safety, accuracy, and helpfulness of the GPT-4.0 Large Language Model in neurosurgery. J Clin Neurosci 2024; 123:151-156. PMID: 38574687. DOI: 10.1016/j.jocn.2024.03.021. Received 02/09/2024; revised 03/19/2024; accepted 03/22/2024.
Abstract
BACKGROUND Although prior work demonstrated the surprising accuracy of Large Language Models (LLMs) on neurosurgery board-style questions, their use in day-to-day clinical situations warrants further investigation. This study assessed GPT-4.0's responses to common clinical questions across various subspecialties of neurosurgery. METHODS A panel of attending neurosurgeons formulated 35 general neurosurgical questions spanning neuro-oncology, spine, vascular, functional, pediatrics, and trauma. All questions were input into GPT-4.0 with a prespecified, standard prompt. Responses were evaluated by two attending neurosurgeons, each on a standardized scale for accuracy, safety, and helpfulness. Citations were indexed and evaluated against identifiable database references. RESULTS GPT-4.0 responses were consistent with current medical guidelines and accounted for recent advances in the field 92.8% and 78.6% of the time, respectively. Neurosurgeons reported GPT-4.0 responses providing unrealistic information or potentially risky information 14.3% and 7.1% of the time, respectively. On 5-point scales, reviewers rated GPT-4.0's responses as clinically useful (4.0 ± 0.6), relevant (4.7 ± 0.3), and coherent (4.9 ± 0.2). The depth of clinical responses varied (3.7 ± 0.6), and "red flag" symptoms were missed 7.1% of the time. Moreover, GPT-4.0 cited 86 references (2.46 citations per answer), of which only 50% were deemed valid, and 77.1% of responses contained at least one inappropriate citation. CONCLUSION Current general LLM technology can offer generally accurate, safe, and helpful neurosurgical information, but may not fully evaluate medical literature or recent field advances. Citation generation and usage remain unreliable. As this technology becomes more ubiquitous, clinicians will need to exercise caution when using it in practice.
Affiliation(s)
- Kevin T Huang
  - Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States
  - Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Neel H Mehta
  - Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Saksham Gupta
  - Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States
  - Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
- Alfred P See
  - Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States
  - Boston Children's Hospital, Department of Neurosurgery, 300 Longwood Avenue, Boston, MA 02115, United States
- Omar Arnaout
  - Brigham and Women's Hospital, Department of Neurosurgery, 60 Fenwood Road, Hale Building, 4th Floor, Boston, MA 02115, United States
  - Harvard Medical School, Department of Neurosurgery, 25 Shattuck Street, Boston, MA 02115, United States