Kerr WT, McFarlane KN, Pucci GF, Carns DR, Israel A, Vighetti L, Pennell PB, Stern JM, Xia Z, Wang Y. Supervised machine learning compared to large language models for identifying functional seizures from medical records. Epilepsia 2025;66:1155-1164. [PMID: 39960122; PMCID: PMC11997926; DOI: 10.1111/epi.18272]
[Received: 09/17/2024; Revised: 01/07/2025; Accepted: 01/08/2025; Indexed: 04/16/2025]
Abstract
OBJECTIVE
The Functional Seizures Likelihood Score (FSLS) is a supervised machine learning-based diagnostic score that was developed to differentiate functional seizures (FS) from epileptic seizures (ES). In contrast to this targeted approach, large language models (LLMs) can identify patterns in data for which they were not specifically trained. To evaluate the relative benefits of each approach, we compared the diagnostic performance of the FSLS to two LLMs: ChatGPT and GPT-4.
METHODS
In total, 114 anonymized cases were constructed based on patients with documented FS, ES, mixed ES and FS, or physiologic seizure-like events (PSLEs). Text-based data were presented in three sequential prompts to the LLMs, showing the history of present illness (HPI), electroencephalography (EEG) results, and neuroimaging results. We compared the accuracy (number of correct predictions/number of cases) and area under the receiver-operating characteristic (ROC) curves (AUCs) of the LLMs to the FSLS using mixed-effects logistic regression.
RESULTS
The accuracy of the FSLS was 74% (95% confidence interval [CI] 65%-82%) and its AUC was 85% (95% CI 77%-92%). GPT-4 was superior to both the FSLS and ChatGPT (p < .001), with an accuracy of 85% (95% CI 77%-91%) and an AUC of 87% (95% CI 79%-95%). Cohen's kappa between the FSLS and GPT-4 was 40% (fair agreement). For 33% of patients, the LLMs returned different predictions on different days when given the same note, and each LLM's self-rated certainty was moderately correlated with this observed variability (Spearman's rho: 30% [fair, ChatGPT] and 63% [substantial, GPT-4]).
SIGNIFICANCE
Both GPT-4 and the FSLS identified a substantial subset of patients with FS based on the clinical history. The merely fair agreement between their predictions indicates that the LLMs identified patients differently from the structured score. The inconsistency of the LLMs' predictions across days, and their incomplete insight into their own consistency, were concerning. This comparison highlights both the benefits of and cautions about using machine learning and artificial intelligence to identify patients with FS in clinical practice.