Mani K, Scharfenberger T, Goldman SN, Kleinbart E, Mostafa E, Ramos RDLG, Fourman MS, Eleswarapu A. Multimodal machine learning for predicting perioperative safety indicators in spinal surgery.
Spine J 2025:S1529-9430(25)00158-5. [PMID: 40164437 DOI: 10.1016/j.spinee.2025.03.021]
[Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Revised: 01/28/2025] [Accepted: 03/22/2025] [Indexed: 04/02/2025]
Abstract
BACKGROUND CONTEXT
Machine learning (ML) algorithms can utilize the large amount of tabular data in electronic health records (EHRs) to predict perioperative safety indicators. Integrating unstructured free-text inputs via natural language processing (NLP) may further enhance predictive accuracy.
PURPOSE
To design and validate a preoperative multimodal ML architecture that integrates structured EHR data (patient demographics, comorbidities, and clinical covariates) with unstructured free-text inputs (past medical and surgical history, medications, and problem lists) via NLP. The multimodal models aim to improve the prediction of perioperative safety indicators compared to baseline ML models that only use structured tabular EHR data.
STUDY DESIGN
Retrospective cohort study.
PATIENT SAMPLE
1,898 patients admitted for elective or emergency spine surgery at four separate large urban academic spine centers during a 5-year period from 2018 to 2023.
OUTCOME MEASURES
Numerical outputs between 0 and 1 corresponding to the likelihood of (I) extended length of stay (LOS), (II) 90-day reoperation, and (III) perioperative intensive care unit (ICU) admission.
METHODS
We predicted the following safety indicators: (I) extended length of stay (LOS), (II) 90-day reoperation, and (III) perioperative intensive care unit (ICU) admission. The quanteda package for NLP within the R environment was used to preprocess free-text EHR inputs. The refined text was tokenized and transformed into numerical vectors using a bag-of-words approach and integrated with the tabular EHR data to create a document-feature matrix. Two extreme gradient boosted (XGBoost) ML models were trained: a base model using only structured tabular EHR data and a combined multimodal model that leveraged both structured tabular EHR data and numerical vectors derived from free-text NLP inputs. Hyperparameter tuning was performed via grid search, and the models were validated using 10-fold cross-validation with an 80:20 training/testing split. Word clouds were generated for the free-text data, and explainable artificial intelligence (XAI) techniques were employed to assess feature importance. Model performance metrics included area under the receiver operating characteristic curve (AUC-ROC), Brier score, calibration slope, calibration intercept, precision, recall, and F1-score.
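The bag-of-words step described above can be sketched in a few lines. This is an illustrative plain-Python version, not the quanteda/R workflow the authors used; the function names (`tokenize`, `build_feature_matrix`), the stopword list, the `txt_` column prefix, and the sample records are all hypothetical and are not study data.

```python
import re
from collections import Counter

# Hypothetical stopword list; the study's actual preprocessing rules are not given.
STOPWORDS = {"of", "and", "the", "with", "for", "to", "in"}

def tokenize(text):
    """Lowercase the text, split on non-letters, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def build_feature_matrix(tabular_rows, texts, tabular_names):
    """Append per-patient word counts to the structured (tabular) features."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))  # one column per distinct token
    columns = list(tabular_names) + [f"txt_{w}" for w in vocab]
    matrix = [list(row) + [c[w] for w in vocab]
              for row, c in zip(tabular_rows, counts)]
    return columns, matrix

# Made-up example records for illustration only.
tabular_names = ["age", "bmi"]
tabular_rows = [[64, 31.2], [52, 27.5]]
texts = [
    "History of vertebral osteomyelitis and radiculopathy",
    "Cervical myelopathy; radiculopathy",
]
columns, matrix = build_feature_matrix(tabular_rows, texts, tabular_names)
for name, values in zip(columns, zip(*matrix)):
    print(name, values)
```

Each row of `matrix` is one patient, with the tabular covariates followed by token counts. A matrix of this shape is what a gradient boosted model such as XGBoost would take as input for the combined multimodal model.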
RESULTS
1,898 patients (60.7% female) were extracted from January 2018 to September 2023, with a median age of 60.0 (IQR: 52.0-68.0) and a median body mass index (BMI) of 30.3 kg/m² (IQR: 26.3-34.6). Extended LOS was defined as ≥14.4 days, which applied to 10.1% of patients. The median LOS for the entire cohort was 4.0 days (IQR: 2.0-7.0), the 90-day reoperation rate was 10.54%, and the ICU admission rate was 7.74%. The preoperative tabular EHR models predicted perioperative safety indicators with AUC ranging from 0.770 to 0.779, Brier scores ranging from 0.074 to 0.099, and calibration slopes ranging from 2.279 to 2.418. Precision and recall for these models ranged from 0.918 to 0.973 and from 0.988 to 0.994, respectively, resulting in F1-scores between 0.954 and 0.973. The combined multimodal models predicted perioperative safety indicators with AUC ranging from 0.827 to 0.903, Brier scores ranging from 0.056 to 0.083, and calibration slopes ranging from 0.755 to 1.217. The multimodal models achieved precision ranging from 0.909 to 0.933 and recall ranging from 0.979 to 0.994, leading to F1-scores between 0.943 and 0.962. Important tabular predictors included patient age, BMI, hemoglobin level, white blood cell count, platelet count, and a combined anterior/posterior spinal fusion approach. Important free-text inputs included vertebral osteomyelitis, radiculopathy, myelopathy, and spinal metastasis.
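For readers less familiar with the reported metrics, the following is a minimal plain-Python sketch of how AUC-ROC, Brier score, precision, recall, and F1-score are computed from predicted probabilities. The function names, the 0.5 decision threshold, and the six example predictions are assumptions made for illustration; they are not the authors' code or data. Calibration slope and intercept are omitted because they require fitting a logistic recalibration model.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def auc_roc(y_true, y_prob):
    """Chance that a random positive case scores above a random negative case
    (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def precision_recall_f1(y_true, y_prob, threshold=0.5):
    """Classify at an assumed threshold, then score the positive class."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Made-up outcomes (1 = event occurred) and model outputs between 0 and 1.
y_true = [0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

print("AUC-ROC:", auc_roc(y_true, y_prob))
print("Brier score:", brier_score(y_true, y_prob))
print("Precision, recall, F1:", precision_recall_f1(y_true, y_prob))
```

A lower Brier score and a higher AUC-ROC both indicate better probabilistic predictions, which is the direction of change reported for the multimodal models relative to the tabular models.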
CONCLUSIONS
The multimodal NLP model exhibited superior performance in all outcome measures when compared to the baseline tabular model. Future work includes incorporating additional model dimensions, such as the history of present illness, physical exam, and spinal imaging, and clinically implementing the models into our informed consent and preoperative optimization pathway.