Peng HY, Duan SJ, Pan L, Wang MY, Chen JL, Wang YC, Yao SK. Development and validation of machine learning models for nonalcoholic fatty liver disease. Hepatobiliary Pancreat Dis Int 2023;22:615-621. [PMID: 37005147 DOI: 10.1016/j.hbpd.2023.03.009]
[Received: 05/20/2022] [Accepted: 03/20/2023] [Indexed: 04/04/2023]
Abstract
BACKGROUND
Nonalcoholic fatty liver disease (NAFLD) has become the most prevalent liver disease worldwide. Early diagnosis can effectively reduce NAFLD-related morbidity and mortality. This study aimed to combine risk factors to develop and validate a novel model for predicting NAFLD.
METHODS
We enrolled 578 participants who had completed abdominal ultrasound into the training set. Least absolute shrinkage and selection operator (LASSO) regression combined with random forest (RF) was used to screen significant predictors of NAFLD risk. Five machine learning models were developed: logistic regression (LR), RF, extreme gradient boosting (XGBoost), gradient boosting machine (GBM), and support vector machine (SVM). To further improve model performance, we conducted hyperparameter tuning with the train function in the Python package 'sklearn'. We included 131 participants who had completed magnetic resonance imaging in the testing set for external validation.
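The screening and tuning steps described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the abstract does not say how the LASSO and RF screens were combined or which hyperparameters were tuned, so the intersection rule, the top-8 cutoff, the parameter grid, and the use of `GridSearchCV` are all assumptions. sklearn's `GradientBoostingClassifier` stands in for XGBoost, which is distributed as a separate package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in sized like the training set (578 rows); NOT the study data.
X, y = make_classification(n_samples=578, n_features=20, n_informative=6,
                           random_state=0)
X = StandardScaler().fit_transform(X)

# Screen 1: L1-penalised (LASSO-style) logistic regression.
# Features whose coefficient is shrunk to exactly zero are dropped.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso.fit(X, y)
lasso_keep = set(np.flatnonzero(lasso.coef_[0]).tolist())

# Screen 2: random forest importance; keep the 8 highest-ranked features
# (the cutoff of 8 is an illustrative assumption).
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
rf_keep = set(np.argsort(rf.feature_importances_)[::-1][:8].tolist())

# Assumed combination rule: keep features that pass both screens.
selected = sorted(lasso_keep & rf_keep)
X_sel = X[:, selected]

# Hyperparameter tuning by cross-validated grid search, scored by ROC AUC.
# The grid is deliberately tiny so the sketch runs in seconds.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_sel, y)

print("selected feature indices:", selected)
print("best params:", search.best_params_)
print("cross-validated AUC:", round(search.best_score_, 3))
```

Cross-validated AUC on the training data is only an internal estimate; the study's external validation corresponds to scoring the tuned model once on a separate testing set.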
RESULTS
There were 329 participants with NAFLD and 249 without in the training set, and 96 with NAFLD and 35 without in the testing set. Visceral adiposity index, abdominal circumference, body mass index, alanine aminotransferase (ALT), the ALT/aspartate aminotransferase (AST) ratio, age, high-density lipoprotein cholesterol (HDL-C) and elevated triglyceride (TG) were important predictors of NAFLD risk. The areas under the curve (AUCs) of LR, RF, XGBoost, GBM, and SVM were 0.915 [95% confidence interval (CI): 0.886-0.937], 0.907 (95% CI: 0.856-0.938), 0.928 (95% CI: 0.873-0.944), 0.924 (95% CI: 0.875-0.939), and 0.900 (95% CI: 0.883-0.913), respectively. The XGBoost model showed the best predictive performance, and its AUC increased to 0.938 (95% CI: 0.870-0.950) with further parameter tuning.
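For readers unfamiliar with how an AUC and its 95% CI can be obtained, the following is a minimal sketch using only the Python standard library. The abstract does not state how the intervals were computed; the percentile bootstrap used here is an assumption (DeLong's method is another common choice). The scores are simulated, with class sizes matching the testing set (96 with NAFLD, 35 without), and the function names `auc` and `bootstrap_ci` are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one.

    Equivalent to the Mann-Whitney U statistic; ties count as 0.5.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling cases with replacement)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # skip resamples that contain a single class
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Simulated model scores; NOT the study data.
rng = random.Random(42)
labels = [1] * 96 + [0] * 35
scores = ([rng.gauss(1.5, 1.0) for _ in range(96)]
          + [rng.gauss(0.0, 1.0) for _ in range(35)])

point = auc(labels, scores)
lo, hi = bootstrap_ci(labels, scores)
print(f"AUC = {point:.3f} (95% CI: {lo:.3f}-{hi:.3f})")
```

With only 35 negative cases in a testing set of this size, bootstrap intervals tend to be fairly wide, which is worth keeping in mind when comparing models whose AUCs differ only in the second decimal place.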
CONCLUSIONS
This study developed and validated five novel machine learning models for NAFLD prediction, among which XGBoost presented the best performance and was considered a reliable reference for early identification of high-risk patients with NAFLD in clinical practice.