1
Machine Learning-Based Prediction of Suicidal Thinking in Adolescents by Derivation and Validation in 3 Independent Worldwide Cohorts: Algorithm Development and Validation Study. J Med Internet Res 2024; 26:e55913. [PMID: 38758578 DOI: 10.2196/55913] [Received: 12/29/2023] [Revised: 03/24/2024] [Accepted: 03/25/2024] [Indexed: 05/18/2024] Open
Abstract
BACKGROUND Suicide is the second-leading cause of death among adolescents and is associated with clusters of suicides. Despite numerous studies on this preventable cause of death, the focus has primarily been on single nations and traditional statistical methods. OBJECTIVE This study aims to develop a predictive model for adolescent suicidal thinking using multinational data sets and machine learning (ML). METHODS We used data from the Korea Youth Risk Behavior Web-based Survey with 566,875 adolescents aged between 13 and 18 years and conducted external validation using the Youth Risk Behavior Survey with 103,874 adolescents and Norway's University National General Survey with 19,574 adolescents. Several tree-based ML models were developed, and feature importance and Shapley additive explanations values were analyzed to identify risk factors for adolescent suicidal thinking. RESULTS When trained on the Korea Youth Risk Behavior Web-based Survey data from South Korea, the XGBoost model achieved an area under the receiver operating characteristic curve (AUROC) of 90.06% (95% CI 89.97-90.16), outperforming the other models. In external validation on the Youth Risk Behavior Survey data from the United States and the University National General Survey data from Norway, the XGBoost model achieved AUROCs of 83.09% and 81.27%, respectively. Across all data sets, XGBoost had the highest AUROC and was selected as the optimal model. Among predictors of suicidal thinking, feelings of sadness and despair were the most influential, accounting for 57.4% of the impact, followed by stress status (19.8%), age (5.7%), household income (4%), academic achievement (3.4%), and sex (2.1%); the remaining predictors contributed less than 2% each. CONCLUSIONS This study used ML by integrating diverse data sets from 3 countries to address adolescent suicide. The findings highlight the important role of emotional health indicators in predicting suicidal thinking among adolescents. Specifically, sadness and despair were identified as the most significant predictors, followed by stressful conditions and age. These findings emphasize the critical need for early diagnosis and prevention of mental health issues during adolescence.
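The abstract above reports an AUROC with a 95% CI but does not say how the interval was obtained. As a rough illustration only, the sketch below computes AUROC as a pairwise ranking probability and attaches a percentile bootstrap interval; the bootstrap is an assumption made here, not the authors' stated method, and the function names `auroc` and `bootstrap_ci` are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as 0.5.
    Labels are assumed to be 0/1.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data, purely illustrative (not from the study).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
toy_ci = bootstrap_ci(toy_labels, toy_scores, n_boot=200)
```

With hundreds of thousands of respondents, as in the derivation cohort, the pairwise loop would be too slow and a rank-based formula would be the practical choice; the pairwise form is kept here because it makes the definition explicit.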
2
Explainable cancer factors discovery: Shapley additive explanation for machine learning models demonstrates the best practices in the case of pancreatic cancer. Pancreatology 2024; 24:404-423. [PMID: 38342661 DOI: 10.1016/j.pan.2024.02.002] [Received: 09/12/2023] [Revised: 01/07/2024] [Accepted: 02/05/2024] [Indexed: 02/13/2024]
Abstract
Pancreatic cancer is a digestive tract cancer with a high mortality rate. Despite the wide range of available treatments and improvements in surgery, chemotherapy, and radiation therapy, the five-year prognosis for individuals diagnosed with pancreatic cancer remains poor. Whether immunotherapy can be used to treat pancreatic cancer still requires research. The goals of our research were to understand the tumor microenvironment of pancreatic cancer, find a useful biomarker to assess patient prognosis, and investigate its biological relevance. In this paper, machine learning methods such as random forest were combined with weighted gene co-expression networks to screen hub immune-related genes (hub-IRGs). A LASSO regression model was then applied for further selection, yielding eight hub-IRGs. Based on these hub-IRGs, we created a prognostic risk prediction model for pancreatic adenocarcinoma (PAAD) that stratifies patients accurately and produces a prognostic risk score (IRG_Score) for each patient. In the raw data set and the validation data set, the five-year area under the curve (AUC) for this model was 0.9 and 0.7, respectively. Shapley additive explanations (SHAP) were used to rank the factors influencing prognostic risk prediction from a machine learning perspective and to identify the most influential genes or clinical factors. The five most important factors were TRIM67, CORT, PSPN, SCAMP5, and RFXAP, all of which are genes. In summary, the eight hub-IRGs had accurate risk prediction performance and biological significance, which was validated in other cancers. The SHAP results helped clarify the molecular mechanism of pancreatic cancer.
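SHAP attributes a prediction to individual features using Shapley values from cooperative game theory. The sketch below computes exact Shapley values by brute force for a tiny toy model; it is not the paper's pipeline, and the model `toy_model`, the instance, and the all-zero baseline are hypothetical choices made for illustration. Brute force is only feasible for a handful of features, which is why SHAP implementations rely on approximations or model-specific algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset_of_indices)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for k in range(len(others) + 1):
            # Weight of a coalition of size k that does not contain feature i.
            weight = factorial(k) * factorial(n_features - k - 1) / factorial(n_features)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2 * x[0] + x[1] * x[2]


instance = (1.0, 2.0, 3.0)   # the sample being explained (illustrative)
baseline = (0.0, 0.0, 0.0)   # reference values for "absent" features (assumption)


def value_fn(present):
    """Model output when only the features in `present` take the instance's values."""
    x = [instance[j] if j in present else baseline[j] for j in range(len(instance))]
    return toy_model(x)


phi = shapley_values(value_fn, 3)
```

In this toy case the additive feature receives its full effect and the interaction term is split evenly between the two interacting features, and the attributions sum to the difference between the prediction for the instance and the prediction for the baseline.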
3
Explainable AI-based Deep-SHAP for mapping the multivariate relationships between regional neuroimaging biomarkers and cognition. Eur J Radiol 2024; 174:111403. [PMID: 38452732 DOI: 10.1016/j.ejrad.2024.111403] [Received: 11/02/2023] [Revised: 01/16/2024] [Accepted: 03/01/2024] [Indexed: 03/09/2024]
Abstract
BACKGROUND Mild cognitive impairment (MCI)/Alzheimer's disease (AD) is associated with cognitive decline beyond normal aging and linked to the alterations of brain volume quantified by magnetic resonance imaging (MRI) and amyloid-beta (Aβ) quantified by positron emission tomography (PET). Yet, the complex relationships between these regional imaging measures and cognition in MCI/AD remain unclear. Explainable artificial intelligence (AI) may uncover such relationships. METHOD We integrate the AI-based deep learning neural network and Shapley additive explanations (SHAP) approaches and introduce the Deep-SHAP method to investigate the multivariate relationships between regional imaging measures and cognition. After validating this approach on simulated data, we apply it to real experimental data from MCI/AD patients. RESULTS Deep-SHAP significantly predicted cognition using simulated regional features and identified the ground-truth simulated regions as the most significant multivariate predictors. When applied to experimental MRI data, Deep-SHAP revealed that the insula, lateral occipital, medial frontal, temporal pole, and occipital fusiform gyrus are the primary contributors to global cognitive decline in MCI/AD. Furthermore, when applied to experimental amyloid Pittsburgh compound B (PiB)-PET data, Deep-SHAP identified the key brain regions for global cognitive decline in MCI/AD as the inferior temporal, parahippocampal, inferior frontal, supratemporal, and lateral frontal gray matter. CONCLUSION Deep-SHAP method uncovered the multivariate relationships between regional brain features and cognition, offering insights into the most critical modality-specific brain regions involved in MCI/AD mechanisms.
4
Automated Machine Learning and Explainable AI (AutoML-XAI) for Metabolomics: Improving Cancer Diagnostics. JOURNAL OF THE AMERICAN SOCIETY FOR MASS SPECTROMETRY 2024. [PMID: 38690775 DOI: 10.1021/jasms.3c00403] [Indexed: 05/03/2024]
Abstract
Metabolomics generates complex data necessitating advanced computational methods for generating biological insight. While machine learning (ML) is promising, the challenges of selecting the best algorithms and tuning hyperparameters, particularly for nonexperts, remain. Automated machine learning (AutoML) can streamline this process; however, the issue of interpretability could persist. This research introduces a unified pipeline that combines AutoML with explainable AI (XAI) techniques to optimize metabolomics analysis. We tested our approach on two data sets: renal cell carcinoma (RCC) urine metabolomics and ovarian cancer (OC) serum metabolomics. AutoML, using Auto-sklearn, surpassed standalone ML algorithms like SVM and k-Nearest Neighbors in differentiating between RCC and healthy controls, as well as OC patients and those with other gynecological cancers. The effectiveness of Auto-sklearn is highlighted by its AUC scores of 0.97 for RCC and 0.85 for OC, obtained from the unseen test sets. Importantly, on most of the metrics considered, Auto-sklearn demonstrated a better classification performance, leveraging a mix of algorithms and ensemble techniques. Shapley Additive Explanations (SHAP) provided a global ranking of feature importance, identifying dibutylamine and ganglioside GM(d34:1) as the top discriminative metabolites for RCC and OC, respectively. Waterfall plots offered local explanations by illustrating the influence of each metabolite on individual predictions. Dependence plots spotlighted metabolite interactions, such as the connection between hippuric acid and one of its derivatives in RCC, and between GM3(d34:1) and GM3(18:1_16:0) in OC, hinting at potential mechanistic relationships. Through decision plots, a detailed error analysis was conducted, contrasting feature importance for correctly versus incorrectly classified samples. 
In essence, our pipeline emphasizes the importance of harmonizing AutoML and XAI, facilitating both simplified ML application and improved interpretability in metabolomics data science.
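The abstract distinguishes a global ranking of features from local, per-sample explanations such as waterfall plots. Assuming a matrix of per-sample attributions is already available, the sketch below shows one common way to derive both views: mean absolute attribution for the global ranking, and a cumulative path from a base value for a single sample. The feature names and numbers are made up for illustration and are not from the study.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by mean absolute attribution across all samples."""
    n = len(shap_rows)
    means = [
        sum(abs(row[j]) for row in shap_rows) / n
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, means), key=lambda t: t[1], reverse=True)


def waterfall(base_value, shap_row, feature_names):
    """Cumulative path from the base value to one sample's prediction.

    Features are ordered by absolute contribution, largest first.
    Each step is (feature, contribution, running_total).
    """
    order = sorted(range(len(shap_row)), key=lambda j: abs(shap_row[j]), reverse=True)
    running = base_value
    steps = []
    for j in order:
        running += shap_row[j]
        steps.append((feature_names[j], shap_row[j], running))
    return steps


# Hypothetical attributions for two samples and three metabolites.
names = ["metabolite_a", "metabolite_b", "metabolite_c"]
rows = [
    [0.2, -0.5, 0.1],
    [0.4, 0.3, -0.1],
]
ranking = global_importance(rows, names)
steps = waterfall(0.5, rows[0], names)
```

The two views can disagree: a feature that ranks low globally may still dominate the explanation for an individual sample, which is one reason to report both.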
5
Prediction models for postoperative recurrence of non-lactating mastitis based on machine learning. BMC Med Inform Decis Mak 2024; 24:106. [PMID: 38649879 PMCID: PMC11036744 DOI: 10.1186/s12911-024-02499-y] [Received: 10/11/2023] [Accepted: 04/03/2024] [Indexed: 04/25/2024] Open
Abstract
OBJECTIVES This study aims to build a machine learning (ML) model that predicts the probability of postoperative recurrence of non-lactating mastitis (NLM) using the Random Forest (RF) and XGBoost algorithms. Such a model can help identify the risk of NLM recurrence and guide clinical treatment planning. METHODS This study was conducted on inpatients admitted to the Mammary Department of Shuguang Hospital affiliated to Shanghai University of Traditional Chinese Medicine between July 2019 and December 2021. Follow-up of inpatient data was completed until December 2022. Ten features were selected to build the ML model: age, body mass index (BMI), number of abortions, presence of inverted nipples, extent of breast mass, white blood cell count (WBC), neutrophil to lymphocyte ratio (NLR), albumin-globulin ratio (AGR), triglyceride (TG), and presence of intraoperative discharge. We used two ML approaches (RF and XGBoost) to build models and predict the NLM recurrence risk of female patients. A total of 258 patients were randomly divided into a training set and a test set in a 75%-25% proportion. Model performance was evaluated based on accuracy, precision, recall, F1-score, and AUC. The Shapley Additive Explanations (SHAP) method was used to interpret the model. RESULTS Of the NLM patients, 48 (18.6%) experienced recurrence during the follow-up period. Ten features were selected to build the ML model. For the RF model, BMI was the most important feature; for the XGBoost model, it was intraoperative discharge. The results of tenfold cross-validation suggest that both models have good predictive performance, with the XGBoost model performing better than the RF model in our study. The trends of the SHAP values of all features in our models are consistent with the trends of these features' clinical presentation. Including these ten features is necessary to build practical prediction models for recurrence. CONCLUSIONS The results of tenfold cross-validation and SHAP values suggest that the models have predictive ability. The trends in SHAP values provide auxiliary validation of our models and lend them greater clinical significance.
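The evaluation metrics named in this abstract (accuracy, precision, recall, F1-score) all follow from the confusion matrix. The sketch below is a minimal stdlib illustration, with a hypothetical function name `classification_metrics`. The example reuses only the cohort counts stated above (48 recurrences among 258 patients) to show why accuracy alone can mislead at this class balance; the always-negative "model" is a thought experiment, not something the study evaluated.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Thought experiment: 48 recurrences among 258 patients, and a trivial
# classifier that always predicts "no recurrence".
y_true = [1] * 48 + [0] * 210
y_pred = [0] * 258
trivial = classification_metrics(y_true, y_pred)
```

The trivial classifier reaches roughly 81% accuracy while detecting no recurrences at all, which is why recall, F1-score, and AUC are reported alongside accuracy for imbalanced outcomes.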
6
An Interpretable Screening Approach Derived Through XGBoost Regression for the Discovery of Hypolipidemic Contributors in Chinese Hawthorn Leaf and its Counterfeit Malus Doumeri Leaf. PLANT FOODS FOR HUMAN NUTRITION (DORDRECHT, NETHERLANDS) 2024; 79:209-218. [PMID: 38340238 DOI: 10.1007/s11130-024-01148-z] [Accepted: 01/23/2024] [Indexed: 02/12/2024]
Abstract
The active ingredient group is a prominent feature reflecting the inherent characteristics of plant-based functional foods. Chinese hawthorn leaf (CHL), a tea substitute possessing intrinsic nutritional properties in anti-hyperlipidemia, was first found to be adulterated with Malus doumeri leaf (MDL) owing to similar commercial labels. In this context, the above-mentioned two contrasting species were explored through phytochemical profiling and activity assessment. The amelioration effect of CHL on free fatty acids-elicited lipid deposition in HepG2 cells was significantly better than that of MDL. Molecular networking-based metabolic profiles identified 68 and 67 components in CHL and MDL, with 33 shared components. Extreme gradient boosting (XGBoost) algorithm with outstanding performance was selected to screen candidate components contributing to hypolipidemic activity, and the output was later interpreted by Shapley additive explanations (SHAP) method. Twelve and eight components were separately screened as hyperlipidemic inhibitors in CHL and MDL, while only four constituents were shared. The bioactivity evaluation of selected ingredients and combinations further confirmed their anti-hyperlipidemia capacity. These findings emphasized the feasibility of filtering bioactivity-related compounds using interpretable machine learning approaches and illustrated that related species may contain different hypolipidemic contributors, even if shared constituents existed.
7
Machine learning prediction of higher heating value of biochar based on biomass characteristics and pyrolysis conditions. BIORESOURCE TECHNOLOGY 2024; 395:130364. [PMID: 38262543 DOI: 10.1016/j.biortech.2024.130364] [Received: 10/15/2023] [Revised: 11/07/2023] [Accepted: 01/19/2024] [Indexed: 01/25/2024]
Abstract
The higher heating value of biochar is an important parameter for the utilization of biomass energy. In this work, extreme gradient boosting regression and an artificial neural network were used to predict it from biomass characteristics and pyrolysis conditions. Empirical correlations were also developed for comparison. The extreme gradient boosting regression models showed better performance (R2 = 0.83-0.94). Shapley additive explanations and partial dependence plots indicated that the lignin content and higher heating value of the raw material were strongly positively correlated with the higher heating value of biochar, and identified favorable conditions for preparing biochar with a high higher heating value, such as pyrolysis temperature (>550 °C) and lignin content (>40 wt%). In addition, a program that predicts the higher heating value of biochar was developed with the PySimpleGUI library. This work offers a new optimization approach for the directional preparation of biochar.
8
From statistical inference to machine learning: A paradigm shift in contemporary cardiovascular pharmacotherapy. Br J Clin Pharmacol 2024; 90:691-699. [PMID: 37845041 DOI: 10.1111/bcp.15927] [Received: 09/05/2023] [Accepted: 10/02/2023] [Indexed: 10/18/2023] Open
Abstract
AIMS Heart failure with reduced ejection fraction (HFrEF) poses significant challenges for clinicians and researchers, owing to its multifaceted aetiology and complex treatment regimens. In light of this, artificial intelligence methods offer an innovative approach to identifying relationships within complex clinical datasets. Our study aims to explore the potential for machine learning algorithms to provide deeper insights into datasets of HFrEF patients. METHODS To this end, we analysed a cohort of 386 HFrEF patients who had been initiated on sodium-glucose co-transporter-2 inhibitor treatment and had completed a minimum of a 6-month follow-up. RESULTS In traditional frequentist statistical analyses, patients receiving the highest doses of beta-blockers (BBs) (chi-square test, P = .036) and those newly initiated on sacubitril-valsartan (chi-square test, P = .023) showed better outcomes. However, none of these pharmacological features stood out as independent predictors of improved outcomes in the Cox proportional hazards model. In contrast, when employing eXtreme Gradient Boosting (XGBoost) algorithms in conjunction with the data using Shapley additive explanations (SHAP), we identified several models with significant predictive power. The XGBoost algorithm inherently accommodates non-linear distribution, multicollinearity and confounding. Within this framework, pharmacological categories like 'newly initiated treatment with sacubitril/valsartan' and 'BB dose escalation' emerged as strong predictors of long-term outcomes. CONCLUSIONS In this manuscript, we not only emphasize the strengths of this machine learning approach but also discuss its potential limitations and the risk of identifying statistically significant yet clinically irrelevant predictors.
9
Research on mining land subsidence by intelligent hybrid model based on gradient boosting with categorical features support algorithm. JOURNAL OF ENVIRONMENTAL MANAGEMENT 2024; 354:120309. [PMID: 38377759 DOI: 10.1016/j.jenvman.2024.120309] [Received: 08/06/2023] [Revised: 12/19/2023] [Accepted: 02/06/2024] [Indexed: 02/22/2024]
Abstract
Land subsidence induced by coal mining (MLS) poses a serious threat to the ecological environment and to the safety of buildings, roads, and other infrastructure in mining areas. However, predicting and evaluating MLS is relatively complex, and the reliability of the prediction results depends closely on factors such as the professional knowledge and engineering experience of researchers. This paper combines five intelligent optimization algorithms, namely ant lion optimizer (ALO), bald eagle search (BES), bird swarm algorithm (BSA), harris hawks optimization (HHO), and sparrow search algorithm (SSA), with the gradient boosting with categorical features support (CatBoost) machine learning model to predict MLS. To this end, five CatBoost-based hybrid models were developed, and their prediction accuracy and reliability were compared and analyzed. The hybrid models predicted significantly better than the single model, with the SSA-CatBoost model improving the most (R2 from 0.927 to 0.965, RMSE from 0.541 to 0.377, MAE from 0.386 to 0.297, VAF from 92.720 to 95.837). The importance and predictive contribution of all input features to the predicted labels were studied with the Shapley method. The results indicate that hybrid modeling is a reliable MLS prediction method. This study can help mining technicians use machine learning methods to study the degree of MLS damage to the surface environment, and it provides scientific advance prediction and evaluation for the protection and management of the ecological environment in mining areas and for the formulation of safety production measures.
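The four regression metrics quoted in this abstract (R2, RMSE, MAE, VAF) can be computed as sketched below. The abstract does not define VAF; the sketch assumes the common "variance accounted for" definition, 100 × (1 − var(residuals) / var(observed)), which is consistent with the percentage-scale values reported. The function name and the toy numbers are illustrative only.

```python
from math import sqrt


def _mean(xs):
    return sum(xs) / len(xs)


def _variance(xs):
    """Population variance."""
    m = _mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def regression_metrics(y_true, y_pred):
    """R2, RMSE, MAE and VAF (in percent, assumed definition) for paired values."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    mean_true = _mean(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": sqrt(ss_res / len(residuals)),
        "MAE": _mean([abs(r) for r in residuals]),
        "VAF": (1 - _variance(residuals) / _variance(y_true)) * 100,
    }


# Toy example: predictions that are off by a constant 0.5 everywhere.
biased = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

Under this definition VAF ignores a constant offset (the toy example scores 100 on VAF but only 0.8 on R2), so the metrics are complementary rather than redundant.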
10
Spatial distribution and source identification of potentially toxic elements in Yellow River Delta soils, China: An interpretable machine-learning approach. THE SCIENCE OF THE TOTAL ENVIRONMENT 2024; 912:169092. [PMID: 38056655 DOI: 10.1016/j.scitotenv.2023.169092] [Received: 09/29/2023] [Revised: 11/15/2023] [Accepted: 12/02/2023] [Indexed: 12/08/2023]
Abstract
Identifying the driving factors and quantifying the sources of potentially toxic elements (PTEs) are essential for protecting the ecological environment of the Yellow River Delta. In this study, data from 201 surface soil samples and 16 environmental variables were collected, and the random forest (RF) and Shapley additive explanations (SHAP) methods were then combined to explore the key factors affecting soil PTEs. An innovative t-distributed random neighbor embedding-RF-SHAP model was then constructed, based on the absolute principal component score and multivariate linear regression model, to quantitatively determine PTE sources. Although average PTE concentrations did not exceed the risk control values, PTE distributions exhibited significant differences. It was found that sodium, soil organic matter, and phosphorus contents were the three most important factors affecting PTEs, and human activities and natural environmental factors both influence PTE contents by altering the soil properties. The proposed model successfully determined PTE sources in the soil, outperforming the original linear regression model with a significantly lower RMSE. Source analysis revealed that the parent material was the main contributor to soil PTEs, accounting for more than half of the total PTE content. Industrial and agricultural activities also contributed to an increase in soil PTEs, with average contributions of 19.91 % and 17.44 %, respectively. Unknown sources accounted for 10.83 % of the total PTE content. Thus, the proposed model provides innovative perspectives on source parsing. These findings provide valuable scientific insights for policymakers seeking to develop effective environmental protection measures and improve the quality of saline-alkali land in the Yellow River Delta.
11
Interpretable machine learning model to predict surgical difficulty in laparoscopic resection for rectal cancer. Front Oncol 2024; 14:1337219. [PMID: 38380369 PMCID: PMC10878416 DOI: 10.3389/fonc.2024.1337219] [Received: 11/22/2023] [Accepted: 01/15/2024] [Indexed: 02/22/2024] Open
Abstract
Background Laparoscopic total mesorectal excision (LaTME) is a standard surgical method for rectal cancer, and it is a challenging procedure. This study uses machine learning to develop and validate prediction models for the surgical difficulty of LaTME in patients with rectal cancer and compares these models' performance. Methods We retrospectively collected the preoperative clinical and MRI pelvimetry parameters of rectal cancer patients who underwent laparoscopic total mesorectal resection from 2017 to 2022. The difficulty of LaTME was defined according to the scoring criteria reported by Escal. Patients were randomly divided into a training group (80%) and a test group (20%). We selected independent influencing features using the least absolute shrinkage and selection operator (LASSO) and multivariate logistic regression. The synthetic minority oversampling technique (SMOTE) was adopted to alleviate the class imbalance problem. Six machine learning models were developed: light gradient boosting machine (LGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), logistic regression (LR), random forest (RF), and multilayer perceptron (MLP). The area under the receiver operating characteristic curve (AUROC), accuracy, sensitivity, specificity, and F1 score were used to evaluate model performance. Shapley Additive Explanations (SHAP) analysis provided an interpretation of the best machine learning model. Decision curve analysis (DCA) was further used to evaluate the clinical utility of the model. Results A total of 626 patients were included. LASSO regression analysis showed that tumor height, prognostic nutrition index (PNI), pelvic inlet, pelvic outlet, sacrococcygeal distance, mesorectal fat area, and angle 5 (the angle between the apex of the sacral angle and the lower edge of the pubic bone) were the predictor variables of the machine learning model. In addition, the correlation heatmap showed no significant correlation among these seven variables. When predicting the difficulty of LaTME surgery, the XGBoost model performed best among the six machine learning models (AUROC=0.855). Based on the DCA results, the XGBoost model was also superior, and feature importance analysis showed that tumor height was the most important of the seven variables. Conclusions This study developed an XGBoost model to predict the difficulty of LaTME surgery. This model can help clinicians quickly and accurately predict the difficulty of surgery and adopt individualized surgical methods.
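Decision curve analysis, used in this abstract to compare models, plots net benefit against a threshold probability. As a minimal sketch of the standard net benefit formula, TP/n − (FP/n) × pt/(1 − pt), the code below evaluates a model and the "treat everyone" reference strategy at one threshold. The function names and toy data are hypothetical and do not come from the study.

```python
def net_benefit(y_true, probs, threshold):
    """Net benefit of acting on everyone whose predicted risk >= threshold.

    threshold must lie strictly below 1.
    """
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, probs) if p >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the reference strategy that treats every patient."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy example: four patients, two of them with the outcome.
outcomes = [1, 1, 0, 0]
risks = [0.9, 0.6, 0.7, 0.2]
nb_model = net_benefit(outcomes, risks, threshold=0.5)
nb_all = net_benefit_treat_all(outcomes, threshold=0.5)
```

A full decision curve repeats this over a range of thresholds; a model is considered useful at thresholds where its net benefit exceeds both the treat-all strategy and the treat-none strategy, whose net benefit is zero.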
12
Interpretable machine learning model for early prediction of 28-day mortality in ICU patients with sepsis-induced coagulopathy: development and validation. Eur J Med Res 2024; 29:14. [PMID: 38172962 PMCID: PMC10763177 DOI: 10.1186/s40001-023-01593-7] [Received: 12/19/2022] [Accepted: 12/13/2023] [Indexed: 01/05/2024] Open
Abstract
OBJECTIVE Sepsis-induced coagulopathy (SIC) is extremely common in individuals with sepsis and is significantly associated with poor outcomes. This study attempted to develop an interpretable and generalizable machine learning (ML) model for early prediction of the risk of 28-day death in patients with SIC. METHODS In this retrospective cohort study, we extracted SIC patients from the Medical Information Mart for Intensive Care III (MIMIC-III), MIMIC-IV, and eICU-CRD databases according to Toshiaki Iba's scale. Overlapping records in MIMIC-IV were excluded. Only the MIMIC-III cohort was then randomly divided into a training set and an internal validation set at a ratio of 7:3, while the MIMIC-IV and eICU-CRD databases served as external validation sets. The predictive factors for 28-day mortality of SIC patients were determined using recursive feature elimination combined with tenfold cross-validation (RFECV). We then constructed models using ML algorithms. Multiple metrics were used to evaluate model performance, including the area under the receiver operating characteristic curve (AUROC), area under the precision recall curve (AUPRC), accuracy, sensitivity, specificity, negative predictive value, positive predictive value, recall, and F1 score. Finally, Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) were employed to provide a reasonable interpretation of the prediction results. RESULTS A total of 3280, 2798, and 1668 SIC patients were screened from the MIMIC-III, MIMIC-IV, and eICU-CRD databases, respectively. Seventeen features were selected to construct ML prediction models. XGBoost had the best performance in predicting the 28-day mortality of SIC patients, with AUCs of 0.828, 0.913, and 0.923, AUPRCs of 0.807, 0.796, and 0.921, accuracies of 0.785, 0.885, and 0.891, and F1 scores of 0.63, 0.69, and 0.70 in the MIMIC-III (internal validation set), MIMIC-IV, and eICU-CRD databases, respectively. The importance ranking and SHAP analyses showed that initial SOFA score, red blood cell distribution width (RDW), and age were the top three critical features in the XGBoost model. CONCLUSIONS We developed an optimal and explainable ML model to predict the risk of 28-day death in SIC patients. Compared with conventional scoring systems, the XGBoost model performed better. The model has the potential to improve clinical practice for SIC patients.
13
Predicting geogenic groundwater arsenic contamination risk in floodplains using interpretable machine-learning model. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2024; 340:122787. [PMID: 37879555 DOI: 10.1016/j.envpol.2023.122787] [Received: 06/05/2023] [Revised: 09/17/2023] [Accepted: 10/21/2023] [Indexed: 10/27/2023]
Abstract
Long-term exposure to geogenic arsenic (As)-contaminated groundwater poses a severe threat to public health. Elevated As concentrations have generally been observed together with high ammonium concentrations in floodplain groundwater. An extreme gradient boosting algorithm was used to develop a probability model based on hydrogeochemical data, which predicted the occurrence rates of groundwater As on a regional scale. Results showed that NH4+, Eh, K, Cl-, SO42-, and NO3- were powerful predictive variables of As exposure. The model revealed the co-enrichment of As with NH4+, suggesting that the mineralization of nitrogen-containing organic matter promoted the reduction of As-bearing iron oxides. The predicted distribution of high-As groundwater was highly consistent with the known spatial distribution of As contamination, and the model also accurately predicted As concentrations in the Jiangbei Plain of China and in typical As-affected floodplains of Southeast Asia. The model can serve as a low-cost and rapid virtual sensor for detecting As concentrations in private or newly drilled wells, thereby providing critical information for informed management decisions, environmental protection, and public health safety.
14
Prediction Model of Ocular Metastases in Gastric Adenocarcinoma: Machine Learning-Based Development and Interpretation Study. Technol Cancer Res Treat 2024; 23:15330338231219352. [PMID: 38233736 PMCID: PMC10865948 DOI: 10.1177/15330338231219352] [Received: 11/30/2022] [Revised: 10/10/2023] [Accepted: 11/08/2023] [Indexed: 01/19/2024] Open
Abstract
Background: Although gastric adenocarcinoma (GA) related ocular metastasis (OM) is rare, its occurrence indicates more severe disease. We aimed to use machine learning (ML) to analyze the risk factors of GA-related OM and predict its risk. Methods: This is a retrospective cohort study. The clinical data of 3532 GA patients were collected and randomly divided into training and validation sets at a ratio of 7:3. Patients with and without OM were classified into OM and non-OM (NOM) groups. Univariate and multivariate logistic regression analyses and least absolute shrinkage and selection operator were conducted. We integrated the variables identified through feature importance ranking and further refined the selection using forward sequential feature selection based on the random forest (RF) algorithm before incorporating them into the ML model. We applied six ML algorithms to construct the predictive GA model. The area under the receiver operating characteristic (ROC) curve indicated the model's predictive ability. We also established a web-based risk calculator based on the best-performing model. We used Shapley additive explanations (SHAP) to identify risk factors and to confirm the interpretability of the black box model. All patient details were de-identified. Results: The ML model, consisting of 13 variables, achieved its best predictive performance with the gradient boosting machine (GBM) model, with an area under the curve (AUC) of 0.997 in the test set. Using the SHAP method, we identified crucial factors for OM in GA patients, including LDL, CA724, CEA, AFP, CA125, Hb, CA153, and Ca2+. Additionally, we validated the model's reliability through an analysis of two patient cases and developed a functional online prediction calculator based on the GBM model. Conclusion: We used ML to establish a risk prediction model for GA-related OM and showed that GBM performed best among the six ML models. The model may identify patients with GA-related OM so that they can receive early and timely treatment.
|
15
|
Machine learning and decision making in aortic arch repair. J Thorac Cardiovasc Surg 2023:S0022-5223(23)01108-X. [PMID: 38016622 DOI: 10.1016/j.jtcvs.2023.11.032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 11/16/2023] [Accepted: 11/19/2023] [Indexed: 11/30/2023]
Abstract
BACKGROUND Decisions made during aortic arch surgery regarding cannulation strategy and nadir temperature are important in reducing risk, and there is a need to determine the best individualized strategy in a data-driven fashion. Using machine learning (ML), we modeled the risk of death or stroke in elective aortic arch surgery based on patient characteristics and intraoperative decisions. METHODS The study cohort comprised 1323 patients from 9 institutions who underwent an elective aortic arch procedure between 2002 and 2021. A total of 69 variables were used in developing a logistic regression and an XGBoost ML model trained for binary classification of mortality and stroke. Shapley additive explanations (SHAP) values were studied to determine the importance of intraoperative decisions. RESULTS During the study period, 3.9% of patients died and 5.4% experienced stroke. XGBoost (area under the curve [AUC], 0.77 for death, 0.87 for stroke) demonstrated better discrimination than logistic regression (AUC, 0.65 for death, 0.75 for stroke). From SHAP analysis, intraoperative decisions are 3 of the top 20 predictors of death and 6 of the top 20 predictors of stroke. Predictor weights are patient-specific and reflect the patient's preoperative characteristics and other intraoperative decisions. Patient-level simulation also demonstrates the variable contribution of each decision in the context of the other choices that are made. CONCLUSIONS Using ML, we can more accurately identify patients at risk of death and stroke, as well as the strategy that better reduces the risk of adverse events compared to traditional prediction models. Operative decisions may be tailored based on a patient's specific characteristics, allowing for maximized, personalized benefit.
|
16
|
Prediction model for hepatocellular carcinoma recurrence after hepatectomy: Machine learning-based development and interpretation study. Heliyon 2023; 9:e22458. [PMID: 38034691 PMCID: PMC10687050 DOI: 10.1016/j.heliyon.2023.e22458] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Revised: 09/10/2023] [Accepted: 11/13/2023] [Indexed: 12/02/2023] Open
Abstract
Background Identifying patients with hepatocellular carcinoma (HCC) at high risk of recurrence after hepatectomy can help to implement timely interventional treatment. This study aimed to develop a machine learning (ML) model to predict the recurrence risk of HCC patients after hepatectomy. Methods We retrospectively enrolled 315 HCC patients who underwent radical hepatectomy at the Third Affiliated Hospital of Sun Yat-sen University from April 2013 to October 2017, and randomly divided them into training and validation sets at a ratio of 7:3. According to postoperative recurrence, the patients were divided into a recurrence group and a non-recurrence group, and univariate and multivariate logistic regression were performed for the two groups. We applied six machine learning algorithms to construct the prediction models and performed internal validation by 10-fold cross-validation. The Shapley additive explanations (SHAP) method was applied to interpret the machine learning model. We also built a web calculator based on the best machine learning model to personalize the assessment of the recurrence risk of HCC patients after hepatectomy. Results A total of 13 variables were included in the machine learning models. The multilayer perceptron (MLP) machine learning model achieved the best predictive performance in the test set (AUC = 0.680). The SHAP method showed that γ-glutamyl transpeptidase (GGT), fibrinogen, neutrophil, aspartate aminotransferase (AST) and total bilirubin (TB) were the top 5 important factors for recurrence risk of HCC patients after hepatectomy. In addition, we further demonstrated the reliability of the model by analyzing two patients. Finally, we successfully constructed an online web prediction calculator based on the MLP machine learning model. Conclusion MLP was the optimal machine learning model for predicting the recurrence risk of HCC patients after hepatectomy. This predictive model can help identify HCC patients at high recurrence risk after hepatectomy to provide early and personalized treatment.
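The 7:3 random split and 10-fold cross-validation described in this abstract can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' pipeline: the function names `split_70_30` and `k_fold`, the seed, and the choice to run the folds on the training portion are all assumptions made here.

```python
import random

def split_70_30(n_samples, seed=0):
    """Shuffle sample indices and split them 7:3 into training and validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 7 // 10  # integer arithmetic avoids float rounding surprises
    return indices[:cut], indices[cut:]

def k_fold(indices, k=10):
    """Yield (train, held_out) index lists; every index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

# 315 patients, as reported in the abstract; the seed is arbitrary.
train_idx, valid_idx = split_70_30(315)
folds = list(k_fold(train_idx, k=10))
print(len(train_idx), len(valid_idx), len(folds))
```

In practice a stratified split is often preferred when the outcome is imbalanced, so that each fold keeps roughly the same recurrence rate; the abstract does not say whether stratification was used.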
|
17
|
Application and interpretation of machine learning models in predicting the risk of severe obstructive sleep apnea in adults. BMC Med Inform Decis Mak 2023; 23:230. [PMID: 37858225 PMCID: PMC10585776 DOI: 10.1186/s12911-023-02331-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2023] [Accepted: 10/10/2023] [Indexed: 10/21/2023] Open
Abstract
BACKGROUND Obstructive sleep apnea (OSA) is a globally prevalent disease with a complex diagnostic method. Severe OSA is associated with multi-system dysfunction. We aimed to develop an interpretable machine learning (ML) model for predicting the risk of severe OSA and analyzing the risk factors based on clinical characteristics and questionnaires. METHODS This was a retrospective study comprising 1656 subjects who presented and underwent polysomnography (PSG) between 2018 and 2021. A total of 23 variables were included, and after univariate analysis, 15 variables were selected for further preprocessing. Six types of classification models were used to evaluate the ability to predict severe OSA, namely logistic regression (LR), gradient boosting machine (GBM), extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), bootstrapped aggregating (Bagging), and multilayer perceptron (MLP). For all models, the area under the receiver operating characteristic curve (AUC) was calculated as the performance metric. We also drew SHapley Additive exPlanations (SHAP) plots to interpret predictive results and to analyze the relative importance of risk factors. An online calculator was developed to estimate the risk of severe OSA in individuals. RESULTS Among the enrolled subjects, 61.47% (1018/1656) were diagnosed with severe OSA. Multivariate LR analysis showed that 10 of 23 variables were independent risk factors for severe OSA. The GBM model showed the best performance (AUC = 0.857, accuracy = 0.766, sensitivity = 0.798, specificity = 0.734). An online calculator was developed to estimate the risk of severe OSA based on the GBM model. Finally, waist circumference, neck circumference, the Epworth Sleepiness Scale, age, and the Berlin questionnaire were revealed by the SHAP plot as the top five critical variables contributing to the diagnosis of severe OSA. Additionally, two typical cases were analyzed to interpret the contribution of each variable to the outcome prediction in a single patient. CONCLUSIONS We established six risk prediction models for severe OSA using ML algorithms. Among them, the GBM model performed best. The model facilitates individualized assessment and further clinical strategies for patients with suspected severe OSA. This will help to identify patients with severe OSA as early as possible and ensure their timely treatment. TRIAL REGISTRATION Retrospectively registered.
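The four metrics reported in this abstract (AUC, accuracy, sensitivity, specificity) can be computed from first principles as sketched below. The labels, scores, and the 0.5 decision threshold are made-up toy values, and the function names are chosen here for illustration; the AUC is computed as the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: 1 = severe OSA, 0 = not severe; scores are hypothetical model outputs.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

metrics = confusion_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC does not, which is one reason studies of this kind usually report AUC as the headline figure.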
|
18
|
Prediction model of ocular metastasis from primary liver cancer: Machine learning-based development and interpretation study. Cancer Med 2023; 12:20482-20496. [PMID: 37795569 PMCID: PMC10652349 DOI: 10.1002/cam4.6540] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2022] [Revised: 08/21/2023] [Accepted: 09/05/2023] [Indexed: 10/06/2023] Open
Abstract
BACKGROUND Ocular metastasis (OM) is a rare metastatic site of primary liver cancer (PLC). The purpose of this study was to establish a clinical predictive model of OM in PLC patients based on machine learning (ML). METHODS We retrospectively collected the clinical data of 1540 PLC patients and divided them into a training set and an internal test set in a 7:3 proportion. PLC patients were divided into OM and non-ocular metastasis (NOM) groups, and univariate logistic regression analysis was performed between the two groups. The variables with p < 0.05 in univariate logistic analysis were selected for the ML model. We constructed six ML models, which were internally verified by 10-fold cross-validation. The prediction performance of each ML model was evaluated by receiver operating characteristic (ROC) curves. We also constructed a web calculator based on the best-performing ML model to personalize the risk probability for OM. RESULTS Six variables were selected for the ML model. The extreme gradient boost (XGB) ML model achieved the optimal differential diagnosis ability, with an area under the curve (AUC) = 0.993, accuracy = 0.992, sensitivity = 0.998, and specificity = 0.984. Based on these results, an online web calculator was constructed by using the XGB ML model to help clinicians assess the risk probability of OM in PLC patients and guide diagnosis and treatment. Finally, the Shapley additive explanations (SHAP) library was used to obtain the six most important risk factors for OM in PLC patients: CA125, ALP, AFP, TG, CA199, and CEA. CONCLUSION We used the XGB model to establish a risk prediction model of OM in PLC patients. The predictive model can help identify PLC patients with a high risk of OM, provide early and personalized diagnosis and treatment, reduce the poor prognosis of OM patients, and improve the quality of life of PLC patients.
|
19
|
Predicting brain age gap with radiomics and automl: A Promising approach for age-Related brain degeneration biomarkers. J Neuroradiol 2023:S0150-9861(23)00241-9. [PMID: 37722591 DOI: 10.1016/j.neurad.2023.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/08/2023] [Revised: 09/14/2023] [Accepted: 09/15/2023] [Indexed: 09/20/2023]
Abstract
The Brain Age Gap (BAG), which refers to the difference between chronological age and predicted neuroimaging age, is proposed as a potential biomarker for age-related brain degeneration. However, existing brain age prediction models usually rely on a single marker and cannot discover meaningful hidden information in radiographic images. This study focuses on the application of radiomics, an advanced imaging analysis technique, combined with automated machine learning to predict BAG. Our methods achieve a promising result with a mean absolute error of 1.509 using the Alzheimer's Disease Neuroimaging Initiative dataset. Furthermore, using an interpretable method called SHapley Additive exPlanations, we find that the hippocampus and parahippocampal gyrus play a significant role in predicting age. Additionally, our investigation of age prediction discrepancies between patients with Alzheimer's disease (AD) and those with mild cognitive impairment (MCI) reveals a notable correlation with clinical cognitive assessment scale scores. This suggests that BAG has the potential to serve as a biomarker to support the diagnosis of AD and MCI. Overall, this study presents valuable insights into the application of neuroimaging models in the diagnosis of neurodegenerative diseases.
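The two quantities named in this abstract, the Brain Age Gap and the mean absolute error, reduce to simple arithmetic, sketched below. The sign convention (predicted minus chronological age, so a positive gap means the brain looks older) is an assumption, since the abstract only says "difference"; the ages are invented toy values and the function names are chosen here.

```python
def brain_age_gap(predicted_ages, chronological_ages):
    """Per-subject gap, here taken as predicted minus chronological age (assumed sign)."""
    return [p - c for p, c in zip(predicted_ages, chronological_ages)]

def mean_absolute_error(predicted_ages, chronological_ages):
    """Average absolute prediction error, in the same unit as the ages."""
    gaps = brain_age_gap(predicted_ages, chronological_ages)
    return sum(abs(g) for g in gaps) / len(gaps)

# Toy values for three hypothetical subjects.
chronological = [70.0, 65.0, 80.0]
predicted = [72.0, 64.0, 83.5]

gaps = brain_age_gap(predicted, chronological)
mae = mean_absolute_error(predicted, chronological)
print(gaps, mae)
```

Because the MAE averages absolute values, a model can have a low MAE while still showing a systematic positive or negative gap in a patient subgroup, which is the kind of discrepancy the abstract examines in AD and MCI.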
|
20
|
Prediction of subjective cognitive decline after corpus callosum infarction by an interpretable machine learning-derived early warning strategy. Front Neurol 2023; 14:1123607. [PMID: 37416313 PMCID: PMC10321713 DOI: 10.3389/fneur.2023.1123607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2023] [Accepted: 05/25/2023] [Indexed: 07/08/2023] Open
Abstract
Background and purpose Corpus callosum (CC) infarction is an extremely rare subtype of cerebral ischemic stroke. However, the symptoms of cognitive impairment often fail to attract early attention from patients, which seriously affects the long-term prognosis, with consequences such as high mortality, personality changes, mood disorders, psychotic reactions, and financial burden. This study seeks to develop and validate models for early prediction of the risk of subjective cognitive decline (SCD) after CC infarction using machine learning (ML) algorithms. Methods This is a prospective study that enrolled 213 (only 3.7%) CC infarction patients from a nine-year cohort comprising 8,555 patients with acute ischemic stroke. Telephone follow-up surveys were carried out for the patients with a definite diagnosis of CC infarction one year after disease onset, and SCD was identified by the Behavioral Risk Factor Surveillance System (BRFSS) questionnaire. Based on the significant features selected by the least absolute shrinkage and selection operator (LASSO), seven ML models including Extreme Gradient Boosting (XGBoost), Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Adaptive Boosting (AdaBoost), Gaussian Naïve Bayes (GNB), Complement Naïve Bayes (CNB), and Support Vector Machine (SVM) were established and their predictive performances were compared by different metrics. Importantly, SHapley Additive exPlanations (SHAP) was also utilized to examine the internal behavior of the highest-performance ML classifier. Results The LR model performed better than the other six ML models in predicting SCD after CC infarction, with an area under the receiver operating characteristic curve (AUC) of 77.1% in the validation set. Using LASSO and SHAP analysis, we found that infarction subregions of CC infarction, female sex, 3-month modified Rankin Scale (mRS) score, age, homocysteine, location of angiostenosis, neutrophil to lymphocyte ratio, pure CC infarction, and number of angiostenoses were the top nine significant predictors, in order of importance, for the output of the LR model. Meanwhile, we identified that infarction subregion of CC, female sex, 3-month mRS score and pure CC infarction were the factors independently associated with the cognitive outcome. Conclusion Our study is the first to demonstrate that the LR model with 9 common variables performs best in predicting the risk of post-stroke SCD due to CC infarction. In particular, the combination of the LR model and the SHAP explainer could aid in achieving personalized risk prediction and serve as a decision-making tool for early intervention, given the poor long-term outcome.
|
21
|
Deep learning mapping of surface MDA8 ozone: The impact of predictor variables on ozone levels over the contiguous United States. ENVIRONMENTAL POLLUTION (BARKING, ESSEX : 1987) 2023; 326:121508. [PMID: 36967006 DOI: 10.1016/j.envpol.2023.121508] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/12/2023] [Revised: 03/19/2023] [Accepted: 03/22/2023] [Indexed: 06/18/2023]
Abstract
The limited number of ozone monitoring stations imposes uncertainty in various applications, calling for accurate approaches to capturing ozone values in all regions, particularly those with no in-situ measurements. This study uses deep learning (DL) to accurately estimate daily maximum 8-hr average (MDA8) ozone and examines the spatial contribution of several factors on ozone levels over the contiguous U.S. (CONUS) in 2019. A comparison between in-situ observations and DL-estimated MDA8 ozone values shows a correlation coefficient (R) of 0.95, an index of agreement (IOA) of 0.97, and a mean absolute bias (MAB) of 2.79 ppb, highlighting the promising performance of the deep convolutional neural network (Deep-CNN) at estimating surface MDA8 ozone. Spatial cross-validation also confirms the high spatial accuracy of the model, which obtains an R of 0.91, an IOA of 0.96, and an MAB of 3.46 ppb when it is trained and tested on separate stations. To interpret the black-box nature of our DL model, we use Shapley additive explanations (SHAP) to generate a spatial feature contribution map (SFCM), the results of which confirm an advanced ability of Deep-CNN to capture the interactions between most predictor variables and ozone. For instance, the SFCM for solar radiation (SRad) shows that higher SRad values enhance the formation of ozone, particularly in the south and southwestern CONUS. As SRad triggers ozone precursors to produce ozone via photochemical reactions, it increases ozone concentrations. The model also shows that low humidity values increase ozone concentrations in the western mountainous regions. The negative correlation between humidity and ozone levels can be attributed to factors such as higher ozone decomposition resulting from increased levels of humidity and OH radicals. This study is the first to introduce the SFCM to investigate the spatial role of predictor variables on changes in estimated MDA8 ozone levels.
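The three agreement statistics quoted in this abstract (R, IOA, MAB) can be computed as sketched below. The index of agreement is written in Willmott's original form, which is an assumption because the abstract does not state which variant was used; the ozone values are invented toy numbers in ppb and the function names are chosen here.

```python
from math import sqrt

def pearson_r(obs, est):
    """Pearson correlation coefficient between observed and estimated values."""
    mo, me = sum(obs) / len(obs), sum(est) / len(est)
    cov = sum((o - mo) * (e - me) for o, e in zip(obs, est))
    return cov / sqrt(sum((o - mo) ** 2 for o in obs) * sum((e - me) ** 2 for e in est))

def index_of_agreement(obs, est):
    """Willmott's index of agreement: 1 - sum((e-o)^2) / sum((|e-mo| + |o-mo|)^2)."""
    mo = sum(obs) / len(obs)
    num = sum((e - o) ** 2 for o, e in zip(obs, est))
    den = sum((abs(e - mo) + abs(o - mo)) ** 2 for o, e in zip(obs, est))
    return 1.0 - num / den

def mean_absolute_bias(obs, est):
    """Mean of absolute differences, in the unit of the data (ppb here)."""
    return sum(abs(e - o) for o, e in zip(obs, est)) / len(obs)

# Toy MDA8 ozone values at four hypothetical stations.
observed = [40.0, 50.0, 60.0, 45.0]
estimated = [42.0, 48.0, 63.0, 44.0]

r = pearson_r(observed, estimated)
ioa = index_of_agreement(observed, estimated)
mab = mean_absolute_bias(observed, estimated)
print(r, ioa, mab)
```

R measures only linear association and is insensitive to a constant offset, so reporting IOA and MAB alongside it, as the abstract does, guards against a model that tracks the pattern but is biased in magnitude.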
|
22
|
Quantification of the antagonistic and synergistic effects of Pb2+, Cu2+, and Zn2+ bioaccumulation by living Bacillus subtilis biomass using XGBoost and SHAP. JOURNAL OF HAZARDOUS MATERIALS 2023; 446:130635. [PMID: 36584648 DOI: 10.1016/j.jhazmat.2022.130635] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Revised: 11/25/2022] [Accepted: 12/17/2022] [Indexed: 06/17/2023]
Abstract
Bioaccumulation and adsorption are efficient methods for removing heavy metal ions (HMIs) from aqueous environments. However, methods to quantifiably characterize the removal selectivity for co-existing HMIs are limited. In this study, we applied Shapley additive explanations (SHAP) following extreme gradient boosting (XGBoost) modeling, to generate SHAP values. We used these values to create an affinity interference index (AII) that quantitatively represented the interference between metal ions in a multi-metal bioaccumulation system. The selectivity for simultaneous bioaccumulation of Pb2+, Cu2+, and Zn2+ by living Bacillus subtilis biomass was then characterized as a proof of concept. The AII indicated that the bioaccumulation of Zn2+ was more strongly inhibited by Pb2+/Cu2+ (AII = 1) than that of Pb2+/Cu2+ by Zn2+. Moreover, the presence of Zn2+ promoted the bioaccumulation of Pb2+ (AII = 0.39), which was confirmed in further experiments where the bioaccumulation of Pb2+ (300 μM) was increased by 38% with Zn2+ (300 μM). This study demonstrated that the combination of XGBoost and SHAP is effective in the quantifiable characterization of the antagonistic and synergistic effects in a multi-metal simultaneous bioaccumulation system. This method could also be generalized to similar tasks for analyzing the selectivity effects in a multi-component system.
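The SHAP values this abstract builds on are Shapley values from cooperative game theory. For a handful of features they can be computed exactly by brute force, as sketched below. This is not the SHAP library and not the authors' affinity interference index: the toy model, the all-zero baseline, and the function names are assumptions for illustration, and real SHAP implementations use faster, model-specific algorithms.

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley value of each feature of x, relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_model(z):
    """Hypothetical response with one interaction term between features 0 and 2."""
    return 2 * z[0] + 3 * z[1] + 4 * z[0] * z[2]

x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The interaction term is shared equally between the two features involved, and the values sum to the difference between the model output at `x` and at the baseline. That additivity is what allows per-feature contributions to be compared across co-existing components.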
|
23
|
Explainable machine learning model reveals its decision-making process in identifying patients with paroxysmal atrial fibrillation at high risk for recurrence after catheter ablation. BMC Cardiovasc Disord 2023; 23:91. [PMID: 36803424 PMCID: PMC9936738 DOI: 10.1186/s12872-023-03087-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Accepted: 01/23/2023] [Indexed: 02/19/2023] Open
Abstract
BACKGROUND A number of models have been reported for predicting atrial fibrillation (AF) recurrence after catheter ablation. Although many of them are machine learning (ML) models, black-box effects are widespread, and it has been difficult to explain how variables affect model output. We sought to implement an explainable ML model and then reveal its decision-making process in identifying patients with paroxysmal AF at high risk for recurrence after catheter ablation. METHODS Between January 2018 and December 2020, 471 consecutive patients with paroxysmal AF who had their first catheter ablation procedure were retrospectively enrolled. Patients were randomly assigned to a training cohort (70%) and a testing cohort (30%). The explainable ML model based on the Random Forest (RF) algorithm was developed and modified on the training cohort, and tested on the testing cohort. In order to gain insight into the association between observed values and model output, Shapley additive explanations (SHAP) analysis was used to visualize the ML model. RESULTS In this cohort, 135 patients experienced tachycardia recurrence. With hyperparameters adjusted, the ML model predicted AF recurrence with an area under the curve of 66.7% in the testing cohort. Summary plots listed the top 15 features in descending order and preliminarily showed the association between features and outcome prediction. Early recurrence of AF showed the most positive impact on model output. Dependence plots combined with force plots showed the impact of a single feature on model output, and helped determine high-risk cut-off points. The thresholds of CHA2DS2-VASc score, systolic blood pressure, AF duration, HAS-BLED score, left atrial diameter and age were 2, 130 mmHg, 48 months, 2, 40 mm and 70 years, respectively. The decision plot identified significant outliers. CONCLUSION An explainable ML model effectively revealed its decision-making process in identifying patients with paroxysmal atrial fibrillation at high risk for recurrence after catheter ablation by listing important features, showing the impact of every feature on model output, determining appropriate thresholds and identifying significant outliers. Physicians can combine model output, visualization of the model and clinical experience to make better decisions.
|
24
|
Molecular modeling of C1-inhibitor as SARS-CoV-2 target identified from the immune signatures of multiple tissues: An integrated bioinformatics study. Cell Biochem Funct 2023; 41:112-127. [PMID: 36517964 DOI: 10.1002/cbf.3769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2022] [Revised: 11/02/2022] [Accepted: 11/27/2022] [Indexed: 12/16/2022]
Abstract
The rapid transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, weakened the global economy and caused a veritable collapse in health infrastructure. Molecular modeling in novel coronavirus research is promising and provides more evidence about pragmatic therapeutic options. This article proposes a machine-learning framework for identifying potential COVID-19 transcriptomic signatures. The transcriptomics data contains immune-related genes collected from multiple tissues (blood, nasal, and buccal) with accession number GSE183071. Extensive bioinformatics work was carried out to identify the potential candidate markers, including differential expression analysis, protein interactions, gene ontology, and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment studies. The overlap analysis found SERPING1, the gene that encodes the glycosylated plasma protein C1-INH, in all three datasets. Furthermore, an immuno-informatics study was conducted on the C1-INH protein. 5DU3, the protein identifier of C1-INH, was fetched to identify the antigenicity, major histocompatibility complex (MHC) Class I and II binding epitopes, allergenicity, toxicity, and immunogenicity. Peptides satisfying the vaccine-design criteria were then screened based on the metrics mentioned above. The drug-gene interaction study reported that Rhucin is strongly associated with SERPING1. HSIC-Lasso (Hilbert-Schmidt independence criterion-least absolute shrinkage and selection operator), a model-free biomarker selection technique, was employed to identify the genes having a nonlinear relationship with the target class. Supervised machine learning models were trained on the gene subset using leave-one-out cross-validation. Explainable artificial intelligence techniques were used for the model interpretation analysis.
|
25
|
Interpretable machine learning approach to analyze the effects of landscape and meteorological factors on mosquito occurrences in Seoul, South Korea. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2023; 30:532-546. [PMID: 35900627 PMCID: PMC9813121 DOI: 10.1007/s11356-022-22099-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/25/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
Mosquitoes are the underlying cause of various public health and economic problems. In this study, patterns of mosquito occurrence were analyzed based on landscape and meteorological factors in the metropolitan city of Seoul. We evaluated the influence of environmental factors on mosquito occurrence through the interpretation of prediction models with a machine learning algorithm. Through hierarchical cluster analysis, the study areas were classified into waterside and non-waterside areas, according to the landscape patterns. The mosquito occurrence was higher in the waterside area, and mosquito abundance was negatively affected by rainfall at the waterside. The mosquito occurrence was predicted in each cluster area based on the landscape and cumulative meteorological variables using a random forest algorithm. Both models exhibited good performance (both accuracy and AUROC > 0.8) in predicting the level of mosquito occurrence. The embedded relationship between the mosquito occurrence and the environmental factors in the models was explained using the Shapley additive explanation method. According to the variable importance and the partial dependence plots for each model, the waterside area was more influenced by the meteorological and land cover variables than the non-waterside area. Therefore, mosquito control strategies should consider the effects of landscape and meteorological conditions, including the temperature, rainfall, and the landscape heterogeneity. The present findings can contribute to the development of mosquito forecasting systems in metropolitan cities for the promotion of public health.
|
26
|
Prediction model of obstructive sleep apnea-related hypertension: Machine learning-based development and interpretation study. Front Cardiovasc Med 2022; 9:1042996. [PMID: 36545020 PMCID: PMC9760810 DOI: 10.3389/fcvm.2022.1042996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Accepted: 11/21/2022] [Indexed: 12/11/2022] Open
Abstract
Background Obstructive sleep apnea (OSA) is a globally prevalent disease closely associated with hypertension. To date, no predictive model for OSA-related hypertension has been established. We aimed to use machine learning (ML) to construct a model to analyze risk factors and predict OSA-related hypertension. Materials and methods We retrospectively collected the clinical data of OSA patients diagnosed by polysomnography from October 2019 to December 2021 and randomly divided them into training and validation sets. A total of 1,493 OSA patients with 27 variables were included. Independent risk factors for OSA-related hypertension were screened by multifactorial logistic regression models. Six ML algorithms, including logistic regression (LR), gradient boosting machine (GBM), extreme gradient boosting (XGBoost), adaptive boosting (AdaBoost), bootstrapped aggregating (Bagging), and multilayer perceptron (MLP), were used to develop the model on the training set. The validation set was used to tune the model hyperparameters to determine the final prediction model. We compared the accuracy and discrimination of the models to identify the best machine learning algorithm for predicting OSA-related hypertension. In addition, a web-based tool was developed to promote its clinical application. We used permutation importance and Shapley additive explanations (SHAP) to determine the importance of the selected features and interpret the ML models. Results A total of 18 variables were selected for the models. The GBM model achieved the best discriminatory ability (area under the receiver operating characteristic curve = 0.873, accuracy = 0.885, sensitivity = 0.713), and on the basis of this model, an online tool was built to help clinicians optimize the diagnosis of patients with OSA-related hypertension. Finally, age, family history of hypertension, minimum arterial oxygen saturation, body mass index, and percentage of time with SaO2 < 90% were revealed by the SHAP method as the top five critical variables contributing to the diagnosis of OSA-related hypertension. Conclusion We established a risk prediction model for OSA-related hypertension patients using the ML method and demonstrated that among the six ML models, the gradient boosting machine model performs best. This prediction model could help to identify high-risk OSA-related hypertension patients, provide early and individualized diagnoses and treatment plans, protect patients from the serious consequences of OSA-related hypertension, and minimize the burden on society.
|
27
|
Decoding river pollution trends and their landscape determinants in an ecologically fragile karst basin using a machine learning model. ENVIRONMENTAL RESEARCH 2022; 214:113843. [PMID: 35931190 DOI: 10.1016/j.envres.2022.113843] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/05/2022] [Revised: 04/27/2022] [Accepted: 07/04/2022] [Indexed: 06/15/2023]
Abstract
Karst watersheds accommodate high landscape complexity and are influenced by both human-induced and natural activity, which affects the formation and process of runoff, sediment connectivity and contaminant transport and alters natural hydrological and nutrient cycling. However, physical monitoring stations are costly and labor-intensive, which has confined the assessment of water quality impairments on spatial scale. The geographical characteristics of catchments are potential influencing factors of water quality, often overlooked in previous studies of highly heterogeneous karst landscape. To solve this problem, we developed a machining learning method and applied Extreme Gradient Boosting (XGBoost) to predict the spatial distribution of water quality in the world's most ecologically fragile karst watershed. We used the Shapley Addition interpretation (SHAP) to explain the potential determinants. Before this process, we first used the water quality damage index (WQI-DET) to evaluate the water quality impairment status and determined that CODMn, TN and TP were causing river water quality impairments in the WRB. Second, we selected 46 watershed features based on the three key processes (sources-mobilization-transport) which affect the temporal and spatial variation of river pollutants to predict water quality in unmonitored reaches and decipher the potential determinants of river impairments. The predicting range of CODMn spanned from 1.39 mg/L to 17.40 mg/L. The predictions of TP and TN ranged from 0.02 to 1.31 mg/L and 0.25-5.72 mg/L, respectively. In general, the XGBoost model performs well in predicting the concentration of water quality in the WRB. SHAP explained that pollutant levels may be driven by three factors: anthropogenic sources (agricultural pollution inputs), fragile soils (low organic carbon content and high soil permeability to water flow), and pollutant transport mechanisms (TWI, carbonate rocks). 
Our study provides key data to support decision-making for water quality restoration projects in the WRB and information to help bridge the science-policy gap.
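As a rough illustration of what a SHAP attribution means, the sketch below computes exact Shapley values for a single sample by enumerating every feature coalition. It is not the authors' pipeline: the toy model, its coefficients, the three feature names, and the baseline values are all invented for illustration, and features outside a coalition are simply replaced by a baseline value (one common convention, assumed here).

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating all coalitions.

    Features outside a coalition are replaced by their baseline value.
    Cost grows as 2^n, so this is only practical for a handful of features.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_tn_model(z):
    """Hypothetical stand-in for a fitted pollutant model (invented numbers)."""
    cropland, soil_permeability, twi = z
    return 1.0 + 2.0 * cropland + 0.5 * soil_permeability + 1.5 * cropland * twi


x = [0.6, 0.4, 0.5]          # hypothetical reach to explain
baseline = [0.2, 0.3, 0.1]   # hypothetical reference reach
phi = shapley_values(toy_tn_model, x, baseline)
for name, p in zip(["cropland", "soil_permeability", "twi"], phi):
    print(f"{name}: {p:+.4f}")
```

The attributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the property that lets SHAP values be read as per-feature contributions. Tree-specific SHAP implementations reach comparable attributions without this exponential enumeration.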
|
28
|
Analysis of ecosystem service drivers based on interpretive machine learning: a case study of Zhejiang Province, China. ENVIRONMENTAL SCIENCE AND POLLUTION RESEARCH INTERNATIONAL 2022; 29:64060-64076. [PMID: 35469384 DOI: 10.1007/s11356-022-20311-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2021] [Accepted: 04/13/2022] [Indexed: 06/14/2023]
Abstract
A systematic understanding of the driving mechanisms of ecosystem services (ESs) and the relationships among them is critical for successful ecosystem management. However, the impact of driving factors on the relationships between ESs and the formation of ecosystem service bundles (ESBs) remains unclear. To address this gap, we developed a modeling process that used random forest (RF) to model the ESs and ESBs of Zhejiang Province, China, in regression and classification mode, respectively, and the Shapley Additive Explanations (SHAP) method to interpret the underlying driving forces. We first mapped the spatial distribution of seven ESs in Zhejiang Province at a 1 × 1 km spatial resolution and then used the K-means clustering algorithm to obtain four ESBs. Combining the RF models with SHAP analysis, the results showed that each ES had key driving factors, and the relationships of synergy and trade-off between ESs were determined by the driving direction and intensity of the key factors. The driving factors affect the relationships of ESs and consequently affect the formation of ESBs. Thus, managing the dominant drivers is key to improving the supply capacity of ESs.
|
29
|
How Validation Methodology Influences Human Activity Recognition Mobile Systems. SENSORS 2022; 22:s22062360. [PMID: 35336529 PMCID: PMC8954513 DOI: 10.3390/s22062360] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/12/2022] [Revised: 03/12/2022] [Accepted: 03/15/2022] [Indexed: 02/01/2023]
Abstract
In this article, we introduce explainable methods to understand how Human Activity Recognition (HAR) mobile systems perform under different validation strategies. Our results introduce a new way to discover potential bias problems in which an inappropriate choice of validation methodology leads to overestimating the prediction accuracy of an algorithm. We show how the SHAP (Shapley additive explanations) framework, used in the literature to explain the predictions of any machine learning model, can provide graphical insights into how human activity recognition models achieve their results. This makes it possible to analyze, in a simplified way, which features are important to a HAR system under each validation methodology. We demonstrate that k-fold cross-validation (k-CV), the procedure used in most works to evaluate the expected error of a HAR system, not only can overestimate the prediction accuracy by about 13% in three public datasets but also selects a different feature set when compared with the universal model. Combining explainable methods with machine learning algorithms has the potential to help new researchers look inside the decisions of the machine learning algorithms, helping them avoid overestimating prediction accuracy in most cases, understand relations between features, and find bias before deploying the system in real-world scenarios.
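One way such an overestimate can arise is sketched below, under the assumption (not spelled out in the abstract) that record-wise k-fold splitting places windows from the same person in both the training and test portions. The sketch only counts that overlap for two splitting schemes; the subject count, window count, and function names are hypothetical and no classifier is trained.

```python
import random


def record_wise_folds(n_records, k, seed=0):
    """Plain k-fold: shuffle individual records, ignoring who produced them."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def subject_wise_folds(subjects, k):
    """Group-aware split: all records of a subject land in the same fold."""
    unique = sorted(set(subjects))
    fold_of = {s: i % k for i, s in enumerate(unique)}
    folds = [[] for _ in range(k)]
    for rec, s in enumerate(subjects):
        folds[fold_of[s]].append(rec)
    return folds


def leaked_subjects(subjects, folds):
    """Per fold, count test subjects that also occur in the training part."""
    counts = []
    for test in folds:
        test_set = set(test)
        test_subj = {subjects[i] for i in test}
        train_subj = {
            subjects[i] for i in range(len(subjects)) if i not in test_set
        }
        counts.append(len(test_subj & train_subj))
    return counts


# Hypothetical dataset: 6 subjects, 50 sensor windows each.
subjects = [s for s in range(6) for _ in range(50)]
rec = leaked_subjects(subjects, record_wise_folds(len(subjects), k=3))
sub = leaked_subjects(subjects, subject_wise_folds(subjects, k=3))
print("record-wise leaked subjects per fold:", rec)
print("subject-wise leaked subjects per fold:", sub)
```

With the subject-wise split every fold reports zero shared subjects, so test accuracy reflects performance on unseen people. With the record-wise split the model is scored partly on people it has already seen, which is one plausible source of optimistic accuracy estimates.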
|
30
|
Spatial heterogeneity modeling of water quality based on random forest regression and model interpretation. ENVIRONMENTAL RESEARCH 2021; 202:111660. [PMID: 34265353 DOI: 10.1016/j.envres.2021.111660] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 03/08/2021] [Revised: 06/28/2021] [Accepted: 07/04/2021] [Indexed: 06/13/2023]
Abstract
A systematic understanding of the spatial distribution of water quality is critical for successful watershed management; however, the limited number of physical monitoring stations has restricted the evaluation of spatial water quality distribution and the identification of features impacting the water quality. To fill this gap, we developed a modeling process that employed the random forest regression (RFR) to model the water quality distribution for the Taihu Lake basin in Zhejiang Province, China, and adopted the Shapley Additive exPlanations (SHAP) method to interpret the underlying driving forces. We first used RFR to model three water quality parameters: permanganate index (CODMn), total phosphorus (TP), and total nitrogen (TN), based on 16 watershed features. We then applied the built models to generate water quality distribution maps for the basin, with the CODMn ranging from 1.39 to 6.40 mg/L, TP from 0.02 to 0.23 mg/L, and TN from 1.43 to 4.27 mg/L. These maps showed generally consistent patterns among the CODMn, TN, and TP with minor differences in the spatial distribution. The SHAP analysis showed that the TN was mainly affected by agricultural non-point sources, while the CODMn and TP were affected by agricultural and domestic sources. Due to differences in sewage collection and treatment between urban and rural areas, the water quality in highly populated urban areas was better than that in rural areas, which led to an unexpected positive relationship between water quality and population density. Overall, with the RFR models and SHAP interpretation, we obtained a continuous distribution pattern of the water quality and identified its driving forces in the basin. These findings provided important information to assist water quality restoration projects.
|
31
|
Automated biomarker candidate discovery in imaging mass spectrometry data through spatially localized Shapley additive explanations. Anal Chim Acta 2021; 1177:338522. [PMID: 34482894 PMCID: PMC10124144 DOI: 10.1016/j.aca.2021.338522] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/12/2021] [Revised: 04/04/2021] [Accepted: 04/11/2021] [Indexed: 01/09/2023]
Abstract
The search for molecular species that are differentially expressed between biological states is an important step towards discovering promising biomarker candidates. In imaging mass spectrometry (IMS), performing this search manually is often impractical due to the large size and high-dimensionality of IMS datasets. Instead, we propose an interpretable machine learning workflow that automatically identifies biomarker candidates by their mass-to-charge ratios, and that quantitatively estimates their relevance to recognizing a given biological class using Shapley additive explanations (SHAP). The task of biomarker candidate discovery is translated into a feature ranking problem: given a classification model that assigns pixels to different biological classes on the basis of their mass spectra, the molecular species that the model uses as features are ranked in descending order of relative predictive importance such that the top-ranking features have a higher likelihood of being useful biomarkers. Besides providing the user with an experiment-wide measure of a molecular species' biomarker potential, our workflow delivers spatially localized explanations of the classification model's decision-making process in the form of a novel representation called SHAP maps. SHAP maps deliver insight into the spatial specificity of biomarker candidates by highlighting in which regions of the tissue sample each feature provides discriminative information and in which regions it does not. SHAP maps also enable one to determine whether the relationship between a biomarker candidate and a biological state of interest is correlative or anticorrelative. 
Our automated approach to estimating a molecular species' potential for characterizing a user-provided biological class, combined with the untargeted and multiplexed nature of IMS, allows for the rapid screening of thousands of molecular species and yields a broader biomarker candidate shortlist than would be possible through targeted manual assessment. Our biomarker candidate discovery workflow is demonstrated on mouse-pup and rat kidney case studies.
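The ranking and mapping ideas described in this abstract can be sketched in a few lines. The example below swaps in a linear scoring model, for which per-sample SHAP values have the closed form w_j (x_ij - mean_j) when features are treated as independent. It then ranks features by mean absolute SHAP value and reshapes one feature's per-pixel values into an image grid. The weights, the 2 x 3 "tissue" image, and the two features standing in for m/z bins are invented; the published workflow uses its own classifier and data.

```python
def linear_shap(weights, X):
    """SHAP values for a linear model, assuming independent features.

    phi[i][j] = w_j * (x_ij - mean_j), one row per pixel, one column per feature.
    """
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(weights))]
    return [
        [w * (row[j] - means[j]) for j, w in enumerate(weights)] for row in X
    ]


def rank_features(phi):
    """Feature indices in descending order of mean absolute SHAP value."""
    n_feat = len(phi[0])
    scores = [sum(abs(row[j]) for row in phi) / len(phi) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: scores[j], reverse=True)


def shap_map(phi, feature, height, width):
    """Reshape one feature's per-pixel SHAP values into a row-major grid."""
    col = [row[feature] for row in phi]
    return [col[r * width:(r + 1) * width] for r in range(height)]


# Hypothetical 2 x 3 image, pixels in row-major order, two m/z features.
X = [
    [0.0, 1.0], [0.0, 1.0], [0.0, 1.0],   # top row: feature 0 absent
    [4.0, 1.0], [4.0, 1.2], [4.0, 0.8],   # bottom row: feature 0 abundant
]
weights = [0.5, 0.1]  # invented linear class-score weights

phi = linear_shap(weights, X)
ranking = rank_features(phi)
grid = shap_map(phi, ranking[0], height=2, width=3)
print("ranking:", ranking)
for row in grid:
    print(row)
```

In the resulting grid, positive entries mark pixels where the feature pushes the score toward the class and negative entries mark pixels where it pushes away, which mirrors the correlative versus anticorrelative reading of SHAP maps described above.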
Collapse
|