1
Zhang X, Tseo Y, Bai Y, Chen F, Uhler C. Prediction of protein subcellular localization in single cells. Nat Methods 2025; 22:1265-1275. [PMID: 40360932 DOI: 10.1038/s41592-025-02696-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Accepted: 04/09/2025] [Indexed: 05/15/2025]
Abstract
The subcellular localization of a protein is important for its function, and its mislocalization is linked to numerous diseases. Existing datasets capture limited pairs of proteins and cell lines, and existing protein localization prediction models either miss cell-type specificity or cannot generalize to unseen proteins. Here we present a method for Prediction of Unseen Proteins' Subcellular localization (PUPS). PUPS combines a protein language model and an image inpainting model to utilize both protein sequence and cellular images. We demonstrate that the protein sequence input enables generalization to unseen proteins, and the cellular image input captures single-cell variability, enabling cell-type-specific predictions. Experimental validation shows that PUPS can predict protein localization in newly performed experiments outside the Human Protein Atlas used for training. Collectively, PUPS provides a framework for predicting differential protein localization across cell lines and single cells within a cell line, including changes in protein localization driven by mutations.
Affiliation(s)
- Xinyi Zhang
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Yitong Tseo
  - Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yunhao Bai
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Fei Chen
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
- Caroline Uhler
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
  - Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, Cambridge, MA, USA
2
Zheng L, Wu X, Gu W, Wang R, Wang J, He H, Wang Z, Yi B, Zhang Y. Development and validation of a hypoxemia prediction model in middle-aged and elderly outpatients undergoing painless gastroscopy. Sci Rep 2025; 15:17965. [PMID: 40410303 PMCID: PMC12102271 DOI: 10.1038/s41598-025-02540-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Accepted: 05/14/2025] [Indexed: 05/25/2025] Open
Abstract
Hypoxemia is a common complication associated with anesthesia in painless gastroscopy. With the aging of the population, the number of cases of hypoxemia among middle-aged and elderly patients is increasing. However, tools for predicting hypoxemia in middle-aged and elderly patients are lacking. In this study, we investigated the risk factors for hypoxemia in middle-aged and elderly outpatients undergoing painless gastroscopy based on machine learning and constructed a risk prediction model. In this retrospective study, we included the data on 1,348 outpatients undergoing painless gastroscopy. In total, 26 characteristic variables, including demographic information, past medical history, and clinical data of the patients, were included, and BorutaShap was used for feature selection. Five machine learning algorithm models, including logistic regression (LR), support vector machine (SVM), random forest (RF), extreme gradient boosting (XGB), and light gradient boosting machine (LightGBM), were selected. The best models were selected based on the area under the receiver operating characteristic curve (AUROC). Model feature importance was explained and analyzed using Shapley Additive Explanations (SHAP). The endpoint event of this study was hypoxemia during the procedure, defined as at least one occurrence of pulse oxygen saturation below 90% without probe misalignment or interference from the beginning of anesthesia induction to the end of painless gastroscopy. In the final cohort of 984 patients, 11% of patients (108/984) experienced hypoxemia during the painless gastroscopy procedure. The AUROCs of the five models were as follows: LR (AUROC = 0.893, 95% CI: 0.881-0.899), SVM (AUROC = 0.855, 95% CI: 0.812-0.884), RF (AUROC = 0.914, 95% CI: 0.889-0.924), XGB (AUROC = 0.902, 95% CI: 0.865-0.919), and LightGBM (AUROC = 0.891, 95% CI: 0.847-0.917).
Regarding the explanation of the importance of SHAP features, preoperative variables (baseline SpO2, body mass index, and micrognathia) and intraoperative variables (operating time of gastroscopy, induction dose of etomidate and propofol mixture, append anesthetic, cough, and repeated pharyngeal irritation) significantly contributed to the model. We identified eight potential risk factors related to the occurrence of hypoxemia in middle-aged and elderly patients undergoing painless gastroscopy, based on machine learning feature engineering. Among the five machine learning algorithms, RF exhibited the best predictive performance in the internal test set and had a certain degree of generalization ability in the external validation set, which indicated that the RF model was more suitable for the data framework of this study. This model was more likely to enhance the accuracy of hypoxemia prediction in middle-aged and elderly patients undergoing painless gastroscopy, and thus, it is suitable for assisting anesthesiologists in clinical decision-making.
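For readers unfamiliar with the AUROC and confidence interval figures reported in this abstract, the following is a minimal sketch of how such values can be computed. It is not the authors' code: the abstract does not say how the intervals were derived, so the percentile bootstrap used here is an assumption, and the function names and toy data are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    Labels are expected to be 0 (no event) or 1 (event).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUROC (an assumed method)."""
    auroc(labels, scores)  # validates that both classes are present
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data for illustration only; not taken from the study.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]
print(auroc(toy_labels, toy_scores))  # 0.75
print(bootstrap_ci(toy_labels, toy_scores))
```

With a cohort as small as the toy example the interval is very wide; the narrow intervals in the abstract reflect a much larger test set.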
Affiliation(s)
- Leilei Zheng
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Xinyan Wu
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Wei Gu
  - Department of Anesthesiology, Minhang Hospital of Fudan University, Shanghai, 201199, China
- Rui Wang
  - Department of Anesthesiology, Third Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Jing Wang
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Hongying He
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Zhao Wang
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
- Bin Yi
  - Department of Anesthesiology, First Affiliated Hospital of Army Medical University, Chongqing, 400038, China
- Yi Zhang
  - Department of Anesthesiology, Second Affiliated Hospital of Zunyi Medical University, Zunyi, 563000, Guizhou, China
3
Huang RN, Luo SY, Huang T, Li XS, Zhou FC, Yin WH, Chen ZR, Yuan SZ, Li LY, Tang B, Qiao JD. The interaction of UBR4, LRP1, and OPHN1 in refractory epilepsy: Drosophila model to investigate the oligogenic effect on epilepsy. Neurobiol Dis 2025; 212:106955. [PMID: 40374006 DOI: 10.1016/j.nbd.2025.106955] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2025] [Revised: 05/11/2025] [Accepted: 05/12/2025] [Indexed: 05/17/2025] Open
Abstract
Refractory epilepsy is an intractable neurological disorder that can be associated with oligogenic/polygenic etiologies. Through trio-based whole-exome sequencing analysis, we identified a clinical case of refractory epilepsy with three candidate gene variants: UBR4, LRP1, and OPHN1. Utilizing the Gal4-UAS system and double-balancer tool, we generated single, double, and triple knockdown Drosophila models to investigate the interactions of the three candidate genes. Seizure behavioral experiments combined with logistic regression analysis revealed the individual epileptogenicity and significant synergistic epileptogenic effects of the three mutations. By constructing a SHAP-XGBoost machine learning model integrating seizure behavior data with knockdown efficiency metrics, we discovered that the LRP1 mutation served as the primary effector in the oligogenic system. Transcriptome analysis identified oxidative stress and metabolic imbalance as the main related processes, together with dysregulated expression of 48, 52, and 43 epilepsy-associated genes, which confirmed the epileptogenicity of OPHN1 knockdown, UBR4-LRP1 knockdown, and UBR4-LRP1-OPHN1 knockdown, respectively. Up-regulation of COX7AL and ND-B8 (enriched in metabolic pathways) and down-regulation of Diedel (enriched in the extracellular space component) appeared to be responsible for the significant epileptogenicity of the oligogenic knockdown. In this clinical case, epileptic pharmacoresistance was considered to be triggered by a combination of the KIF, SLC, and ASIC gene families. This study established a novel framework to clarify the multiple genetic structure of epileptogenicity in refractory epilepsy with an oligogenic background, which could be critical to translational medicine and precision therapy development.
Affiliation(s)
- Rui-Na Huang
  - Department of Neurology, Institute of Neuroscience, Key Laboratory of Neurogenetics and Channelopathies of Guangdong Province and the Ministry of Education of China, The Second Affiliated Hospital of Guangzhou Medical University, Changgang Dong Road, Guangzhou 510000, China; The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Si-Yuan Luo
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Tao Huang
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Xiong-Sheng Li
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Fan-Chao Zhou
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Wei-Hao Yin
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Ze-Ru Chen
  - The Second Clinical Medicine School, Guangzhou Medical University, Guangzhou 510000, China
- Shi-Zhan Yuan
  - Department of Neurology, Institute of Neuroscience, Key Laboratory of Neurogenetics and Channelopathies of Guangdong Province and the Ministry of Education of China, The Second Affiliated Hospital of Guangzhou Medical University, Changgang Dong Road, Guangzhou 510000, China
- Ling-Ying Li
  - Department of Neurology, Institute of Neuroscience, Key Laboratory of Neurogenetics and Channelopathies of Guangdong Province and the Ministry of Education of China, The Second Affiliated Hospital of Guangzhou Medical University, Changgang Dong Road, Guangzhou 510000, China
- Bin Tang
  - Department of Neurology, Institute of Neuroscience, Key Laboratory of Neurogenetics and Channelopathies of Guangdong Province and the Ministry of Education of China, The Second Affiliated Hospital of Guangzhou Medical University, Changgang Dong Road, Guangzhou 510000, China
- Jing-Da Qiao
  - Department of Neurology, Institute of Neuroscience, Key Laboratory of Neurogenetics and Channelopathies of Guangdong Province and the Ministry of Education of China, The Second Affiliated Hospital of Guangzhou Medical University, Changgang Dong Road, Guangzhou 510000, China
4
Angelis J, Schröder EA, Xiao Z, Gabriel W, Wilhelm M. Peptide Property Prediction for Mass Spectrometry Using AI: An Introduction to State of the Art Models. Proteomics 2025; 25:e202400398. [PMID: 40211610 PMCID: PMC12076536 DOI: 10.1002/pmic.202400398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2024] [Revised: 03/14/2025] [Accepted: 03/17/2025] [Indexed: 05/15/2025]
Abstract
This review explores state of the art machine learning and deep learning models for peptide property prediction in mass spectrometry-based proteomics, including, but not limited to, models for predicting digestibility, retention time, charge state distribution, collisional cross section, fragmentation ion intensities, and detectability. The combination of these models enables not only the in silico generation of spectral libraries but also finds many additional use cases in the design of targeted assays or data-driven rescoring. This review serves as both an introduction for newcomers and an update for experienced researchers aiming to develop accessible and reproducible models for peptide property predictions. Key limitations of the current models, including difficulties in handling diverse post-translational modifications and instrument variability, highlight the need for large-scale, harmonized datasets, and standardized evaluation metrics for benchmarking.
Affiliation(s)
- Jesse Angelis
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Eva Ayla Schröder
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Zixuan Xiao
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Wassim Gabriel
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
- Mathias Wilhelm
  - Computational Mass Spectrometry, Technical University of Munich, Freising, Germany
  - Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany
5
Arina P, Ferrari D, Kaczorek MR, Tetlow N, Dewar A, Stephens R, Martin D, Moonesinghe R, Singer M, Whittle J, Mazomenos EB. Assessing perioperative risks in a mixed elderly surgical population using machine learning: A multi-objective symbolic regression approach to cardiorespiratory fitness derived from cardiopulmonary exercise testing. PLOS Digit Health 2025; 4:e0000851. [PMID: 40378351 DOI: 10.1371/journal.pdig.0000851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2024] [Accepted: 04/05/2025] [Indexed: 05/18/2025]
Abstract
Accurate preoperative risk assessment is of great value to both patients and clinical teams. Several risk scores have been developed but are often not calibrated to the local institution, limited in terms of data input into the underlying models, and/or lack individual precision. Machine Learning (ML) models have the potential to address limitations in existing scoring systems. A database of 1190 elderly patients who underwent major elective surgery was analyzed retrospectively. Preoperative cardiorespiratory fitness data from cardiopulmonary exercise testing (CPET), demographic data, and clinical data were extracted and integrated into advanced ML algorithms. Multi-Objective-Symbolic-Regression (MOSR), a novel algorithm utilizing Genetic Programming to generate mathematical formulae for learning tasks, was employed to predict patient morbidity at Postoperative Day 3, as defined by the PostOperative Morbidity Survey (POMS). Shapley-Additive-exPlanations (SHAP) was subsequently used to analyze feature contributions. Model performance was benchmarked against existing risk prediction scores, namely the Portsmouth-Physiological-and-Operative-Severity-Score-for-the-Enumeration-of-Mortality-and-Morbidity (PPOSSUM) and the Duke-Activity-Status-Index, as well as linear regression using CPET features. A model was also developed for the same task using data directly extracted from the CPET time-series. The incorporation of cardiorespiratory fitness data enhanced the performance of all models for predicting postoperative morbidity by 20% compared to sole reliance on clinical data. Cardiorespiratory fitness features demonstrated greater importance than clinical features in the SHAP analysis. Models utilizing data taken directly from the CPET time-series demonstrated a 12% improvement over the cardiorespiratory fitness models. The MOSR model surpassed all other models in every experiment, demonstrating excellent robustness and generalization capabilities.
Integrating cardiorespiratory fitness data with ML models enables improved preoperative prediction of postoperative morbidity in elective surgical patients. The MOSR model stands out for its capacity to pinpoint essential features and build models that are both simple and accurate, showing excellent generalizability.
Affiliation(s)
- Pietro Arina
  - Bloomsbury Institute of Intensive Care Medicine, University College London, London, United Kingdom
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Davide Ferrari
  - Department of Population Health Sciences, King's College London, London, United Kingdom
- Maciej R Kaczorek
  - Wellcome/EPSRC Centre of Interventional and Surgical Sciences and Department of Medical Physics and Biomedical Engineering, University College London, London, United Kingdom
- Nicholas Tetlow
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Amy Dewar
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Robert Stephens
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Daniel Martin
  - Peninsula Medical School, University of Plymouth, Plymouth, Devon, United Kingdom
- Ramani Moonesinghe
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Mervyn Singer
  - Bloomsbury Institute of Intensive Care Medicine, University College London, London, United Kingdom
- John Whittle
  - Human Physiology and Performance Laboratory, Centre for Perioperative Medicine, Department of Targeted Intervention, University College London, London, United Kingdom
- Evangelos B Mazomenos
  - Wellcome/EPSRC Centre of Interventional and Surgical Sciences and Department of Medical Physics and Biomedical Engineering, University College London, London, United Kingdom
6
Yu X, Wang W, Wu R, Gong X, Ji Y, Feng Z. Construction of a machine learning-based interpretable prediction model for acute kidney injury in hospitalized patients. Sci Rep 2025; 15:9313. [PMID: 40102467 PMCID: PMC11920398 DOI: 10.1038/s41598-025-90459-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2024] [Accepted: 02/13/2025] [Indexed: 03/20/2025] Open
Abstract
In this observational study, we used data from 59,936 hospitalized adults to construct a model. For the models constructed with all 53 variables, all five models achieved acceptable performance with the validation cohort, with the extreme gradient boosting (XGBoost) model showing the best predictive efficacy and stability (area under the curve (AUC), 0.9301). For the simpler models constructed with 39 significant variables screened by the random forest recursive feature elimination method, the XGBoost model also had the best performance (AUC, 0.9357). All the models showed significant net benefit on decision curve analysis, and the XGBoost model achieved the best results. In addition, the Shapley additive explanation (SHAP) importance matrices revealed that uric acid, colloidal solution, first creatinine value on admission, pulse, and albumin were the five most important variables for both modeling strategies. In the external validation cohort of 4022 hospitalized patients, the performance of all models declined; the support vector machine (SVM) model showed the best predictive efficacy (AUC, 0.8230 and 0.8329), followed by the XGBoost model (AUC, 0.8124 and 0.8316). Thus, our model can predict the occurrence and risk of acute kidney injury (AKI) up to 48 h in advance, enabling clinicians to assess the risk of AKI in hospitalized patients more accurately and intuitively and to develop necessary AKI management strategies.
Affiliation(s)
- Xiang Yu
  - First Medical Center of Chinese PLA General Hospital, Department of Nephrology, First Medical Center of Chinese PLA General Hospital, State Key Laboratory of Kidney Diseases, National Clinical Research Center for Kidney Diseases, Beijing Key Laboratory of Medical Devices and Integrated Traditional Chinese and Western Drug Development for Severe Kidney Diseases, Beijing Key Laboratory of Digital Intelligent TCM for the Prevention and Treatment of Pan-vascular Diseases, Key Disciplines of National Administration of Traditional Chinese Medicine (zyyzdxk-2023310), Beijing, 100853, China
- WanLing Wang
  - Medical Innovation Research Division, Chinese PLA General Hospital, Beijing, 100853, China
- RiLiGe Wu
  - Medical Innovation Research Division, Chinese PLA General Hospital, Beijing, 100853, China
- XinYan Gong
  - First Medical Center of Chinese PLA General Hospital, Department of Nephrology, First Medical Center of Chinese PLA General Hospital, State Key Laboratory of Kidney Diseases, National Clinical Research Center for Kidney Diseases, Beijing Key Laboratory of Medical Devices and Integrated Traditional Chinese and Western Drug Development for Severe Kidney Diseases, Beijing Key Laboratory of Digital Intelligent TCM for the Prevention and Treatment of Pan-vascular Diseases, Key Disciplines of National Administration of Traditional Chinese Medicine (zyyzdxk-2023310), Beijing, 100853, China
- YuWei Ji
  - First Medical Center of Chinese PLA General Hospital, Department of Nephrology, First Medical Center of Chinese PLA General Hospital, State Key Laboratory of Kidney Diseases, National Clinical Research Center for Kidney Diseases, Beijing Key Laboratory of Medical Devices and Integrated Traditional Chinese and Western Drug Development for Severe Kidney Diseases, Beijing Key Laboratory of Digital Intelligent TCM for the Prevention and Treatment of Pan-vascular Diseases, Key Disciplines of National Administration of Traditional Chinese Medicine (zyyzdxk-2023310), Beijing, 100853, China
- Zhe Feng
  - First Medical Center of Chinese PLA General Hospital, Department of Nephrology, First Medical Center of Chinese PLA General Hospital, State Key Laboratory of Kidney Diseases, National Clinical Research Center for Kidney Diseases, Beijing Key Laboratory of Medical Devices and Integrated Traditional Chinese and Western Drug Development for Severe Kidney Diseases, Beijing Key Laboratory of Digital Intelligent TCM for the Prevention and Treatment of Pan-vascular Diseases, Key Disciplines of National Administration of Traditional Chinese Medicine (zyyzdxk-2023310), Beijing, 100853, China
7
Charoenkwan P, Chumnanpuen P, Schaduangrat N, Shoombuatong W. Stack-AVP: A Stacked Ensemble Predictor Based on Multi-view Information for Fast and Accurate Discovery of Antiviral Peptides. J Mol Biol 2025; 437:168853. [PMID: 39510347 DOI: 10.1016/j.jmb.2024.168853] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 07/11/2024] [Revised: 10/22/2024] [Accepted: 10/31/2024] [Indexed: 11/15/2024]
Abstract
AVPs, or antiviral peptides, are short chains of amino acids capable of inhibiting viral replication, preventing viral entry, or disrupting viral membranes. They represent a promising area of research for developing new antiviral therapies due to their potential to target a broad spectrum of viruses, including those resistant to traditional antiviral drugs. However, traditional experimental methods for identifying AVPs are often costly and labour-intensive. Thus far, multiple computational methods have been introduced for the in silico identification of AVPs, but these methods still have certain shortcomings. In this study, we propose a novel stacked ensemble learning framework, termed Stack-AVP, for fast and accurate AVP identification. In Stack-AVP, we investigated heterogeneous prediction models, which were trained with 12 commonly used machine learning algorithms coupled with a wide range of feature encoding schemes. Subsequently, these prediction models were adopted to generate multi-view features providing class information and probability information. Finally, we applied our feature selection method to determine the best feature subset for the construction of the final stacked model. Comparative assessments on the independent test dataset revealed that Stack-AVP surpassed the performance of current state-of-the-art methods, with an accuracy of 0.930, MCC of 0.860, and AUC of 0.975. Furthermore, it was found that our multi-view features played a crucial role in improving the prediction performance for AVPs. To facilitate experimental scientists in performing high-throughput identification of AVPs, the prediction server Stack-AVP is publicly accessible at https://pmlabqsar.pythonanywhere.com/Stack-AVP.
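The accuracy and Matthews correlation coefficient (MCC) quoted in this abstract are both functions of a binary confusion matrix. The sketch below shows the standard definitions; it is illustrative only, the function names are hypothetical, and the counts are invented rather than taken from the Stack-AVP test set.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented counts for a balanced 100-peptide test set (illustration only).
print(accuracy(tp=45, tn=45, fp=5, fn=5))  # 0.9
print(mcc(tp=45, tn=45, fp=5, fn=5))       # 0.8
```

On a balanced set with symmetric errors, as in this toy case, MCC equals 2 x accuracy - 1, which is consistent with the reported pair of 0.930 and 0.860.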
Affiliation(s)
- Phasit Charoenkwan
  - Modern Management and Information Technology, College of Arts, Media and Technology, Chiang Mai University, Chiang Mai 50200, Thailand
- Pramote Chumnanpuen
  - Department of Zoology, Faculty of Science, Kasetsart University, Bangkok 10900, Thailand; Kasetsart University International College (KUIC), Kasetsart University, Bangkok 10900, Thailand
- Nalini Schaduangrat
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
- Watshara Shoombuatong
  - Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand
8
Zhang SY, Zhang YD, Li H, Wang QY, Ye QF, Wang XM, Xia TH, He YE, Rong X, Wu TT, Wu RZ. Explainable machine learning model for predicting decline in platelet count after interventional closure in children with patent ductus arteriosus. Front Pediatr 2025; 13:1519002. [PMID: 39981204 PMCID: PMC11839778 DOI: 10.3389/fped.2025.1519002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2024] [Accepted: 01/20/2025] [Indexed: 02/22/2025] Open
Abstract
Background: This study aimed to apply four machine learning algorithms to develop the optimal model to predict decline in platelet count (DPC) after interventional closure in children with patent ductus arteriosus (PDA). Methods: Data from children with PDA who underwent successful transcatheter closure at the Second Affiliated Hospital of Wenzhou Medical University and Yuying Children's Hospital from January 2016 to December 2022 were collected. The cohort data were split into training and testing sets. DPC following the intervention was defined as a percentage DPC ≥25% [(baseline platelet count - nadir platelet count)/baseline platelet count]. The extra tree algorithm was used for feature selection, and four ML algorithms [random forest (RF), adaptive boosting, extreme gradient boosting, and logistic regression] were established. Moreover, SHapley Additive exPlanation (SHAP) was used to explain the importance of features and the ML models. Results: This study included 330 children who underwent successful transcatheter closure of PDA, of whom 113 (34.2%) experienced DPC. After 62 clinical features were considered, the extra tree algorithm selected six clinical features to build the ML models. Amongst the four ML algorithms, the RF model achieved the greatest AUC. SHAP analysis revealed that pulmonary artery systolic pressure, size of defect, and weight were the top three most important clinical features in the RF model. Furthermore, clinical descriptions of two children with PDA, with accurate predictions, and explanations of the prediction results were provided. Conclusion: In this study, an ML model (RF) capable of predicting post-intervention DPC in children with PDA undergoing transcatheter closure was established.
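The DPC criterion given in the Methods of this abstract, (baseline platelet count - nadir platelet count)/baseline platelet count ≥ 25%, can be written as a small function. This is a minimal sketch: the function names are hypothetical and the platelet counts in the example are invented, not taken from the study.

```python
def percentage_dpc(baseline: float, nadir: float) -> float:
    """Percentage decline in platelet count: (baseline - nadir) / baseline * 100."""
    if baseline <= 0:
        raise ValueError("baseline platelet count must be positive")
    # Multiply before dividing to limit floating-point rounding.
    return (baseline - nadir) * 100.0 / baseline


def has_dpc(baseline: float, nadir: float, threshold: float = 25.0) -> bool:
    """True when the percentage decline meets the >=25% criterion."""
    return percentage_dpc(baseline, nadir) >= threshold


# Invented platelet counts (x10^9/L), for illustration only.
print(percentage_dpc(300, 210))  # 30.0
print(has_dpc(300, 210))         # True
print(has_dpc(300, 240))         # False
```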
Affiliation(s)
- Song-Yue Zhang
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Yi-Dong Zhang
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Hao Li
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Qiao-Yu Wang
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Xun-Min Wang
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Tian-He Xia
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Yue-E He
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Xing Rong
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Ting-Ting Wu
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
- Rong-Zhou Wu
  - Children's Heart Center, The Second Affiliated Hospital and Yuying Children's Hospital of Wenzhou Medical University, Wenzhou, China
9
Lin TC, Chiueh PT, Hsiao TC. Challenges in Observation of Ultrafine Particles: Addressing Estimation Miscalculations and the Necessity of Temporal Trends. Environ Sci Technol 2025; 59:565-577. [PMID: 39670560 PMCID: PMC11741106 DOI: 10.1021/acs.est.4c07460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2024] [Revised: 11/29/2024] [Accepted: 12/02/2024] [Indexed: 12/14/2024]
Abstract
Ultrafine particles (UFPs) pose a significant health risk, making comprehensive assessment essential. The influence of emission sources on particle concentrations is not only constrained by meteorological conditions but often intertwined with them, making it challenging to separate these effects. This study utilized long-term particle number and size distribution (PNSD) data from 2018 to 2023 to develop a tree-based machine learning model enhanced with an interpretable component, incorporating temporal markers to characterize background or time series residuals. Our results demonstrated that, unlike PM2.5, which is significantly shaped by planetary boundary layer height, the particle number concentration (PNC) is determined largely by wind speed and shows strong regional specificity. Furthermore, we systematically identified and analyzed anthropogenically influenced periodic trends. Notably, while Aitken mode observations are initially linked to traffic-related peaks, both Aitken and nucleation modes contribute to short-term concentration peaks during rush hour periods after deweathering adjustment. Pollutant baseline concentrations are largely driven by human activities, with meteorological factors modulating their variability, and the secondary formation of UFPs is likely reflected in temporal residuals. This study provides a flexible framework for isolating meteorological effects, allowing more accurate assessment of anthropogenic impacts and targeted management strategies for UFP and PNC.
Affiliation(s)
- Tzu-Chi Lin
- Graduate Institute of Environmental Engineering, College of Engineering, National Taiwan University, 71 Chou-Shan Road, Taipei 106, Taiwan
| | - Pei-Te Chiueh
- Graduate Institute of Environmental Engineering, College of Engineering, National Taiwan University, 71 Chou-Shan Road, Taipei 106, Taiwan
| | - Ta-Chih Hsiao
- Graduate Institute of Environmental Engineering, College of Engineering, National Taiwan University, 71 Chou-Shan Road, Taipei 106, Taiwan
- Research Center for Environmental Changes, Academia Sinica, Taipei 115, Taiwan
| |
|
10
|
Obeidat R, Alsmadi I, Baker QB, Al-Njadat A, Srinivasan S. Researching public health datasets in the era of deep learning: a systematic literature review. Health Informatics J 2025; 31:14604582241307839. [PMID: 39794941 DOI: 10.1177/14604582241307839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/13/2025]
Abstract
Objective: Explore deep learning applications in predictive analytics for public health data, identify challenges and trends, and then understand the current landscape. Materials and Methods: A systematic literature review was conducted in June 2023 to search articles on public health data in the context of deep learning, published from the inception of medical and computer science databases through June 2023. The review focused on diverse datasets, abstracting applications, challenges, and advancements in deep learning. Results: 2004 articles were reviewed, identifying 14 disease categories. Observed trends include explainable AI, patient embedding learning, integration of different data sources, and use of deep learning models in health informatics. Noted challenges were technical reproducibility and handling sensitive data. Discussion: There has been a notable surge in publications on deep learning applications to public health data since 2015. Consistent deep learning applications and models continue to be applied across public health data. Despite the wide applications, a standard approach still does not exist for addressing the outstanding challenges and issues in this field. Conclusion: Guidelines are needed for applying deep learning and models in public health data to improve FAIRness, efficiency, transparency, comparability, and interoperability of research. Interdisciplinary collaboration among data scientists, public health experts, and policymakers is needed to harness the full potential of deep learning.
Affiliation(s)
- Rand Obeidat
- Department of Management Information Systems, Bowie State University, Bowie, USA
| | - Izzat Alsmadi
- Department of Computational, Engineering and Mathematical Sciences, Texas A & M San Antonio, San Antonio, USA
| | - Qanita Bani Baker
- Department of Computer Science, Jordan University of Science and Technology, Irbid, Jordan
| | | | - Sriram Srinivasan
- Department of Management Information Systems, Bowie State University, Bowie, USA
| |
|
11
|
Wang J, Xi R, Wang Y, Gao H, Gao M, Zhang X, Zhang L, Zhang Y. Toward molecular diagnosis of major depressive disorder by plasma peptides using a deep learning approach. Brief Bioinform 2024; 26:bbae554. [PMID: 39592240 PMCID: PMC11596692 DOI: 10.1093/bib/bbae554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Revised: 09/30/2024] [Accepted: 10/01/2024] [Indexed: 11/28/2024] Open
Abstract
Major depressive disorder (MDD) is a severe psychiatric disorder that currently lacks any objective diagnostic markers. Here, we develop a deep learning approach to discover the mass spectrometric features that can discriminate MDD patients from healthy controls. Using plasma peptides, the neural network, termed CMS-Net, can perform diagnosis and prediction with an accuracy of 0.9441. The sensitivity and specificity reached 0.9352 and 0.9517, respectively, and the area under the curve was enhanced to 0.9634. Using the gradient-based feature importance method to interpret crucial features, we identify 28 differential peptide sequences from 14 precursor proteins (e.g. hemoglobin, immunoglobulin, albumin, etc.). This work highlights the possibility of molecular diagnosis of MDD with the aid of chemical and computer science.
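As a reading aid for the accuracy, sensitivity, and specificity figures quoted above, the following minimal sketch shows how these three metrics follow from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the paper and do not reproduce CMS-Net.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total      # fraction of all cases classified correctly
    sensitivity = tp / (tp + fn)      # true-positive rate: patients correctly flagged
    specificity = tn / (tn + fp)      # true-negative rate: controls correctly cleared
    return accuracy, sensitivity, specificity


# Hypothetical counts (assumption for illustration only, not from the paper).
acc, sens, spec = classification_metrics(tp=90, fp=5, tn=95, fn=10)
print(f"accuracy={acc:.3f} sensitivity={sens:.3f} specificity={spec:.3f}")
```

Because accuracy pools both classes, it always lies between sensitivity and specificity, which is consistent with the ordering of the three values reported in the abstract.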
Affiliation(s)
- Jiaqi Wang
- School of Traditional Chinese Materia Medica, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe District, Shenyang 110016, China
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, Liaoning, China
| | - Ronggang Xi
- The 967th Hospital of the Joint Logistics Support Force of PLA, 80 Shengli Road, Xigang District, Dalian 116021, Liaoning, China
| | - Yi Wang
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, Liaoning, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Huiyuan Gao
- School of Traditional Chinese Materia Medica, Shenyang Pharmaceutical University, 103 Wenhua Road, Shenhe District, Shenyang 110016, China
| | - Ming Gao
- School of Management Science and Engineering, Key Laboratory of Big Data Management Optimization and Decision of Liaoning Province, Dongbei University of Finance and Economics, No. 217 Jianshan Street, Shahekou District, Dalian 116025, Liaoning, China
| | - Xiaozhe Zhang
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, Liaoning, China
| | - Lihua Zhang
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, Liaoning, China
| | - Yukui Zhang
- CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, 457 Zhongshan Road, Dalian 116023, Liaoning, China
| |
|
12
|
Yuan Z, Peng J, Shu Z, Qin X, Zhong J. Interpretable multitemporal liver function indicator model for prediction and risk factor analysis of drug induced liver injury. Sci Rep 2024; 14:21285. [PMID: 39261535 PMCID: PMC11390907 DOI: 10.1038/s41598-024-66952-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2023] [Accepted: 07/05/2024] [Indexed: 09/13/2024] Open
Abstract
The occurrence of liver injury during cancer treatment is extremely harmful. The risk factors for drug-induced liver injury (DILI) in the pancreatic cancer population have not been investigated. This study aims to develop and validate an interpretable decision tree (DT) model for the early prediction of DILI in pancreatic cancer patients using multitemporal clinical data and to screen for related risk factors. Data were collected retrospectively from 307 patients; the training set (n = 215) was used to develop the model, and the test set (n = 92) was used to evaluate it. The classification and regression trees algorithm was employed to establish the DT model. The SHapley Additive exPlanations (SHAP) method was used to facilitate clinical interpretation. Model performance was assessed using AUC and the Hosmer‒Lemeshow test. The DT model exhibited superior diagnostic efficacy, with AUC values of 0.995 and 0.994 in the training and test sets, respectively. Four risk factors associated with DILI occurrence were identified: delta.albumin, delta.ALT, post (AST: ALT), and post.GGT. The multiperiod liver function indicator-based interpretable DT model predicted DILI occurrence in the pancreatic cancer population and contributes to personalized clinical management of pancreatic cancer patients.
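The AUC values reported above can be read as the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case. The sketch below computes AUC directly from that definition; the labels and scores are made-up illustrative values, and the function name is an assumption, not code from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = DILI, 0 = no DILI) and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of samples, which is acceptable for a cohort of a few hundred patients; rank-based formulas give the same value more efficiently on larger data.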
Affiliation(s)
- Zhongyu Yuan
- Department of Radiology, the Fourth Affiliated Hospital of School of Medicine, and International School of Medicine, International Institutes of Medicine, Zhejiang University, Yiwu, China, 322000, Yiwu, Zhejiang, China
| | - Jiaxuan Peng
- Cancer Center, Department of Radiology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Zhenyu Shu
- Cancer Center, Department of Radiology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Xue Qin
- Cancer Center, Department of Radiology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China
| | - Jianguo Zhong
- Cancer Center, Department of Radiology, Zhejiang Provincial People's Hospital, Affiliated People's Hospital, Hangzhou Medical College, Hangzhou, Zhejiang, China.
| |
|
13
|
Dhibar S, Jana B. Optimized Collective Variable for Collapse Transition in Linear Hydrophobic Polymers: Importance of Hydration Water and End-to-End Distance. J Chem Theory Comput 2024; 20:7404-7415. [PMID: 39252562 DOI: 10.1021/acs.jctc.4c00753] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/11/2024]
Abstract
Choosing an appropriate collective variable (CV) for any biomolecular process is a challenging task. Researchers are developing methods to solve this issue using a variety of methodologies, most recently machine learning (ML) methods. In this work, we investigate the mechanism of collapse transition across various lengths of polymer systems through adaptively sampled multiple short trajectories utilizing the Time Lagged Independent Component Analysis (TICA) framework. TICA analysis reveals that the radius of gyration (Rg) and end-to-end distance serve as good order parameters (OPs) for these systems, describing the overall energy landscapes. Markov state model (MSM) and mean first passage time (MFPT) analysis suggest that hydration water (Nw) plays a determining role in dictating the time scale and barrier for the collapse transition for the C40 system. P-fold analysis of the transition state ensembles (TSE) identified by committor analysis also strengthens the role of Nw in such a transition. TICA, MSM, and committor analyses on the collapse transition for C45 reveal similarities with C40 systems in different aspects. Furthermore, we propose a pipeline integrating XGBoost regression along with an interpretable ML model, Shapley Additive exPlanation (SHAP), to precisely elucidate the contribution of each OP locally at the TSE. Through this approach, we observe that the collapse transition is primarily driven by Nw for both polymer systems. A carefully designed protocol for the collapse transition of C60 systems indirectly reiterates the above result. Overall, our results suggest that while the end-to-end distance should be considered for better resolution of metastable states in the landscape, Nw is the crucial coordinate to be used in enhanced sampling for the exploration of actual collapse transitions for linear hydrophobic polymer systems.
The Python code for analyzing the contribution of different OPs in the TSE using an ML-aided protocol is available on GitHub (https://github.com/saikat-ai/linear_polymer_project).
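To illustrate the idea behind the Shapley-based attribution mentioned above, here is a minimal sketch that computes exact Shapley values by averaging each feature's marginal contribution over all orderings. It is not the authors' pipeline and does not use XGBoost or the SHAP library (which relies on faster, model-specific algorithms); the toy value function, its coefficients, and the label "Ree" for the end-to-end distance are assumptions made purely for illustration.

```python
from itertools import permutations


def shapley_values(value_fn, features):
    """Exact Shapley values: mean marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value_fn(frozenset(present))
            present.add(f)
            after = value_fn(frozenset(present))
            totals[f] += after - before
    return {f: t / len(orderings) for f, t in totals.items()}


def toy_value(known):
    """Hypothetical model output given which order parameters are 'known'."""
    v = 0.0
    if "Nw" in known:
        v += 3.0                          # made-up main effect of hydration water
    if "Rg" in known:
        v += 1.0                          # made-up main effect of radius of gyration
    if "Nw" in known and "Ree" in known:
        v += 2.0                          # made-up interaction, split equally by Shapley
    return v


phi = shapley_values(toy_value, ["Nw", "Rg", "Ree"])
print(phi)
```

The attributions sum to the difference between the full-model output and the empty-set output, which is the additivity property that makes such values convenient for ranking OPs. Exact enumeration scales factorially with the number of features, so it is only practical for a handful of OPs.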
Affiliation(s)
- Saikat Dhibar
- School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
| | - Biman Jana
- School of Chemical Sciences, Indian Association for the Cultivation of Science, Jadavpur, Kolkata 700032, India
| |
|
14
|
Zhang X, Tseo Y, Bai Y, Chen F, Uhler C. Prediction of protein subcellular localization in single cells. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.25.605178. [PMID: 39091825 PMCID: PMC11291118 DOI: 10.1101/2024.07.25.605178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/04/2024]
Abstract
The subcellular localization of a protein is important for its function and interaction with other molecules, and its mislocalization is linked to numerous diseases. While atlas-scale efforts have been made to profile protein localization across various cell lines, existing datasets only contain limited pairs of proteins and cell lines which do not cover all human proteins. We present a method that uses both protein sequences and cellular landmark images to perform Predictions of Unseen Proteins' Subcellular localization (PUPS), which can generalize to both proteins and cell lines not used for model training. PUPS combines a protein language model and an image inpainting model to utilize both protein sequence and cellular images for protein localization prediction. The protein sequence input enables generalization to unseen proteins and the cellular image input enables cell type specific prediction that captures single-cell variability. PUPS' ability to generalize to unseen proteins and cell lines enables us to assess the variability in protein localization across cell lines as well as across single cells within a cell line and to identify the biological processes associated with the proteins that have variable localization. Experimental validation shows that PUPS can be used to predict protein localization in newly performed experiments outside of the Human Protein Atlas used for training. Collectively, PUPS utilizes both protein sequences and cellular images to predict protein localization in unseen proteins and cell lines with the ability to capture single-cell variability.
Affiliation(s)
- Xinyi Zhang
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A
- Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, U.S.A
| | - Yitong Tseo
- Computational and Systems Biology Program, Massachusetts Institute of Technology, U.S.A
| | - Yunhao Bai
- Broad Institute of MIT and Harvard, U.S.A
| | - Fei Chen
- Broad Institute of MIT and Harvard, U.S.A
- Department of Stem Cell and Regenerative Biology, Harvard University, U.S.A
| | - Caroline Uhler
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A
- Eric and Wendy Schmidt Center, Broad Institute of MIT and Harvard, U.S.A
| |
|
15
|
Borole P, Rajan A. Building trust in deep learning-based immune response predictors with interpretable explanations. Commun Biol 2024; 7:279. [PMID: 38448546 PMCID: PMC10917751 DOI: 10.1038/s42003-024-05968-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2023] [Accepted: 02/23/2024] [Indexed: 03/08/2024] Open
Abstract
The ability to predict whether a peptide will be presented on Major Histocompatibility Complex (MHC) class I molecules has profound implications for designing vaccines. Numerous deep learning-based predictors for peptide presentation on MHC class I molecules exist with high levels of accuracy. However, these MHC class I predictors are treated as black-box functions, providing little insight into their decision making. To build trust in these predictors, it is crucial to understand the rationale behind their decisions with human-interpretable explanations. We present MHCXAI, eXplainable AI (XAI) techniques to help interpret the outputs from MHC class I predictors in terms of input peptide features. In our experiments, we explain the outputs of four state-of-the-art MHC class I predictors over a large dataset of peptides and MHC alleles. Additionally, we evaluate the reliability of the explanations by comparing against ground truth and checking their robustness. MHCXAI seeks to increase understanding of deep learning-based predictors in the immune response domain and build trust with validated explanations.
Affiliation(s)
- Piyush Borole
- School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton St, Newington, Edinburgh, EH8 9AB, Scotland, UK.
| | - Ajitha Rajan
- School of Informatics, University of Edinburgh, Informatics Forum, 10 Crichton St, Newington, Edinburgh, EH8 9AB, Scotland, UK.
| |
|
16
|
Kawamura N, Sato W, Shimokawa K, Fujita T, Kawanishi Y. Machine Learning-Based Interpretable Modeling for Subjective Emotional Dynamics Sensing Using Facial EMG. SENSORS (BASEL, SWITZERLAND) 2024; 24:1536. [PMID: 38475072 DOI: 10.3390/s24051536] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2023] [Revised: 02/03/2024] [Accepted: 02/26/2024] [Indexed: 03/14/2024]
Abstract
Understanding the association between subjective emotional experiences and physiological signals is of practical and theoretical significance. Previous psychophysiological studies have shown a linear relationship between dynamic emotional valence experiences and facial electromyography (EMG) activities. However, whether and how subjective emotional valence dynamics relate to facial EMG changes nonlinearly remains unknown. To investigate this issue, we re-analyzed the data of two previous studies that measured dynamic valence ratings and facial EMG of the corrugator supercilii and zygomatic major muscles from 50 participants who viewed emotional film clips. We employed multilinear regression analyses and two nonlinear machine learning (ML) models: random forest and long short-term memory. In cross-validation, these ML models outperformed linear regression in terms of the mean squared error and correlation coefficient. Interpretation of the random forest model using the SHapley Additive exPlanation tool revealed nonlinear and interactive associations between several EMG features and subjective valence dynamics. These findings suggest that nonlinear ML models can better fit the relationship between subjective emotional valence dynamics and facial EMG than conventional linear models and highlight a nonlinear and complex relationship. The findings encourage emotion sensing using facial EMG and offer insight into the subjective-physiological association.
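The model comparison described above rests on two metrics, mean squared error and the correlation coefficient. The sketch below computes both with the standard library only; the rating and prediction values are invented for illustration and are not data from the study, and the abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical valence ratings and two hypothetical models' predictions.
ratings = [1.0, 2.0, 3.0, 4.0]
model_a = [1.5, 2.5, 3.5, 4.5]   # constant offset: perfect correlation, nonzero error
model_b = [1.0, 2.0, 3.0, 5.0]   # one large miss: same MSE, lower correlation

mse_a, r_a = mean_squared_error(ratings, model_a), pearson_r(ratings, model_a)
mse_b, r_b = mean_squared_error(ratings, model_b), pearson_r(ratings, model_b)
print(f"model_a: MSE={mse_a:.3f} r={r_a:.3f}")
print(f"model_b: MSE={mse_b:.3f} r={r_b:.3f}")
```

The two toy models share the same mean squared error but differ in correlation, which is one reason to report both metrics when comparing fits.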
Affiliation(s)
- Naoya Kawamura
- Computational Cognitive Neuroscience Laboratory, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo, Kyoto 606-8501, Japan
- Psychological Process Team, Guardian Robot Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
| | - Wataru Sato
- Computational Cognitive Neuroscience Laboratory, Graduate School of Informatics, Kyoto University, Yoshida-Honmachi, Sakyo, Kyoto 606-8501, Japan
- Psychological Process Team, Guardian Robot Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
| | - Koh Shimokawa
- Psychological Process Team, Guardian Robot Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
| | - Tomohiro Fujita
- Multimodal Data Recognition Research Team, Guardian Robot Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
| | - Yasutomo Kawanishi
- Multimodal Data Recognition Research Team, Guardian Robot Project, RIKEN, 2-2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan
| |
|
17
|
Will A, Oliinyk D, Bleiholder C, Meier F. Peptide collision cross sections of 22 post-translational modifications. Anal Bioanal Chem 2023; 415:6633-6645. [PMID: 37758903 PMCID: PMC10598134 DOI: 10.1007/s00216-023-04957-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/23/2022] [Revised: 07/13/2023] [Accepted: 08/23/2023] [Indexed: 09/29/2023]
Abstract
Recent advances have rekindled the interest in ion mobility as an additional dimension of separation in mass spectrometry (MS)-based proteomics. Ion mobility separates ions according to their size and shape in the gas phase. Here, we set out to investigate the effect of 22 different post-translational modifications (PTMs) on the collision cross section (CCS) of peptides. In total, we analyzed ~4300 pairs of matching modified and unmodified peptide ion species by trapped ion mobility spectrometry (TIMS). Linear alignment based on spike-in reference peptides resulted in highly reproducible CCS values with a median coefficient of variation of 0.26%. On a global level, we observed a redistribution in the m/z vs. ion mobility space for modified peptides upon changes in their charge state. Pairwise comparison between modified and unmodified peptides of the same charge state revealed median shifts in CCS between -1.4% (arginine citrullination) and +4.5% (O-GlcNAcylation). In general, increasing modified peptide masses were correlated with higher CCS values, in particular within homologous PTM series. However, investigating the ion populations in more detail, we found that the change in CCS can vary substantially for a given PTM and is partially correlated with the gas phase structure of its unmodified counterpart. In conclusion, our study shows PTM- and sequence-specific effects on the cross section of peptides, which could be further leveraged for proteome-wide PTM analysis.
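Two quantities in the abstract above are simple ratios: the coefficient of variation of replicate CCS measurements and the relative CCS shift between a modified peptide and its unmodified counterpart. The following sketch shows one way to compute them; all numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not state which estimator was used.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, expressed in percent."""
    return 100.0 * stdev(values) / mean(values)


def relative_shift_percent(ccs_modified, ccs_unmodified):
    """Relative CCS change of a modified peptide versus its unmodified form."""
    return 100.0 * (ccs_modified - ccs_unmodified) / ccs_unmodified


# Hypothetical replicate CCS values for one peptide ion (arbitrary area units).
replicates = [400.0, 401.0, 399.0]
cv = coefficient_of_variation(replicates)

# Hypothetical (modified, unmodified) CCS pairs at the same charge state.
pairs = [(408.0, 400.0), (505.0, 500.0), (297.0, 300.0)]
shifts = [relative_shift_percent(m, u) for m, u in pairs]
median_shift = median(shifts)

print(f"CV = {cv:.2f}%  median shift = {median_shift:+.1f}%")
```

Reporting the median rather than the mean shift keeps a few unusually large changes from dominating the summary for a given modification.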
Affiliation(s)
- Andreas Will
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany
| | - Denys Oliinyk
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany
| | - Christian Bleiholder
- Department of Chemistry and Biochemistry, Florida State University, Tallahassee, FL, 32304, USA
| | - Florian Meier
- Functional Proteomics, Jena University Hospital, Am Klinikum 1, 07747, Jena, Germany.
| |
|
18
|
Karim MR, Islam T, Shajalal M, Beyan O, Lange C, Cochez M, Rebholz-Schuhmann D, Decker S. Explainable AI for Bioinformatics: Methods, Tools and Applications. Brief Bioinform 2023; 24:bbad236. [PMID: 37478371 DOI: 10.1093/bib/bbad236] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 02/10/2023] [Revised: 05/10/2023] [Accepted: 05/26/2023] [Indexed: 07/23/2023] Open
Abstract
Artificial intelligence (AI) systems utilizing deep neural networks and machine learning (ML) algorithms are widely used for solving critical problems in bioinformatics, biomedical informatics and precision medicine. However, complex ML models that are often perceived as opaque and black-box methods make it difficult to understand the reasoning behind their decisions. This lack of transparency can be a challenge for both end-users and decision-makers, as well as AI developers. In sensitive areas such as healthcare, explainability and accountability are not only desirable properties but also legally required for AI systems that can have a significant impact on human lives. Fairness is another growing concern, as algorithmic decisions should not show bias or discrimination towards certain groups or individuals based on sensitive attributes. Explainable AI (XAI) aims to overcome the opaqueness of black-box models and to provide transparency in how AI systems make decisions. Interpretable ML models can explain how they make predictions and identify factors that influence their outcomes. However, the majority of the state-of-the-art interpretable ML methods are domain-agnostic and have evolved from fields such as computer vision, automated reasoning or statistics, making direct application to bioinformatics problems challenging without customization and domain adaptation. In this paper, we discuss the importance of explainability and algorithmic transparency in the context of bioinformatics. We provide an overview of model-specific and model-agnostic interpretable ML methods and tools and outline their potential limitations. We discuss how existing interpretable ML methods can be customized and fit to bioinformatics research problems. Further, through case studies in bioimaging, cancer genomics and text mining, we demonstrate how XAI methods can improve transparency and decision fairness. 
Our review aims at providing valuable insights and serving as a starting point for researchers wanting to enhance explainability and decision transparency while solving bioinformatics problems. GitHub: https://github.com/rezacsedu/XAI-for-bioinformatics.
Affiliation(s)
- Md Rezaul Karim
- Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany
- Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Germany
| | - Tanhim Islam
- Computer Science 9 - Process and Data Science, RWTH Aachen University, Germany
| | | | - Oya Beyan
- Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany
- University of Cologne, Faculty of Medicine and University Hospital Cologne, Institute for Medical Informatics, Germany
| | - Christoph Lange
- Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany
- Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Germany
| | - Michael Cochez
- Department of Computer Science, Vrije Universiteit Amsterdam, the Netherlands
- Elsevier Discovery Lab, Amsterdam, the Netherlands
| | - Dietrich Rebholz-Schuhmann
- ZBMED - Information Center for Life Sciences, Cologne, Germany
- Faculty of Medicine, University of Cologne, Germany
| | - Stefan Decker
- Computer Science 5 - Information Systems and Databases, RWTH Aachen University, Germany
- Department of Data Science and Artificial Intelligence, Fraunhofer FIT, Germany
| |
|
19
|
Hartout P, Počuča B, Méndez-García C, Schleberger C. Investigating the human and nonobese diabetic mouse MHC class II immunopeptidome using protein language modeling. Bioinformatics 2023; 39:btad469. [PMID: 37527005 PMCID: PMC10421966 DOI: 10.1093/bioinformatics/btad469] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 06/17/2023] [Accepted: 07/31/2023] [Indexed: 08/03/2023] Open
Abstract
MOTIVATION: Identifying peptides associated with the major histocompatibility complex class II (MHCII) is a central task in the evaluation of the immunoregulatory function of therapeutics and drug prototypes. MHCII-peptide presentation prediction has multiple biopharmaceutical applications, including the safety assessment of biologics and engineered derivatives in silico, or the fast progression of antigen-specific immunomodulatory drug discovery programs in immune disease and cancer. This has resulted in the collection of large-scale datasets on adaptive immune receptor antigenic responses and MHC-associated peptide proteomics. In parallel, recent deep learning algorithmic advances in protein language modeling have shown potential in leveraging large collections of sequence data to improve MHC presentation prediction. RESULTS: Here, we train a compact transformer model (AEGIS) on human and mouse MHCII immunopeptidome data, including a preclinical murine model, and evaluate its performance on the peptide presentation prediction task. We show that the transformer performs on par with existing deep learning algorithms and that combining datasets from multiple organisms increases model performance. We trained variants of the model with and without MHCII information. In both alternatives, the inclusion of peptides presented by the I-Ag7 MHC class II molecule expressed by nonobese diabetic mice enabled, for the first time, the accurate in silico prediction of presented peptides in a preclinical type 1 diabetes model organism, which has promising therapeutic applications. AVAILABILITY AND IMPLEMENTATION: The source code is available at https://github.com/Novartis/AEGIS.
Affiliation(s)
- Philip Hartout
- Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
| | - Bojana Počuča
- NIBR Research Informatics, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
| | - Celia Méndez-García
- Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
| | - Christian Schleberger
- Discovery Sciences, Novartis Institutes for Biomedical Research, Basel 4056, Switzerland
| |
|
20
|
Peng J, Wang L, Wang P, Pei Y. Density Functional Theory Computation and Machine Learning Studies of Interaction between Au3 Clusters and 20 Natural Amino Acid Molecules. ACS OMEGA 2023; 8:23024-23031. [PMID: 37396243 PMCID: PMC10308543 DOI: 10.1021/acsomega.3c02195] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/01/2023] [Accepted: 05/17/2023] [Indexed: 07/04/2023]
Abstract
The optimal adsorption sites and the binding energies of neutral Au3 clusters with 20 natural amino acids under the gas phase and water solvation were systematically investigated based on density functional theory (DFT) calculations. The calculation results showed that in the gas phase Au3 tends to bind with N atoms of amino groups in amino acids, except methionine, which tends to bind with Au3 through S atoms. Under water solvation, Au3 clusters tended to bind to N atoms of amino groups and N atoms of side chain amino groups in amino acids. However, methionine and cysteine bind more strongly to the gold atom through the S atom. Based on the binding energy data of Au3 clusters and 20 natural amino acids under water solvation calculated by DFT, a machine learning model (gradient boosted decision tree) was proposed to predict the optimal binding Gibbs free energy (ΔG) of the interaction between Au3 clusters and amino acids. The main factors affecting the strength of the interaction between Au3 and amino acids were uncovered by the feature importance analysis.
Affiliation(s)
- Jiao Peng
- Department of Chemistry, Key Laboratory for Green Organic Synthesis and Application of Hunan Province, Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Li Wang
- Department of Chemistry, Key Laboratory for Green Organic Synthesis and Application of Hunan Province, Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Pu Wang
- Department of Chemistry, Key Laboratory for Green Organic Synthesis and Application of Hunan Province, Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
| | - Yong Pei
- Department of Chemistry, Key Laboratory for Green Organic Synthesis and Application of Hunan Province, Key Laboratory of Environmentally Friendly Chemistry and Applications of Ministry of Education, Xiangtan University, Xiangtan, Hunan 411105, China
- School of Minerals Processing and Bioengineering, Central South University, Changsha, Hunan 410083, China
- State Key Laboratory of Complex Nonferrous Metal Resources Clean Utilization, Kunming 650093, China
| |
|
21
|
Ye W, Chen X, Li P, Tao Y, Wang Z, Gao C, Cheng J, Li F, Yi D, Wei Z, Yi D, Wu Y. OEDL: an optimized ensemble deep learning method for the prediction of acute ischemic stroke prognoses using union features. Front Neurol 2023; 14:1158555. [PMID: 37416306 PMCID: PMC10321134 DOI: 10.3389/fneur.2023.1158555] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/04/2023] [Accepted: 05/22/2023] [Indexed: 07/08/2023] Open
Abstract
Background: Early stroke prognosis assessments are critical for decision-making regarding therapeutic intervention. We introduced the concepts of data combination, method integration, and algorithm parallelization, aiming to build an integrated deep learning model based on a combination of clinical and radiomics features and to analyze its application value in prognosis prediction. Methods: The research steps in this study included data sourcing and feature extraction, data processing and feature fusion, model building and optimization, and model training. Using data from 441 stroke patients, clinical and radiomics features were extracted, and feature selection was performed. Clinical, radiomics, and combined features were included to construct predictive models. We applied the concept of deep integration to the joint analysis of multiple deep learning methods, used a metaheuristic algorithm to improve the parameter search efficiency, and finally developed an acute ischemic stroke (AIS) prognosis prediction method, namely, the optimized ensemble of deep learning (OEDL) method. Results: Among the clinical features, 17 passed the correlation check. Among the radiomics features, 19 were selected. In the comparison of prediction performance across methods, the OEDL method, based on the concept of ensemble optimization, had the best classification performance. In the comparison across feature sets, the combined features yielded better classification performance than the clinical or radiomics features alone. In the comparison across balancing methods, SMOTEENN, which is based on hybrid sampling, achieved better classification performance than the unbalanced, oversampled, and undersampled methods. The OEDL method with combined features and mixed sampling achieved the best classification performance, with 97.89, 95.74, 94.75, 94.03, and 94.35% for Macro-AUC, ACC, Macro-R, Macro-P, and Macro-F1, respectively, and achieved advanced performance compared with methods in previous studies. Conclusion: The OEDL approach proposed herein could effectively improve stroke prognosis prediction performance; the effect of combined data modeling was significantly better than that of single clinical or radiomics feature models, and the proposed method had better intervention guidance value. Our approach is beneficial for optimizing the early clinical intervention process and providing the necessary clinical decision support for personalized treatment.
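For readers unfamiliar with the macro-averaged scores reported in this abstract, the sketch below shows one common way to compute Macro-P, Macro-R, and Macro-F1: score each class separately, then take the unweighted mean across classes. The function name `macro_metrics` and the toy labels are illustrative assumptions, not taken from the paper, and the paper may use a different Macro-F1 convention (for example, the harmonic mean of Macro-P and Macro-R).

```python
def macro_metrics(y_true, y_pred):
    """Unweighted mean of per-class precision, recall, and F1.

    Assumption: Macro-F1 is the mean of per-class F1 scores, which is
    one common convention; other definitions exist.
    """
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(classes)
    return {
        "macro_p": sum(precisions) / k,
        "macro_r": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
    }


# Toy, made-up labels: four samples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
result = macro_metrics(y_true, y_pred)
print(result)
```

Because every class contributes equally to the mean, a minority class that is predicted poorly pulls the macro scores down even when overall accuracy stays high, which is why macro averages are often reported for imbalanced outcomes.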
Affiliation(s)
- Wei Ye
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Xicheng Chen
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Pengpeng Li
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Yongjun Tao
- Department of Neurology, Taizhou Municipal Hospital, Taizhou, Zhejiang, China
| | - Zhenyan Wang
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Chengcheng Gao
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Jian Cheng
- Department of Radiology, Taizhou Municipal Hospital, Taizhou, Zhejiang, China
| | - Fang Li
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Dali Yi
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
- Department of Health Education, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Zeliang Wei
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Dong Yi
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| | - Yazhou Wu
- Department of Health Statistics, College of Preventive Medicine, Army Medical University, Chongqing, China
| |
|
22
|
Steyaert S, Pizurica M, Nagaraj D, Khandelwal P, Hernandez-Boussard T, Gentles AJ, Gevaert O. Multimodal data fusion for cancer biomarker discovery with deep learning. Nat Mach Intell 2023; 5:351-362. [PMID: 37693852 PMCID: PMC10484010 DOI: 10.1038/s42256-023-00633-5] [Citation(s) in RCA: 79] [Impact Index Per Article: 39.5] [Received: 07/28/2022] [Accepted: 02/17/2023] [Indexed: 09/12/2023]
Abstract
Technological advances now make it possible to study a patient from multiple angles with high-dimensional, high-throughput multi-scale biomedical data. In oncology, massive amounts of data are being generated ranging from molecular, histopathology, radiology to clinical records. The introduction of deep learning has significantly advanced the analysis of biomedical data. However, most approaches focus on single data modalities leading to slow progress in methods to integrate complementary data types. Development of effective multimodal fusion approaches is becoming increasingly important as a single modality might not be consistent and sufficient to capture the heterogeneity of complex diseases to tailor medical care and improve personalised medicine. Many initiatives now focus on integrating these disparate modalities to unravel the biological processes involved in multifactorial diseases such as cancer. However, many obstacles remain, including lack of usable data as well as methods for clinical validation and interpretation. Here, we cover these current challenges and reflect on opportunities through deep learning to tackle data sparsity and scarcity, multimodal interpretability, and standardisation of datasets.
Affiliation(s)
- Sandra Steyaert
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
| | - Marija Pizurica
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
| | | | | | - Tina Hernandez-Boussard
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
- Department of Biomedical Data Science, Stanford University
| | - Andrew J Gentles
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
- Department of Biomedical Data Science, Stanford University
| | - Olivier Gevaert
- Stanford Center for Biomedical Informatics Research (BMIR), Department of Medicine, Stanford University
- Department of Biomedical Data Science, Stanford University
| |
|
23
|
Huang AA, Huang SY. Increasing transparency in machine learning through bootstrap simulation and shapely additive explanations. PLoS One 2023; 18:e0281922. [PMID: 36821544 PMCID: PMC9949629 DOI: 10.1371/journal.pone.0281922] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/23/2022] [Accepted: 02/05/2023] [Indexed: 02/24/2023] Open
Abstract
Machine learning methods are widely used within the medical field. However, the reliability and efficacy of these models are difficult to assess, making it difficult for researchers to identify which machine-learning model to apply to their dataset. We assessed whether variance calculations of model metrics (e.g., AUROC, Sensitivity, Specificity) through bootstrap simulation and SHapely Additive exPlanations (SHAP) could increase model transparency and improve model selection. Data from the England National Health Services Heart Disease Prediction Cohort were used. After comparison of model metrics for XGBoost, Random Forest, Artificial Neural Network, and Adaptive Boosting, XGBoost was used as the machine-learning model of choice in this study. Bootstrap simulation (N = 10,000) was used to empirically derive the distribution of model metrics and covariate Gain statistics. SHapely Additive exPlanations (SHAP) were used to provide explanations of machine-learning output, and simulation was used to evaluate the variance of model accuracy metrics. For the XGBoost modeling method, we observed (through 10,000 completed simulations) that the AUROC ranged from 0.771 to 0.947, a difference of 0.176, the balanced accuracy ranged from 0.688 to 0.894, a 0.205 difference, the sensitivity ranged from 0.632 to 0.939, a 0.307 difference, and the specificity ranged from 0.595 to 0.944, a 0.394 difference. Among 10,000 simulations completed, we observed that the gain for Angina ranged from 0.225 to 0.456, a difference of 0.231, for Cholesterol ranged from 0.148 to 0.326, a difference of 0.178, for maximum heart rate (MaxHR) ranged from 0.081 to 0.200, a difference of 0.119, and for Age ranged from 0.059 to 0.157, a difference of 0.098. The use of simulations to empirically evaluate the variability of model metrics, and of explanatory algorithms to check whether covariates match the literature, is necessary for increased transparency, reliability, and utility of machine learning methods. These variance statistics, combined with model accuracy statistics, can help researchers identify the best model for a given dataset.
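The idea of using bootstrap simulation to derive an empirical distribution for a model metric can be sketched as follows. This is a minimal illustration on synthetic scores, not the authors' pipeline: it resamples only a fixed set of predictions, whereas the study may also have refit the model in each simulation. The names `auroc` and `bootstrap_auroc`, the seed, and the simulated data are all assumptions made for the example.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=1000, seed=0):
    """Return a sorted list of AUROC values from n_boot resamples."""
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        # Resample indices with replacement, keeping label/score pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one of the classes
            values.append(auroc(ys, [scores[i] for i in idx]))
    values.sort()
    return values


# Synthetic data (assumption): positives tend to score higher than negatives.
data_rng = random.Random(42)
labels = [1] * 40 + [0] * 60
scores = ([data_rng.gauss(1.0, 1.0) for _ in range(40)]
          + [data_rng.gauss(0.0, 1.0) for _ in range(60)])

dist = bootstrap_auroc(labels, scores, n_boot=500)
print(f"min={dist[0]:.3f} max={dist[-1]:.3f} range={dist[-1] - dist[0]:.3f}")
```

The printed minimum, maximum, and range play the same role as the ranges quoted in the abstract: they show how much a single point estimate of the metric could move under resampling.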
Affiliation(s)
- Alexander A. Huang
- Department of Statistics and Data Science, Cornell University, Ithaca, New York, United States of America
- Department of MD Education, Northwestern University Feinberg School of Medicine, Chicago, Illinois, United States of America
| | - Samuel Y. Huang
- Department of Statistics and Data Science, Cornell University, Ithaca, New York, United States of America
- Department of Internal Medicine, Virginia Commonwealth University School of Medicine, Richmond, Virginia, United States of America
| |
|
24
|
Kokkotis C, Giarmatzis G, Giannakou E, Moustakidis S, Tsatalas T, Tsiptsios D, Vadikolias K, Aggelousis N. An Explainable Machine Learning Pipeline for Stroke Prediction on Imbalanced Data. Diagnostics (Basel) 2022; 12:2392. [PMID: 36292081 PMCID: PMC9600473 DOI: 10.3390/diagnostics12102392] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Received: 08/02/2022] [Revised: 09/26/2022] [Accepted: 09/27/2022] [Indexed: 11/16/2022] Open
Abstract
Stroke is an acute neurological dysfunction attributed to a focal injury of the central nervous system due to reduced blood flow to the brain. Nowadays, stroke is a global threat associated with premature death and huge economic consequences. Hence, there is an urgency to model the effect of several risk factors on stroke occurrence, and artificial intelligence (AI) seems to be the appropriate tool. In the present study, we aimed to (i) develop reliable machine learning (ML) prediction models for stroke disease; (ii) cope with a typical severe class imbalance problem, which is posed due to the stroke patients' class being significantly smaller than the healthy class; and (iii) interpret the model output for understanding the decision-making mechanism. The effectiveness of the proposed ML approach was investigated in a comparative analysis with six well-known classifiers with respect to metrics that are related to both generalization capability and prediction accuracy. The best overall false-negative rate was achieved by the Multi-Layer Perceptron (MLP) classifier (18.60%). Shapley Additive Explanations (SHAP) were employed to investigate the impact of the risk factors on the prediction output. The proposed AI method could lead to the creation of advanced and effective risk stratification strategies for each stroke patient, which would allow for timely diagnosis and the right treatments.
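To illustrate why the false-negative rate, rather than plain accuracy, is a natural headline metric under severe class imbalance, the sketch below scores a trivial classifier that always predicts the majority class. The 5%/95% class split and the function names are made-up assumptions for illustration, not figures or code from the study.

```python
def false_negative_rate(y_true, y_pred, positive=1):
    """FNR = FN / (FN + TP): the share of true positives that were missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    if fn + tp == 0:
        raise ValueError("no positive samples in y_true")
    return fn / (fn + tp)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical imbalanced cohort: 5 stroke cases (1) among 100 people.
y_true = [1] * 5 + [0] * 95
# A degenerate model that always predicts the majority "healthy" class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
fnr = false_negative_rate(y_true, y_pred)
print(f"accuracy={acc:.2f} false_negative_rate={fnr:.2f}")
```

The degenerate model looks strong on accuracy while missing every positive case, which is the failure mode that imbalance-aware training and FNR-based evaluation are meant to expose.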
Affiliation(s)
- Christos Kokkotis
- Department of Physical Education and Sport Science, Democritus University of Thrace, 69100 Komotini, Greece
| | - Georgios Giarmatzis
- Department of Physical Education and Sport Science, Democritus University of Thrace, 69100 Komotini, Greece
| | - Erasmia Giannakou
- Department of Physical Education and Sport Science, Democritus University of Thrace, 69100 Komotini, Greece
| | | | - Themistoklis Tsatalas
- Department of Physical Education and Sport Science, University of Thessaly, 38221 Trikala, Greece
| | - Dimitrios Tsiptsios
- Department of Neurology, School of Medicine, University Hospital of Alexandroupolis, Democritus University of Thrace, 68100 Alexandroupolis, Greece
| | - Konstantinos Vadikolias
- Department of Neurology, School of Medicine, University Hospital of Alexandroupolis, Democritus University of Thrace, 68100 Alexandroupolis, Greece
| | - Nikolaos Aggelousis
- Department of Physical Education and Sport Science, Democritus University of Thrace, 69100 Komotini, Greece
| |
|