1
Garg A, Ramamurthi N, Das SS. Addressing Imbalanced Classification Problems in Drug Discovery and Development Using Random Forest, Support Vector Machine, AutoGluon-Tabular, and H2O AutoML. J Chem Inf Model 2025; 65:3976-3989. [PMID: 40230275 DOI: 10.1021/acs.jcim.5c00023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2025]
Abstract
The classification models built on class imbalanced data sets tend to prioritize the accuracy of the majority class, and thus, the minority class generally has a higher misclassification rate. Different techniques are available to address the class imbalance in classification models and can be categorized as data-level, algorithm-level, and hybrid methods. But to the best of our knowledge, an in-depth analysis of the performance of these techniques against the class ratio is not available in the literature. We have addressed these shortcomings in this study and have performed a detailed analysis of the performance of four different techniques to address imbalanced class distribution using machine learning (ML) methods and AutoML tools. To carry out our study, we have selected four such techniques: (a) threshold optimization using (i) GHOST and (ii) the area under the precision-recall curve (AUPR), (b) internal balancing method of AutoML and class-weight of machine learning methods, and (c) data balancing using SMOTETomek. We generated 27 data sets considering nine different class ratios (i.e., the ratio of the positive class to total samples) from three data sets that belong to the drug discovery and development field. We have employed random forest (RF) and support vector machine (SVM) as representatives of ML classifiers and AutoGluon-Tabular (version 0.6.1) and H2O AutoML (version 3.40.0.4) as representatives of AutoML tools.
The important findings of our studies are as follows: (i) there is no effect of threshold optimization on ranking metrics such as AUC and AUPR, but AUC and AUPR get affected by class-weighting and SMOTETomek; (ii) for ML methods RF and SVM, significant percentage improvement up to 375, 33.33, and 450 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy, which are suitable for performance evaluation of imbalanced data sets; (iii) for AutoML libraries AutoGluon-Tabular and H2O AutoML, significant percentage improvement up to 383.33, 37.25, and 533.33 over all the data sets can be achieved, respectively, for F1 score, MCC, and balanced accuracy; (iv) the general pattern of percentage improvement in balanced accuracy is that the percentage improvement increases when the class ratio is systematically decreased from 0.5 to 0.1; in the case of F1 score and MCC, maximum improvement is achieved at the class ratio of 0.3; (v) for both ML and AutoML with balancing, it is observed that any individual class-balancing technique does not outperform all other methods on a significantly higher number of data sets based on F1 score; (vi) the three external balancing techniques combined outperformed the internal balancing methods of the ML and AutoML; (vii) AutoML tools perform as well as the ML models and in some cases perform even better for handling imbalanced classification when applied with imbalance handling techniques. In summary, exploration of multiple data balancing techniques is recommended for classifying imbalanced data sets to achieve optimal performance, as neither the external nor the internal techniques significantly outperform the others. The results are specific to the ML methods and AutoML libraries used in this study, and for generalization, a study can be carried out considering a sizable number of ML methods and AutoML libraries.
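The threshold-optimization idea in the abstract above can be illustrated with a minimal sketch. It is not the GHOST procedure or the authors' code: it is a plain grid scan over decision thresholds on toy scores, and the function names, the grid, and the data are all assumptions made for illustration. It also defines the three metrics the abstract names (F1 score, MCC, balanced accuracy) from confusion-matrix counts.

```python
import math

def confusion(y_true, y_score, threshold):
    """Return (tp, fp, tn, fn) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, y_score):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def f1(tp, fp, tn, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def balanced_accuracy(tp, fp, tn, fn):
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2

def mcc(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(y_true, y_score, metric, grid=None):
    """Pick the grid threshold that maximizes `metric` (first one wins on ties)."""
    grid = grid or [i / 100 for i in range(5, 100, 5)]
    return max(grid, key=lambda t: metric(*confusion(y_true, y_score, t)))

# Hypothetical toy data: 2 positives out of 10 samples (class ratio 0.2).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_score = [0.45, 0.30, 0.35, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.02]

default_f1 = f1(*confusion(y_true, y_score, 0.5))   # default 0.5 cut-off
t_best = best_threshold(y_true, y_score, f1)        # threshold tuned on F1
tuned_f1 = f1(*confusion(y_true, y_score, t_best))
print(default_f1, t_best, tuned_f1)
```

Because only the cut-off moves and the scores themselves are unchanged, a scan like this cannot alter ranking metrics such as AUC or AUPR, which is consistent with finding (i). In practice the threshold would be chosen on training or validation data rather than on the test set.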
Affiliation(s)
- Ayush Garg
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Noida 201303, India
- Narayanan Ramamurthi
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Chennai 600113, India
- Shyam Sundar Das
- TCS Research (Life Sciences Division), Tata Consultancy Services Limited, Kolkata 700160, India
2
Al-Ahmari S, Nadeem F. Improving Surgical Site Infection Prediction Using Machine Learning: Addressing Challenges of Highly Imbalanced Data. Diagnostics (Basel) 2025; 15:501. [PMID: 40002652 PMCID: PMC11854898 DOI: 10.3390/diagnostics15040501] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2025] [Revised: 02/13/2025] [Accepted: 02/17/2025] [Indexed: 02/27/2025] [Open Access]
Abstract
Background: Surgical site infections (SSIs) lead to higher hospital readmission rates and healthcare costs, representing a significant global healthcare burden. Machine learning (ML) has demonstrated potential in predicting SSIs; however, the challenge of addressing imbalanced class ratios remains. Objectives: The aim of this study is to evaluate and enhance the predictive capabilities of machine learning models for SSIs by assessing the effects of feature selection, resampling techniques, and hyperparameter optimization. Methods: Using routine SSI surveillance data from multiple hospitals in Saudi Arabia, we analyzed a dataset of 64,793 surgical patients, of whom 1632 developed SSI. Seven machine learning algorithms were created and tested: Decision Tree (DT), Gaussian Naive Bayes (GNB), Support Vector Machine (SVM), Logistic Regression (LR), Random Forest (RF), Stochastic Gradient Boosting (SGB), and K-Nearest Neighbors (KNN). We also improved several resampling strategies, such as undersampling and oversampling. Grid search five-fold cross-validation was employed for comprehensive hyperparameter optimization, in conjunction with balanced sampling techniques. Features were selected using a filter method based on their relationships with the target variable. Results: Our findings revealed that RF achieves the highest performance, with an MCC of 0.72. The synthetic minority oversampling technique (SMOTE) is the best-performing resampling technique, consistently enhancing the performance of most machine learning models, except for LR and GNB. LR struggles with class imbalance due to its linear assumptions and bias toward the majority class, while GNB's reliance on feature independence and Gaussian distribution makes it unreliable for under-represented minority classes. For computational efficiency, the Instance Hardness Threshold (IHT) offers a viable alternative undersampling technique, though it may compromise performance to some extent.
Conclusions: This study underscores the potential of ML models as effective tools for assessing SSI risk, warranting further clinical exploration to improve patient outcomes. By employing advanced ML techniques and robust validation methods, these models demonstrate promising accuracy and reliability in predicting SSI events, even in the face of significant class imbalances. In addition, using MCC in this study ensures a more reliable and robust evaluation of the model's predictive performance, particularly in the presence of an imbalanced dataset, where other metrics may fail to provide an accurate evaluation.
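The point about MCC being more informative than accuracy under imbalance can be checked with the class counts reported in the abstract (64,793 patients, 1632 with SSI). The sketch below scores a hypothetical classifier that predicts "no SSI" for everyone; this baseline and the helper names are assumptions for illustration and are not part of the study.

```python
import math

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

n_total = 64793      # surgical patients (from the abstract)
n_positive = 1632    # patients who developed SSI (from the abstract)

# Hypothetical baseline: every patient is predicted negative.
tp, fp = 0, 0
fn = n_positive
tn = n_total - n_positive

baseline_accuracy = accuracy(tp, fp, tn, fn)
baseline_mcc = mcc(tp, fp, tn, fn)
print(round(baseline_accuracy, 4), baseline_mcc)
```

The baseline reaches an accuracy of about 0.975 while detecting no infections at all, whereas its MCC is 0, which is why a reported MCC of 0.72 says more about minority-class performance than a high accuracy would.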
Affiliation(s)
- Salha Al-Ahmari
- Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Department of Computer and Information Systems, Applied College, King Khalid University, Abha 61421, Saudi Arabia
- Farrukh Nadeem
- Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
3
Siddique F, Lee BK. Predicting adolescent psychopathology from early life factors: A machine learning tutorial. Global Epidemiology 2024; 8:100161. [PMID: 39279846 PMCID: PMC11402309 DOI: 10.1016/j.gloepi.2024.100161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Revised: 07/10/2024] [Accepted: 08/27/2024] [Indexed: 09/18/2024] [Open Access]
Abstract
Objective: The successful implementation and interpretation of machine learning (ML) models in epidemiological studies can be challenging without an extensive programming background. We provide a didactic example of machine learning for risk prediction in this study by determining whether early life factors could be useful for predicting adolescent psychopathology. Methods: In total, 9643 adolescents aged 9-10 from the Adolescent Brain and Cognitive Development (ABCD) Study were included in ML analysis to predict high Child Behavior Checklist (CBCL) scores (i.e., t-scores ≥ 60). ML models were constructed using a series of predictor combinations (prenatal, family history, sociodemographic) across 5 different algorithms. We assessed ML performance through sensitivity, specificity, F1-score, and area under the curve (AUC) metrics. Results: A total of 1267 adolescents (13.1%) were found to have high CBCL scores. The best performing algorithms were elastic net and gradient boosted trees. The best performing elastic net models included prenatal and family history factors (Sensitivity 0.654, Specificity 0.713; AUC 0.742, F1-score 0.401) and prenatal, family history, and sociodemographic factors (Sensitivity 0.668, Specificity 0.704; AUC 0.745, F1-score 0.402). Across all 5 ML algorithms, family history factors (e.g., either parent had nervous breakdowns, trouble holding jobs/fights/police encounters, and counseling for mental issues) and sociodemographic covariates (e.g., maternal age, child's sex, caregiver income and caregiver education) tended to be better predictors of adolescent psychopathology. The most important prenatal predictors were unplanned pregnancy, birth complications, and pregnancy complications. Conclusion: Our results suggest that inclusion of prenatal, family history, and sociodemographic factors in ML models can generate moderately accurate predictions of adolescent psychopathology.
Issues associated with model overfitting, hyperparameter tuning, and system seed setting should be considered throughout model training, testing, and validation. Future early risk predictions models may improve with the inclusion of additional relevant covariates.
Affiliation(s)
- Faizaan Siddique
- Department of Epidemiology and Biostatistics, School of Public Health, Drexel University, Philadelphia, PA, United States of America
- Conestoga High School, Berwyn, PA, United States of America
- Brian K Lee
- Department of Epidemiology and Biostatistics, School of Public Health, Drexel University, Philadelphia, PA, United States of America
- Department of Global Public Health, Karolinska Institutet, Stockholm, Sweden
4
Ostojic D, Lalousis PA, Donohoe G, Morris DW. The challenges of using machine learning models in psychiatric research and clinical practice. Eur Neuropsychopharmacol 2024; 88:53-65. [PMID: 39232341 DOI: 10.1016/j.euroneuro.2024.08.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Revised: 08/06/2024] [Accepted: 08/12/2024] [Indexed: 09/06/2024]
Abstract
To understand the complex nature of heterogeneous psychiatric disorders, scientists and clinicians are required to employ a wide range of clinical, endophenotypic, neuroimaging, genomic, and environmental data to understand the biological mechanisms of psychiatric illness before this knowledge is applied in clinical settings. Machine learning (ML) is an automated process that can detect patterns from large multidimensional datasets and can supersede conventional statistical methods as it can detect both linear and non-linear relationships. Due to this advantage, ML has the potential to enhance our understanding and to improve the diagnosis, prognosis and treatment of psychiatric disorders. The current review provides an in-depth examination of, and offers practical guidance for, the challenges encountered in the application of ML models in psychiatric research and clinical practice. These challenges include the curse of dimensionality, data quality, the 'black box' problem, hyperparameter tuning, external validation, class imbalance, and data representativeness. These challenges are particularly critical in the context of psychiatry as it is expected that researchers will encounter them during the stages of ML model development and deployment. We detail practical solutions and best practices to effectively mitigate the outlined challenges. These recommendations have the potential to improve reliability and interpretability of ML models in psychiatry.
Affiliation(s)
- Dijana Ostojic
- School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland
- Paris Alexandros Lalousis
- Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, United Kingdom; Section for Precision Psychiatry, Department of Psychiatry and Psychotherapy, Ludwig-Maximilian-University Munich, Munich, Germany
- Gary Donohoe
- School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland
- Derek W Morris
- School of Biological and Chemical Sciences and School of Psychology, Centre for Neuroimaging, Cognition and Genomics (NICOG), University of Galway, Ireland.
5
Hou YF, Zhang L, Zhang Q, Ge F, Dral PO. Physics-Informed Active Learning for Accelerating Quantum Chemical Simulations. J Chem Theory Comput 2024. [PMID: 39264419 DOI: 10.1021/acs.jctc.4c00821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/13/2024]
Abstract
Quantum chemical simulations can be greatly accelerated by constructing machine learning potentials, which is often done using active learning (AL). The usefulness of the constructed potentials is often limited by the high effort required and their insufficient robustness in the simulations. Here, we introduce the end-to-end AL for constructing robust data-efficient potentials with affordable investment of time and resources and minimum human interference. Our AL protocol is based on the physics-informed sampling of training points, automatic selection of initial data, uncertainty quantification, and convergence monitoring. The versatility of this protocol is shown in our implementation of quasi-classical molecular dynamics for simulating vibrational spectra, conformer search of a key biochemical molecule, and time-resolved mechanism of the Diels-Alder reaction. These investigations took us days instead of weeks of pure quantum chemical calculations on a high-performance computing cluster.
Affiliation(s)
- Yi-Fan Hou
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China
- Lina Zhang
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China
- Quanhao Zhang
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China
- Fuchun Ge
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China
- Pavlo O Dral
- State Key Laboratory of Physical Chemistry of Solid Surfaces, College of Chemistry and Chemical Engineering, Fujian Provincial Key Laboratory of Theoretical and Computational Chemistry, and Innovation Laboratory for Sciences and Technologies of Energy Materials of Fujian Province (IKKEM), Xiamen University, Xiamen, Fujian 361005, China
- Institute of Physics, Faculty of Physics, Astronomy, and Informatics, Nicolaus Copernicus University in Toruń, ul. Grudziądzka 5, Toruń 87-100, Poland
6
Kuo PF, Hsu WT, Lord D, Putra IGB. Classification of autonomous vehicle crash severity: Solving the problems of imbalanced datasets and small sample size. Accident Analysis and Prevention 2024; 205:107666. [PMID: 38901160 DOI: 10.1016/j.aap.2024.107666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 05/21/2024] [Accepted: 06/03/2024] [Indexed: 06/22/2024]
Abstract
Only a few researchers have shown how environmental factors and road features relate to Autonomous Vehicle (AV) crash severity levels, and none have focused on the data limitation problems, such as small sample sizes, imbalanced datasets, and high dimensional features. To address these problems, we analyzed an AV crash dataset (2019 to 2021) from the California Department of Motor Vehicles (CA DMV), which included 266 collision reports (51 of those causing injuries). We included external environmental variables by collecting various points of interest (POIs) and roadway features from OpenStreetMap (OSM) and Data San Francisco (SF). Random Over-Sampling Examples (ROSE) and the Synthetic Minority Over-Sampling Technique (SMOTE) methods were used to balance the dataset and increase the sample size. These two balancing methods were used to expand the dataset and solve the small sample size problem simultaneously. Mutual information, random forest, and XGBoost were utilized to address the high dimensional feature and the selection problem caused by including a variety of types of POIs as predictive variables. Because existing studies do not use consistent procedures, we compared the effectiveness of using the feature-selection preprocessing method as the first process to employing the data-balance technique as the first process. Our results showed that AV crash severity levels are related to vehicle manufacturers, vehicle damage level, collision type, vehicle movement, the parties involved in the crash, speed limit, and some types of POIs (areas near transportation, entertainment venues, public places, schools, and medical facilities). Both resampling methods and three data preprocessing methods improved model performance, and the model that used SMOTE and data-balancing first was the best. The results suggest that over-sampling and the feature selection method can improve model prediction performance and define new factors related to AV crash severity levels.
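To make the over-sampling step concrete, here is a minimal sketch of SMOTE-style interpolation: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the study's pipeline or a library implementation; the function name `smote_like`, the choice `k=3`, and the toy two-feature points are assumptions. Only the class counts (266 reports, 51 with injuries) come from the abstract.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x), by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

n_total = 266    # collision reports (from the abstract)
n_injury = 51    # reports involving injuries (from the abstract)
n_needed = (n_total - n_injury) - n_injury  # synthetic samples for a 1:1 balance

# Hypothetical minority-class feature vectors, two features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
synthetic = smote_like(minority, n_needed)
print(n_needed, len(synthetic))
```

With these counts, 164 synthetic injury cases would be needed to match the 215 non-injury reports, which shows how balancing also enlarges a small dataset.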
Affiliation(s)
- Pei-Fen Kuo
- Department of Geomatics, National Cheng Kung University, Taiwan.
- Wei-Ting Hsu
- Department of Geomatics, National Cheng Kung University, Taiwan
- Dominique Lord
- Zachry Department of Civil and Environmental Engineering, Texas A&M University, USA
7
Bronstein MV, Kummerfeld E, MacDonald A, Vinogradov S. Identifying psychological predictors of SARS-CoV-2 vaccination: A machine learning study. Vaccine 2024; 42:126198. [PMID: 39106578 DOI: 10.1016/j.vaccine.2024.126198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 07/29/2024] [Accepted: 07/29/2024] [Indexed: 08/09/2024]
Abstract
BACKGROUND: Major barriers to addressing SARS-CoV-2 vaccine hesitancy include limited knowledge of what causes delay/refusal of SARS-CoV-2 vaccination and limited ability to predict who will remain unvaccinated over significant time periods despite vaccine availability. The present study begins to address these barriers by developing a machine learning model that prospectively predicts who will persist in not vaccinating against SARS-CoV-2. METHOD: Unvaccinated individuals (n = 325) who completed a baseline survey were followed over the six-month period when vaccines against SARS-CoV-2 were first widely available (April-October 2021). A random forest model was used to predict who would remain unvaccinated against SARS-CoV-2 from their baseline measures, including demographic information (e.g., age), medical history (e.g., of influenza vaccination), Health-Belief Model constructs (e.g., perceived vaccine dangerousness), conspiracist ideation, and task-based metrics of vulnerability to conspiracist ideation (e.g., tendency toward illusory pattern perception). RESULTS: The resulting model significantly predicted vaccination status (AUC-PR = 0.77, 95% CI [0.56, 0.90]). At the optimal probability threshold determined by the Generalized Threshold Shifting Protocol, the model was moderately precise (0.83) when identifying individuals who remained unvaccinated (n = 80), and had a very low rate (0.04) of false-positives (incorrectly suggesting that individuals remained unvaccinated). Permutational importance tests suggested that baseline SARS-CoV-2 vaccine intentions conveyed the most information about future SARS-CoV-2 vaccination status. Conspiracist ideation was the second most informative predictor, suggesting that misinformation influences vaccination behavior. Other important predictors included perceived vaccine dangerousness, as expected under the Health Belief Model, and influenza vaccination history.
CONCLUSIONS: The model we developed can accurately and prospectively identify individuals who remain unvaccinated against SARS-CoV-2. It could therefore facilitate empirically-informed allocation of interventions that encourage vaccine uptake. The predictive value of conspiracist ideation, perceived vaccine dangerousness, and vaccine intentions in our model is consistent with potential causal relations between these variables and SARS-CoV-2 vaccine uptake.
Affiliation(s)
- Michael V Bronstein
- Department of Psychiatry and Behavioral Sciences, University of Minnesota, MN, USA; Institute for Health Informatics, University of Minnesota, MN, USA.
- Erich Kummerfeld
- Institute for Health Informatics, University of Minnesota, MN, USA
- Sophia Vinogradov
- Department of Psychiatry and Behavioral Sciences, University of Minnesota, MN, USA
8
Wang HE, Weiner JP, Saria S, Lehmann H, Kharrazi H. Assessing racial bias in healthcare predictive models: Practical lessons from an empirical evaluation of 30-day hospital readmission models. J Biomed Inform 2024; 156:104683. [PMID: 38925281 DOI: 10.1016/j.jbi.2024.104683] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/13/2023] [Revised: 05/20/2024] [Accepted: 06/23/2024] [Indexed: 06/28/2024]
Abstract
OBJECTIVE: Despite increased availability of methodologies to identify algorithmic bias, the operationalization of bias evaluation for healthcare predictive models is still limited. Therefore, this study proposes a process for bias evaluation through an empirical assessment of common hospital readmission models. The process includes selecting bias measures, interpretation, determining disparity impact and potential mitigations. METHODS: This retrospective analysis evaluated racial bias of four common models predicting 30-day unplanned readmission (i.e., LACE Index, HOSPITAL Score, and the CMS readmission measure applied as is and retrained). The models were assessed using 2.4 million adult inpatient discharges in Maryland from 2016 to 2019. Fairness metrics that are model-agnostic, easy to compute, and interpretable were implemented and appraised to select the most appropriate bias measures. The impact of changing the models' risk thresholds on these measures was further assessed to guide the selection of optimal thresholds to control and mitigate bias. RESULTS: Four bias measures were selected for the predictive task: zero-one-loss difference, false negative rate (FNR) parity, false positive rate (FPR) parity, and generalized entropy index. Based on these measures, the HOSPITAL score and the retrained CMS measure demonstrated the lowest racial bias. White patients showed a higher FNR while Black patients resulted in a higher FPR and zero-one-loss. As the models' risk threshold changed, trade-offs between models' fairness and overall performance were observed, and the assessment showed all models' default thresholds were reasonable for balancing accuracy and bias. CONCLUSIONS: This study proposes an Applied Framework to Assess Fairness of Predictive Models (AFAFPM) and demonstrates the process using 30-day hospital readmission model as the example.
It suggests the feasibility of applying algorithmic bias assessment to determine optimized risk thresholds so that predictive models can be used more equitably and accurately. It is evident that a combination of qualitative and quantitative methods and a multidisciplinary team are necessary to identify, understand and respond to algorithm bias in real-world healthcare settings. Users should also apply multiple bias measures to ensure a more comprehensive, tailored, and balanced view. The results of bias measures, however, must be interpreted with caution and consider the larger operational, clinical, and policy context.
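Three of the bias measures named in the abstract (FNR parity, FPR parity, zero-one-loss difference) reduce to per-group error rates and their between-group differences. The sketch below is one possible way to compute them and is not the AFAFPM implementation: the helper names, the group labels "A" and "B", and the toy predictions are all assumptions.

```python
def error_rates(y_true, y_pred):
    """Return (FNR, FPR, zero-one loss) for binary labels and predictions."""
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    pos = sum(y_true)
    neg = len(y_true) - pos
    fnr = fn / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    loss = (fn + fp) / len(y_true)
    return fnr, fpr, loss

def rates_by_group(y_true, y_pred, group):
    """Compute the error rates separately for each group label."""
    out = {}
    for g in sorted(set(group)):
        idx = [i for i, gi in enumerate(group) if gi == g]
        out[g] = error_rates([y_true[i] for i in idx], [y_pred[i] for i in idx])
    return out

# Hypothetical toy data: two groups of four patients each.
group = ["A", "A", "A", "A", "B", "B", "B", "B"]
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

by_group = rates_by_group(y_true, y_pred, group)
fnr_gap = by_group["A"][0] - by_group["B"][0]   # FNR parity difference
fpr_gap = by_group["A"][1] - by_group["B"][1]   # FPR parity difference
loss_gap = by_group["A"][2] - by_group["B"][2]  # zero-one-loss difference
print(by_group, fnr_gap, fpr_gap, loss_gap)
```

In this toy case the two groups have the same zero-one loss but opposite FNR and FPR gaps, which illustrates why the abstract recommends applying several bias measures rather than one. Repeating the computation at different risk thresholds would trace the fairness versus performance trade-off the authors describe.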
Affiliation(s)
- H Echo Wang
- Department of Health Policy and Management, Johns Hopkins School of Public Health, Baltimore, MD, USA.
- Jonathan P Weiner
- Department of Health Policy and Management, Johns Hopkins School of Public Health, Baltimore, MD, USA; Center for Population Health Information Technology, Johns Hopkins School of Public Health, Baltimore, MD, USA.
- Suchi Saria
- Department of Computer Science and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Harold Lehmann
- Biomedical Informatics and Data Science, Division of General Internal Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
- Hadi Kharrazi
- Department of Health Policy and Management, Johns Hopkins School of Public Health, Baltimore, MD, USA; Center for Population Health Information Technology, Johns Hopkins School of Public Health, Baltimore, MD, USA; Biomedical Informatics and Data Science, Division of General Internal Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA.
9
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
- LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
- R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
- Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
- Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
- Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
10
Liu J, Luo J, Chen X, Xie J, Wang C, Wang H, Yuan Q, Li S, Zhang Y, Hu J, Shi C. Opioid Nonadherence Risk Prediction of Patients with Cancer-Related Pain Based on Five Machine Learning Algorithms. Pain Res Manag 2024; 2024:7347876. [PMID: 38872993 PMCID: PMC11175844 DOI: 10.1155/2024/7347876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2023] [Revised: 04/03/2024] [Accepted: 05/02/2024] [Indexed: 06/15/2024]
Abstract
Objectives: Opioid nonadherence represents a significant barrier to cancer pain treatment efficacy. However, there is currently no effective prediction method for opioid adherence in patients with cancer pain. We aimed to develop and validate a machine learning (ML) model and evaluate its feasibility to predict opioid nonadherence in patients with cancer pain. Methods: This was a secondary analysis from a cross-sectional study that included 1195 patients from March 1, 2018, to October 31, 2019. Five ML algorithms, namely logistic regression (LR), random forest, eXtreme Gradient Boosting, multilayer perceptron, and support vector machine, were used to predict opioid nonadherence in patients with cancer pain using 43 demographic and clinical factors as predictors. The predictive effects of the models were compared by the area under the receiver operating characteristic curve (AUC_ROC), accuracy, precision, sensitivity, specificity, and F1 scores. The value of the best model for clinical application was assessed using decision curve analysis (DCA). Results: The best model obtained in this study, the LR model, had an AUC_ROC of 0.82, accuracy of 0.82, and specificity of 0.71. The DCA showed that clinical interventions for patients at high risk of opioid nonadherence based on the LR model can benefit patients. The strongest predictors for adherence were, in order of importance, beliefs about medicines questionnaire (BMQ)-harm, time since the start of opioid, and BMQ-necessity. Discussion: ML algorithms can be used as an effective means of predicting adherence to opioids in patients with cancer pain, which allows for proactive clinical intervention to optimize cancer pain management. This trial is registered with ChiCTR2000033576.
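Decision curve analysis, mentioned in the abstract above, compares strategies by their net benefit at a chosen threshold probability p_t, using the standard definition NB = TP/n − (FP/n) · p_t/(1 − p_t). The sketch below applies that formula to toy data; the risk values, the outcomes, and the function names are assumptions for illustration and do not come from the study.

```python
def net_benefit(y_true, risk, p_t):
    """Net benefit of intervening on everyone whose predicted risk is >= p_t."""
    n = len(y_true)
    tp = sum(1 for y, r in zip(y_true, risk) if r >= p_t and y == 1)
    fp = sum(1 for y, r in zip(y_true, risk) if r >= p_t and y == 0)
    return tp / n - (fp / n) * (p_t / (1 - p_t))

def net_benefit_treat_all(y_true, p_t):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * (p_t / (1 - p_t))

# Hypothetical toy data: 3 nonadherent patients out of 10, with model risk scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
risk = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1, 0.05]

p_t = 0.5
nb_model = net_benefit(y_true, risk, p_t)
nb_all = net_benefit_treat_all(y_true, p_t)
nb_none = 0.0  # intervening on nobody has zero net benefit by definition
print(nb_model, nb_all, nb_none)
```

A model is considered useful at a given p_t when its net benefit exceeds both the treat-all and treat-none strategies; a full decision curve repeats this calculation over a range of threshold probabilities.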
Collapse
Affiliation(s)
- Jinmei Liu
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Juan Luo
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Xu Chen
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Jiyi Xie
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Cong Wang
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Hanxiang Wang
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Qi Yuan
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Shijun Li
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Yu Zhang
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| | - Jianli Hu
- Cancer Center, Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430022, China
| | - Chen Shi
- Department of Pharmacy, Union Hospital, Tongji Medical College, Huazhong University of Science & Technology (HUST), Wuhan, China
- Hubei Province Clinical Research Center for Precision Medicine for Critical Illness, Wuhan 430022, China
| |
|
11
|
De Abreu Ferreira R, Zhong S, Moureaud C, Le MT, Rothstein A, Li X, Wang L, Patwardhan M. A Pilot, Predictive Surveillance Model in Pharmacovigilance Using Machine Learning Approaches. Adv Ther 2024; 41:2435-2445. [PMID: 38704799 PMCID: PMC11133112 DOI: 10.1007/s12325-024-02870-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Accepted: 04/04/2024] [Indexed: 05/07/2024]
Abstract
INTRODUCTION The identification of a new adverse event (AE) caused by a drug product is one of the key activities in the pharmaceutical industry to ensure the safety profile of a drug product. Machine learning (ML) has the potential to assist with signal detection and supplement traditional pharmacovigilance (PV) surveillance methods. This pilot ML modeling study was designed to detect potential safety signals for two AbbVie products and test the model's capability of detecting safety signals earlier than humans. METHODS Drug X, a mature product with post-marketing data, and Drug Y, a recently approved drug in another therapeutic area, were selected. Gradient boosting-based ML approaches (e.g., XGBoost) were applied as the main modeling strategy. RESULTS For Drug X, eight true signals were present in the test set. Among 12 potential new signals generated, four were true signals with a 50.0% sensitivity rate and a 33.3% positive predictive value (PPV) rate. Among the remaining eight potential new signals, one was confirmed as a signal and detected six months earlier than humans. For Drug Y, nine true signals were present in the test set. Among 13 potential new signals generated, five were true signals with a 55.6% sensitivity rate and a 38.5% PPV rate. Among the remaining eight potential new signals, none were confirmed as true signals upon human review. CONCLUSION This model demonstrated acceptable accuracy for safety signal detection and potential for earlier detection when compared to humans. Expert judgment, flexibility, and critical thinking are essential human skills required for the final, accurate assessment of adverse event cases.
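The sensitivity and PPV figures quoted in this abstract follow directly from the counts it reports. The short Python check below is an illustrative sketch only: the function and variable names are assumptions introduced here, and the counts are copied from the abstract (true signals in the test set, potential new signals generated, and how many of those were true signals).

```python
def sensitivity_pct(true_positives: int, total_true_signals: int) -> float:
    """Share of true signals in the test set that the model flagged, in percent."""
    return round(100.0 * true_positives / total_true_signals, 1)


def ppv_pct(true_positives: int, flagged: int) -> float:
    """Share of flagged potential signals that were true signals, in percent."""
    return round(100.0 * true_positives / flagged, 1)


# Counts as stated in the abstract; the dictionary layout is hypothetical.
drugs = {
    "Drug X": {"true_signals": 8, "flagged": 12, "true_positives": 4},
    "Drug Y": {"true_signals": 9, "flagged": 13, "true_positives": 5},
}

results = {
    name: (
        sensitivity_pct(c["true_positives"], c["true_signals"]),
        ppv_pct(c["true_positives"], c["flagged"]),
    )
    for name, c in drugs.items()
}

for name, (sens, ppv) in results.items():
    print(f"{name}: sensitivity {sens}%, PPV {ppv}%")
```

With these counts the check reproduces the reported values: 50.0% and 33.3% for Drug X, 55.6% and 38.5% for Drug Y.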
Affiliation(s)
- Rosa De Abreu Ferreira
- Medical Safety Evaluation, Pharmacovigilance and Patient Safety, Epidemiology, and Research and Development Quality Assurance, AbbVie, Inc., North Chicago, IL, USA
| | - Sheng Zhong
- Statistical Sciences and Analytics, Data and Statistical Sciences, AbbVie, Inc., North Chicago, IL, USA
| | - Charlotte Moureaud
- Safety Data Sciences, Pharmacovigilance and Patient Safety, Epidemiology, and Research and Development Quality Assurance, AbbVie, Inc., North Chicago, IL, USA.
| | - Michelle T Le
- Medication Safety Fellow, Purdue University College of Pharmacy, West Lafayette, IN, USA
| | - Adrienne Rothstein
- Medical Safety Evaluation, Pharmacovigilance and Patient Safety, Epidemiology, and Research and Development Quality Assurance, AbbVie, Inc., North Chicago, IL, USA
| | - Xiaomeng Li
- Statistical Sciences and Analytics, Data and Statistical Sciences, AbbVie, Inc., North Chicago, IL, USA
| | - Li Wang
- Statistical Sciences and Analytics, Data and Statistical Sciences, AbbVie, Inc., North Chicago, IL, USA
| | - Meenal Patwardhan
- Medical Safety Evaluation, Pharmacovigilance and Patient Safety, Epidemiology, and Research and Development Quality Assurance, AbbVie, Inc., North Chicago, IL, USA
- Safety Data Sciences, Pharmacovigilance and Patient Safety, Epidemiology, and Research and Development Quality Assurance, AbbVie, Inc., North Chicago, IL, USA
| |
|
12
|
Mirzaee Moghaddam Kasmaee A, Ataei A, Moravvej SV, Alizadehsani R, Gorriz JM, Zhang YD, Tan RS, Acharya UR. ELRL-MD: a deep learning approach for myocarditis diagnosis using cardiac magnetic resonance images with ensemble and reinforcement learning integration. Physiol Meas 2024; 45:055011. [PMID: 38697206 DOI: 10.1088/1361-6579/ad46e2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/06/2023] [Accepted: 05/02/2024] [Indexed: 05/04/2024]
Abstract
Objective. Myocarditis poses a significant health risk, often precipitated by viral infections like coronavirus disease, and can lead to fatal cardiac complications. As a less invasive alternative to the standard diagnostic practice of endomyocardial biopsy, which is highly invasive and thus limited to severe cases, cardiac magnetic resonance (CMR) imaging offers a promising solution for detecting myocardial abnormalities. Approach. This study introduces a deep model called ELRL-MD that combines ensemble learning and reinforcement learning (RL) for effective myocarditis diagnosis from CMR images. The model begins with pre-training via the artificial bee colony (ABC) algorithm to enhance the starting point for learning. An array of convolutional neural networks (CNNs) then works in concert to extract and integrate features from CMR images for accurate diagnosis. Leveraging the Z-Alizadeh Sani myocarditis CMR dataset, the model employs RL to navigate the dataset's imbalance by conceptualizing diagnosis as a decision-making process. Main results. ELRL-MD demonstrates remarkable efficacy, surpassing other deep learning, conventional machine learning, and transfer learning models, achieving an F-measure of 88.2% and a geometric mean of 90.6%. Extensive experimentation helped pinpoint the optimal reward function settings and the optimal number of CNNs. Significance. The study addresses the primary technical challenge of inherent data imbalance in CMR imaging datasets and the risk of models converging on local optima due to suboptimal initial weight settings. Further analysis, leaving out ABC and RL components, confirmed their contributions to the model's overall performance, underscoring the effectiveness of addressing these critical technical challenges.
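For readers unfamiliar with the two imbalance-aware metrics reported here, the sketch below shows how an F-measure and a geometric mean are commonly computed from a binary confusion matrix. It assumes the usual definitions (F-measure as the harmonic mean of precision and recall; geometric mean as the square root of sensitivity times specificity), which the abstract does not spell out. The confusion-matrix counts are hypothetical and are not taken from the study.

```python
import math


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def geometric_mean(tp: int, fp: int, fn: int, tn: int) -> float:
    """Square root of sensitivity times specificity (assumed G-mean definition)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical counts for an imbalanced test set (100 positives, 400 negatives).
tp, fn, fp, tn = 80, 20, 10, 390

f1 = f_measure(tp, fp, fn)
g_mean = geometric_mean(tp, fp, fn, tn)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print(f"F-measure: {f1:.3f}, G-mean: {g_mean:.3f}, accuracy: {accuracy:.3f}")
```

In this toy case plain accuracy is higher than either metric because the majority class dominates it, which is why F-measure and G-mean are often preferred on imbalanced data.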
Affiliation(s)
| | - Alireza Ataei
- Department of Mathematics, Faculty of Intelligent Systems Engineering and Data Science, Persian Gulf University, Bushehr 7516913817, Iran
| | - Seyed Vahid Moravvej
- Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan, Iran
| | - Roohallah Alizadehsani
- Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Waurn Ponds, Australia
| | - Juan M Gorriz
- Data Science and Computational Intelligence Institute, University of Granada, Granada, Spain
| | - Yu-Dong Zhang
- Department of Informatics, University of Leicester, Leicester, United Kingdom
| | - Ru-San Tan
- Duke-NUS Medical School, Singapore, Singapore
| | - U Rajendra Acharya
- School of Mathematics, Physics and Computing, University of Southern Queensland, Springfield, Australia
| |
|
13
|
Almazroi AA, Ayub N. Enhancing aspect-based multi-labeling with ensemble learning for ethical logistics. PLoS One 2024; 19:e0295248. [PMID: 38771789 PMCID: PMC11108219 DOI: 10.1371/journal.pone.0295248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2023] [Accepted: 11/20/2023] [Indexed: 05/23/2024] Open
Abstract
In the dynamic domain of logistics, effective communication is essential for streamlined operations. Our innovative solution, the Multi-Labeling Ensemble (MLEn), tackles the intricate task of extracting multi-labeled data, employing advanced techniques for accurate preprocessing of textual data through the NLTK toolkit. This approach is carefully tailored to the prevailing language used in logistics communication. MLEn utilizes innovative methods, including sentiment intensity analysis, Word2Vec, and Doc2Vec, ensuring comprehensive feature extraction. This proves particularly suitable for logistics in e-commerce, capturing nuanced communication essential for efficient operations. Ethical considerations are a cornerstone in logistics communication, and MLEn plays a pivotal role in detecting and categorizing inappropriate language, aligning inherently with ethical norms. Leveraging TF-IDF and VADER for feature enhancement, MLEn adeptly discerns and labels ethically sensitive content in logistics communication. Across diverse datasets, including Emotions, MLEn consistently achieves impressive accuracy levels ranging from 92% to 97%, establishing its superiority in the logistics context. Particularly, our proposed method, DenseNet-EHO, outperforms BERT by 8% and surpasses other techniques by 15-25% in efficiency. A comprehensive analysis, considering metrics such as precision, recall, F1-score, Ranking Loss, Jaccard Similarity, AUC-ROC, sensitivity, and time complexity, underscores DenseNet-EHO's efficiency, aligning with the practical demands within the logistics track. Our research significantly contributes to enhancing precision, diversity, and computational efficiency in aspect-based sentiment analysis within logistics.
By integrating cutting-edge preprocessing, sentiment intensity analysis, and vectorization, MLEn emerges as a robust framework for multi-label datasets, consistently outperforming conventional approaches and giving outstanding precision, accuracy, and efficiency in the logistics field.
Affiliation(s)
- Abdulwahab Ali Almazroi
- Department of Information Technology, College of Computing and Information Technology at Khulais, University of Jeddah, Jeddah, Saudi Arabia
| | - Nasir Ayub
- Department of Creative Technologies, Air University Islamabad, Islamabad, Pakistan
| |
|
14
|
Ansari M, White AD. Learning peptide properties with positive examples only. DIGITAL DISCOVERY 2024; 3:977-986. [PMID: 38756224 PMCID: PMC11094695 DOI: 10.1039/d3dd00218g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/05/2023] [Accepted: 03/30/2024] [Indexed: 05/18/2024]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
| | - Andrew D White
- Department of Chemical Engineering, University of Rochester, Rochester, NY 14627, USA
| |
|
15
|
Miley K, Bronstein MV, Ma S, Lee H, Green MF, Ventura J, Hooker CI, Nahum M, Vinogradov S. Trajectories and predictors of response to social cognition training in people with schizophrenia: A proof-of-concept machine learning study. Schizophr Res 2024; 266:92-99. [PMID: 38387253 PMCID: PMC11005939 DOI: 10.1016/j.schres.2024.02.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/17/2023] [Revised: 12/15/2023] [Accepted: 02/17/2024] [Indexed: 02/24/2024]
Abstract
BACKGROUND Social cognition training (SCT) can improve social cognition deficits in schizophrenia. However, little is known about patterns of response to SCT or individual characteristics that predict response. METHODS 76 adults with schizophrenia randomized to receive 8-12 weeks of remotely-delivered SCT were included in this analysis. Social cognition was measured with a composite of six assessments. Latent class growth analyses identified trajectories of social cognitive response to SCT. Random forest and logistic regression models were trained to predict membership in the trajectory group that showed improvement from baseline measures including symptoms, functioning, motivation, and cognition. RESULTS Five trajectory groups were identified: Group 1 (29 %) began with slightly above average social cognition, and this ability significantly improved with SCT. Group 2 (9 %) had baseline social cognition approximately one standard deviation above the sample mean and did not improve with training. Groups 3 (18 %) and 4 (36 %) began with average to slightly below-average social cognition and showed non-significant trends toward improvement. Group 5 (8 %) began with social cognition approximately one standard deviation below the sample mean, and experienced significant deterioration in social cognition. The random forest model had the best performance, predicting Group 1 membership with an area under the curve of 0.73 (SD 0.24; 95 % CI [0.51-0.87]). CONCLUSIONS Findings suggest that there are distinct patterns of response to SCT in schizophrenia and that those with slightly above average social cognition at baseline may be most likely to experience gains. Results may inform future research seeking to individualize SCT treatment for schizophrenia.
Affiliation(s)
- Kathleen Miley
- HealthPartners Institute, Minneapolis, MN, USA; Department of Psychiatry and Behavioral Sciences, University of Minnesota Medical School, MN, USA.
| | - Michael V Bronstein
- Department of Psychiatry and Behavioral Sciences, University of Minnesota Medical School, MN, USA
| | - Sisi Ma
- Institute for Health Informatics, University of Minnesota, MN, USA
| | - Hyunkyu Lee
- Department of Research and Development, Posit Science Inc., San Francisco, CA, USA
| | - Michael F Green
- VA Greater Los Angeles, Los Angeles, CA, USA; Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA, USA
| | - Joseph Ventura
- Department of Psychiatry and Biobehavioral Sciences, University of California, Los Angeles, CA, USA
| | - Christine I Hooker
- Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, Chicago, IL, USA
| | - Mor Nahum
- School of Occupational Therapy, Hebrew University of Jerusalem, Israel
| | - Sophia Vinogradov
- Department of Psychiatry and Behavioral Sciences, University of Minnesota Medical School, MN, USA
| |
|
16
|
Zhou Q, Ye W, Yu X, Bao YJ. A pathway-based computational framework for identification of a new modal of multi-omics biomarkers and its application in esophageal cancer. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2024; 247:108077. [PMID: 38382307 DOI: 10.1016/j.cmpb.2024.108077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/29/2023] [Revised: 01/14/2024] [Accepted: 02/10/2024] [Indexed: 02/23/2024]
Abstract
BACKGROUND The pathway-based strategy has been recently proposed for identifying biomarkers with the advantages of higher biological interpretability and cross-data robustness than the conventional gene-based strategy. However, its utility in clinical applications has been limited due to the high computational complexity and ill-defined performance. OBJECTIVE The current study presents a machine learning-based computational framework using multi-omics data for identifying a new modal of biomarkers, called pathway-derived core biomarkers, which have the advantages of both gene-based and pathway-based biomarkers. METHODS Machine-learning methods and gene-pathway network were integrated to select the pathway-derived core biomarkers. Multiple machine-learning algorithms were used to construct and validate the diagnostic models of the biomarkers based on more than 1400 multi-omics clinical samples of esophageal squamous cell carcinoma (ESCC). RESULTS The results showed that the classifier models based on the new modal biomarkers achieved superior performance in the training datasets with an average AUC/accuracy of 0.98/0.95 and 0.89/0.81 for mRNAs and miRNA, respectively, higher than the currently known classifier models based on the conventional gene-based strategy and pathway-based strategy. In the testing cohorts, the AUC/accuracy increased by 6.1 %/7.3 % compared with the models based on the native gene-based biomarkers. The improved performance was further confirmed in independent validation cohorts. Specifically, the sensitivity/specificity increased by ∼3 % and the variance significantly decreased by ∼69 % compared with that of the native gene-based biomarkers.
Importantly, the pathway-derived core biomarkers also recovered 45 % more previously reported biomarkers than the gene-based biomarkers and are more functionally relevant to the ESCC etiology (involved in 14 versus 7 pathways related with ESCC or other cancer), highlighting the cross-data robustness of this new modal of biomarkers via enhanced functional relevance. CONCLUSIONS The results demonstrated that the new modal of biomarkers not only have improved predicting performance and robustness, but also exhibit higher functional interpretability thus leading to the potential application in cancer diagnosis.
Affiliation(s)
- Qi Zhou
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, China
| | - Weicai Ye
- School of Computer Science and Engineering, Guangdong Province Key Laboratory of Computational Science, and National Engineering Laboratory for Big Data Analysis and Application, Sun Yat-sen University, Guangzhou, China
| | - Xiaolan Yu
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, China; Hubei Jiangxia Laboratory, Wuhan, China
| | - Yun-Juan Bao
- State Key Laboratory of Biocatalysis and Enzyme Engineering, School of Life Sciences, Hubei University, Wuhan, China.
| |
|
17
|
Hang NT, Long NT, Duy ND, Chien NN, Van Phuong N. Towards safer and efficient formulations: Machine learning approaches to predict drug-excipient compatibility. Int J Pharm 2024; 653:123884. [PMID: 38341049 DOI: 10.1016/j.ijpharm.2024.123884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2023] [Revised: 01/28/2024] [Accepted: 02/03/2024] [Indexed: 02/12/2024]
Abstract
Predicting drug-excipient compatibility is a critical aspect of pharmaceutical formulation design. In this study, we introduced an innovative approach that leverages machine learning techniques to improve the accuracy of drug-excipient compatibility predictions. Mol2vec and 2D molecular descriptors combined with the stacking technique were used to improve the performance of the model. This approach achieved a significant advancement in the predictive capacity as demonstrated by the accuracy, precision, recall, AUC, and MCC of 0.98, 0.87, 0.88, 0.93 and 0.86, respectively. Using the DE-INTERACT model as the benchmark, our stacking model could remarkably detect drug-excipient incompatibility in 10/12 tested cases, while DE-INTERACT managed to recognize only 3 out of 12 incompatibility cases in the validation experiments. To ensure user accessibility, the trained model was deployed to a user-friendly web platform (URL: https://decompatibility.streamlit.app/). This interactive interface accommodated inputs through various types, including names, PubChem CID, or SMILES strings. It promptly generated compatibility predictions alongside corresponding probability scores. However, the continual refinement of model performance is crucial before applying this model in practice.
Affiliation(s)
- Nguyen Thu Hang
- Department of Pharmacognosy, Hanoi University of Pharmacy, Hanoi, Viet Nam
| | - Nguyen Thanh Long
- Department of Pharmacognosy, Hanoi University of Pharmacy, Hanoi, Viet Nam
| | - Nguyen Dang Duy
- Department of Pharmacognosy, Hanoi University of Pharmacy, Hanoi, Viet Nam
| | - Nguyen Ngoc Chien
- National Institute of Pharmaceutical Technology, Hanoi University of Pharmacy, Hanoi, Viet Nam
| | - Nguyen Van Phuong
- Department of Pharmacognosy, Hanoi University of Pharmacy, Hanoi, Viet Nam.
| |
|
18
|
Akinola LK, Uzairu A, Shallangwa GA, Abechi SE, Umar AB. Identification of estrogen receptor agonists among hydroxylated polychlorinated biphenyls using classification-based quantitative structure-activity relationship models. Curr Res Toxicol 2024; 6:100158. [PMID: 38435023 PMCID: PMC10907392 DOI: 10.1016/j.crtox.2024.100158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2023] [Revised: 02/22/2024] [Accepted: 02/22/2024] [Indexed: 03/05/2024] Open
Abstract
Identification of estrogen receptor (ER) agonists among environmental toxicants is essential for assessing the potential impact of toxicants on human health. Using 2D autocorrelation descriptors as predictor variables, two binary logistic regression models were developed to identify active ER agonists among hydroxylated polychlorinated biphenyls (OH-PCBs). The classifications made by the two models on the training set compounds resulted in accuracy, sensitivity and specificity of 95.9 %, 93.9 % and 97.6 % for the ERα dataset and 91.9 %, 90.9 % and 92.7 % for the ERβ dataset. The areas under the ROC curves, constructed with the training set data, were found to be 0.985 and 0.987 for the two models. Predictions made by models I and II correctly classified 84.0 % and 88.0 % of the test set compounds and 89.8 % and 85.8 % of the cross-validation set compounds, respectively. The two classification-based QSAR models proposed in this paper are considered robust and reliable for rapid identification of ERα and ERβ agonists among OH-PCB congeners.
Affiliation(s)
- Lukman K. Akinola
- Department of Chemistry, Ahmadu Bello University, Zaria, Nigeria
- Department of Chemistry, Bauchi State University, Gadau, Nigeria
| | - Adamu Uzairu
- Department of Chemistry, Ahmadu Bello University, Zaria, Nigeria
| | | | | | | |
|
19
|
Lee JH, Shin J, Min JH, Jeong WK, Kim H, Choi SY, Lee J, Hong S, Kim K. Preoperative prediction of early recurrence in resectable pancreatic cancer integrating clinical, radiologic, and CT radiomics features. Cancer Imaging 2024; 24:6. [PMID: 38191489 PMCID: PMC10775464 DOI: 10.1186/s40644-024-00653-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 12/29/2023] [Indexed: 01/10/2024] Open
Abstract
OBJECTIVES To use clinical, radiographic, and CT radiomics features to develop and validate a preoperative prediction model for the early recurrence of pancreatic cancer. METHODS We retrospectively analyzed 190 patients (150 and 40 in the development and test cohort from different centers) with pancreatic cancer who underwent pancreatectomy between January 2018 and June 2021. Radiomics, clinical-radiologic (CR), and clinical-radiologic-radiomics (CRR) models were developed for the prediction of recurrence within 12 months after surgery. Performance was evaluated using the area under the curve (AUC), Brier score, sensitivity, and specificity. RESULTS Early recurrence occurred in 36.7% and 42.5% of the development and test cohorts, respectively (P = 0.62). The features for the CR model included carbohydrate antigen 19-9 > 500 U/mL (odds ratio [OR], 3.60; P = 0.01), abutment to the portal and/or superior mesenteric vein (OR, 2.54; P = 0.054), and adjacent organ invasion (OR, 2.91; P = 0.03). The CRR model demonstrated significantly higher AUCs than the radiomics model in the internal (0.77 vs. 0.73; P = 0.048) and external (0.83 vs. 0.69; P = 0.038) validations. Although we found no significant difference between AUCs of the CR and CRR models (0.83 vs. 0.76; P = 0.17), CRR models showed more balanced sensitivity and specificity (0.65 and 0.87) than CR model (0.41 and 0.91) in the test cohort. CONCLUSIONS The CRR model outperformed the radiomics and CR models in predicting the early recurrence of pancreatic cancer, providing valuable information for risk stratification and treatment guidance.
Affiliation(s)
- Jeong Hyun Lee
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea
| | - Jaeseung Shin
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea
| | - Ji Hye Min
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea.
| | - Woo Kyoung Jeong
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea
| | - Honsoul Kim
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea
| | - Seo-Youn Choi
- Department of Radiology and Center for Imaging Science, Samsung Medical Center, Sungkyunkwan University School of Medicine, 81 Irwon-ro Gangnam-gu, Seoul, 06351, Republic of Korea
- Department of Radiology, Soonchunhyang University Bucheon Hospital, Soonchunhyang University College of Medicine, Bucheon, Republic of Korea
| | - Jisun Lee
- Department of Radiology, College of Medicine, Chungbuk National University, Chungbuk National University Hospital, Cheongju, Republic of Korea
| | - Sungjun Hong
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
| | - Kyunga Kim
- Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul, Republic of Korea
- Biomedical Statistics Center, Research Institute for Future Medicine, Samsung Medical Center, Seoul, Republic of Korea
| |
|
20
|
van Heerden A, Turon G, Duran-Frigola M, Pillay N, Birkholtz LM. Machine Learning Approaches Identify Chemical Features for Stage-Specific Antimalarial Compounds. ACS OMEGA 2023; 8:43813-43826. [PMID: 38027377 PMCID: PMC10666252 DOI: 10.1021/acsomega.3c05664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/02/2023] [Revised: 10/18/2023] [Accepted: 10/20/2023] [Indexed: 12/01/2023]
Abstract
Efficacy data from diverse chemical libraries, screened against the various stages of the malaria parasite Plasmodium falciparum, including asexual blood stage (ABS) parasites and transmissible gametocytes, serve as a valuable reservoir of information on the chemical space of compounds that are either active (or not) against the parasite. We postulated that this data can be mined to define chemical features associated with the sole ABS activity and/or those that provide additional life cycle activity profiles like gametocytocidal activity. Additionally, this information could provide chemical features associated with inactive compounds, which could eliminate any future unnecessary screening of similar chemical analogs. Therefore, we aimed to use machine learning to identify the chemical space associated with stage-specific antimalarial activity. We collected data from various chemical libraries that were screened against the asexual (126 374 compounds) and sexual (gametocyte) stages of the parasite (93 941 compounds), calculated the compounds' molecular fingerprints, and trained machine learning models to recognize stage-specific active and inactive compounds. We were able to build several models that predict compound activity against ABS and dual activity against ABS and gametocytes, with Support Vector Machines (SVM) showing superior abilities with high recall (90 and 66%) and low false-positive predictions (15 and 1%). This allowed the identification of chemical features enriched in active and inactive populations, an important outcome that could be mined for essential chemical features to streamline hit-to-lead optimization strategies of antimalarial candidates. The predictive capabilities of the models held true in diverse chemical spaces, indicating that the ML models are therefore robust and can serve as a prioritization tool to drive and guide phenotypic screening and medicinal chemistry programs.
Affiliation(s)
- Ashleigh van Heerden
- Department of Biochemistry, Genetics and Microbiology, Institute for Sustainable Malaria Control, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa
| | - Gemma Turon
- Ersilia Open Source Initiative, 28 Belgrave Road, Cambridge CB1 3DE, U.K.
| | | | - Nelishia Pillay
- Department of Computer Science, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa
| | - Lyn-Marié Birkholtz
- Department of Biochemistry, Genetics and Microbiology, Institute for Sustainable Malaria Control, University of Pretoria, Private Bag X20, Hatfield 0028, South Africa
| |
|
21
|
Handa K, Thomas MC, Kageyama M, Iijima T, Bender A. On the difficulty of validating molecular generative models realistically: a case study on public and proprietary data. J Cheminform 2023; 15:112. [PMID: 37990215 PMCID: PMC10664602 DOI: 10.1186/s13321-023-00781-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/06/2023] [Accepted: 11/10/2023] [Indexed: 11/23/2023]
Abstract
While a multitude of deep generative models have recently emerged there exists no best practice for their practically relevant validation. On the one hand, novel de novo-generated molecules cannot be refuted by retrospective validation (so that this type of validation is biased); but on the other hand prospective validation is expensive and then often biased by the human selection process. In this case study, we frame retrospective validation as the ability to mimic human drug design, by answering the following question: Can a generative model trained on early-stage project compounds generate middle/late-stage compounds de novo? To this end, we used experimental data that contains the elapsed time of a synthetic expansion following hit identification from five public (where the time series was pre-processed to better reflect realistic synthetic expansions) and six in-house project datasets, and used REINVENT as a widely adopted RNN-based generative model. After splitting the dataset and training REINVENT on early-stage compounds, we found that rediscovery of middle/late-stage compounds was much higher in public projects (at 1.60%, 0.64%, and 0.21% of the top 100, 500, and 5000 scored generated compounds) than in in-house projects (where the values were 0.00%, 0.03%, and 0.04%, respectively). Similarly, average single nearest neighbour similarity between early- and middle/late-stage compounds in public projects was higher between active compounds than inactive compounds; however, for in-house projects the converse was true, which makes rediscovery (if so desired) more difficult. We hence show that the generative model recovers very few middle/late-stage compounds from real-world drug discovery projects, highlighting the fundamental difference between purely algorithmic design and drug discovery as a real-world process. 
Evaluating de novo compound design approaches appears, based on the current study, difficult or even impossible to do retrospectively. Scientific Contribution: This contribution hence illustrates aspects of evaluating the performance of generative models in a real-world setting which have not been extensively described previously and which hopefully contribute to their further future development.
Affiliation(s)
- Koichi Handa
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
- Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan.
| | - Morgan C Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK
| | - Michiharu Kageyama
- Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan
| | - Takeshi Iijima
- Toxicology & DMPK Research Department, Teijin Institute for Bio-Medical Research, Teijin Pharma Limited, 4-3-2 Asahigaoka, Hino-Shi, Tokyo, 191-8512, Japan
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge, CB2 1EW, UK.
| |
|
22
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 41] [Impact Index Per Article: 20.5] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| |
|
23
|
Ma K, He S, Sinha G, Ebadi A, Florea A, Tremblay S, Wong A, Xi P. Towards Building a Trustworthy Deep Learning Framework for Medical Image Analysis. Sensors (Basel) 2023; 23:8122. [PMID: 37836952 PMCID: PMC10574977 DOI: 10.3390/s23198122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Revised: 08/05/2023] [Accepted: 08/30/2023] [Indexed: 10/15/2023]
Abstract
Computer vision and deep learning have the potential to improve medical artificial intelligence (AI) by assisting in diagnosis, prediction, and prognosis. However, the application of deep learning to medical image analysis is challenging due to limited data availability and imbalanced data. While model performance is undoubtedly essential for medical image analysis, model trust is equally important. To address these challenges, we propose TRUDLMIA, a trustworthy deep learning framework for medical image analysis, which leverages image features learned through self-supervised learning and utilizes a novel surrogate loss function to build trustworthy models with optimal performance. The framework is validated on three benchmark data sets for detecting pneumonia, COVID-19, and melanoma, and the created models prove to be highly competitive, even outperforming those designed specifically for the tasks. Furthermore, we conduct ablation studies, cross-validation, and result visualization and demonstrate the contribution of proposed modules to both model performance (up to 21%) and model trust (up to 5%). We expect that the proposed framework will support researchers and clinicians in advancing the use of deep learning for dealing with public health crises, improving patient outcomes, increasing diagnostic accuracy, and enhancing the overall quality of healthcare delivery.
Affiliation(s)
- Kai Ma
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
| | - Siyuan He
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| | - Grant Sinha
- Faculty of Mathematics, University of Waterloo, Waterloo, ON N2L 3G1, Canada
| | - Ashkan Ebadi
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| | - Adrian Florea
- Department of Emergency Medicine, McGill University, Montreal, QC H4A 3J1, Canada
| | - Stéphane Tremblay
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| | - Alexander Wong
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
| | - Pengcheng Xi
- Faculty of Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON K1A 0R6, Canada
| |
|
24
|
Ma C, Wolfinger R. A prediction model for blood-brain barrier penetrating peptides based on masked peptide transformers with dynamic routing. Brief Bioinform 2023; 24:bbad399. [PMID: 37985456 DOI: 10.1093/bib/bbad399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/20/2023] [Revised: 09/26/2023] [Accepted: 10/17/2023] [Indexed: 11/22/2023]
Abstract
Blood-brain barrier penetrating peptides (BBBPs) are short peptide sequences that possess the ability to traverse the selective blood-brain interface, making them valuable drug candidates or carriers for various payloads. However, the in vivo or in vitro validation of BBBPs is resource-intensive and time-consuming, driving the need for accurate in silico prediction methods. Unfortunately, the scarcity of experimentally validated BBBPs hinders the efficacy of current machine-learning approaches in generating reliable predictions. In this paper, we present DeepB3P3, a novel framework for BBBPs prediction. Our contribution encompasses four key aspects. Firstly, we propose a novel deep learning model consisting of a transformer encoder layer, a convolutional network backbone, and a capsule network classification head. This integrated architecture effectively learns representative features from peptide sequences. Secondly, we introduce masked peptides as a powerful data augmentation technique to compensate for small training set sizes in BBBP prediction. Thirdly, we develop a novel threshold-tuning method to handle imbalanced data by approximating the optimal decision threshold using the training set. Lastly, DeepB3P3 provides an accurate estimation of the uncertainty level associated with each prediction. Through extensive experiments, we demonstrate that DeepB3P3 achieves state-of-the-art accuracy of up to 98.31% on a benchmarking dataset, solidifying its potential as a promising computational tool for the prediction and discovery of BBBPs.
Affiliation(s)
- Chunwei Ma
- JMP Statistical Discovery, LLC, Cary, 27513, NC, USA
- Department of Computer Science and Engineering, University at Buffalo, Buffalo, 14260, NY, USA
| | | |
|
25
|
Boldini D, Grisoni F, Kuhn D, Friedrich L, Sieber SA. Practical guidelines for the use of gradient boosting for molecular property prediction. J Cheminform 2023; 15:73. [PMID: 37641120 PMCID: PMC10464382 DOI: 10.1186/s13321-023-00743-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Accepted: 08/09/2023] [Indexed: 08/31/2023]
Abstract
Decision tree ensembles are among the most robust, high-performing and computationally efficient machine learning approaches for quantitative structure-activity relationship (QSAR) modeling. Among them, gradient boosting has recently garnered particular attention, for its performance in data science competitions, virtual screening campaigns, and bioactivity prediction. However, different variants of gradient boosting exist, the most popular being XGBoost, LightGBM and CatBoost. Our study provides the first comprehensive comparison of these approaches for QSAR. To this end, we trained 157,590 gradient boosting models, which were evaluated on 16 datasets and 94 endpoints, comprising 1.4 million compounds in total. Our results show that XGBoost generally achieves the best predictive performance, while LightGBM requires the least training time, especially for larger datasets. In terms of feature importance, the models surprisingly rank molecular features differently, reflecting differences in regularization techniques and decision tree structures. Thus, expert knowledge must always be employed when evaluating data-driven explanations of bioactivity. Furthermore, our results show that the relevance of each hyperparameter varies greatly across datasets and that it is crucial to optimize as many hyperparameters as possible to maximize the predictive performance. In conclusion, our study provides the first set of guidelines for cheminformatics practitioners to effectively train, optimize and evaluate gradient boosting models for virtual screening and QSAR applications.
Affiliation(s)
- Davide Boldini
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany
| | - Francesca Grisoni
- Department of Biomedical Engineering, Institute for Complex Molecular Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
- Centre for Living Technologies, Alliance TU/E, WUR, UU, UMC Utrecht, Utrecht, The Netherlands
| | | | | | - Stephan A Sieber
- Department of Bioscience, Center for Functional Protein Assemblies (CPA), Technical University of Munich, Garching bei Munich, Germany.
| |
|
26
|
Smajić A, Rami I, Sosnin S, Ecker GF. Identifying Differences in the Performance of Machine Learning Models for Off-Targets Trained on Publicly Available and Proprietary Data Sets. Chem Res Toxicol 2023; 36:1300-1312. [PMID: 37439496 PMCID: PMC10445286 DOI: 10.1021/acs.chemrestox.3c00042] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/13/2023] [Indexed: 07/14/2023]
Abstract
Each year, publicly available databases are updated with new compounds from different research institutions. Positive experimental outcomes are more likely to be reported; therefore, they account for a considerable fraction of these entries. Established publicly available databases such as ChEMBL allow researchers to use information without constrictions and create predictive tools for a broad spectrum of applications in the field of toxicology. Therefore, we investigated the distribution of positive and nonpositive entries within ChEMBL for a set of off-targets and its impact on the performance of classification models when applied to pharmaceutical industry data sets. Results indicate that models trained on publicly available data tend to overpredict positives, and models based on industry data sets predict negatives more often than those built using publicly available data sets. This is strengthened even further by the visualization of the prediction space for a set of 10,000 compounds, which makes it possible to identify regions in the chemical space where predictions converge. Finally, we highlight the utilization of these models for consensus modeling for potential adverse events prediction.
Affiliation(s)
- Aljoša Smajić
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Iris Rami
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Sergey Sosnin
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| | - Gerhard F. Ecker
- Department of Pharmaceutical Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
| |
|
27
|
Lanini J, Santarossa G, Sirockin F, Lewis R, Fechner N, Misztela H, Lewis S, Maziarz K, Stanley M, Segler M, Stiefl N, Schneider N. PREFER: A New Predictive Modeling Framework for Molecular Discovery. J Chem Inf Model 2023; 63:4497-4504. [PMID: 37487018 DOI: 10.1021/acs.jcim.3c00523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/26/2023]
Abstract
Machine-learning and deep-learning models have been extensively used in cheminformatics to predict molecular properties, to reduce the need for direct measurements, and to accelerate compound prioritization. However, different setups and frameworks and the large number of molecular representations make it difficult to properly evaluate, reproduce, and compare them. Here we present a new PREdictive modeling FramEwoRk for molecular discovery (PREFER), written in Python (version 3.7.7) and based on AutoSklearn (version 0.14.7), that allows comparison between different molecular representations and common machine-learning models. We provide an overview of the design of our framework and show exemplary use cases and results of several representation-model combinations on diverse data sets, both public and in-house. Finally, we discuss the use of PREFER on small data sets. The code of the framework is freely available on GitHub.
Affiliation(s)
- Jessica Lanini
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Gianluca Santarossa
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Finton Sirockin
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Richard Lewis
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nikolas Fechner
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | | | - Sarah Lewis
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K.
| | | | - Megan Stanley
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K.
| | - Marwin Segler
- Microsoft Research AI4Science, Cambridge CB1 2FB, U.K.
| | - Nikolaus Stiefl
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| | - Nadine Schneider
- Novartis Institutes for BioMedical Research, Novartis Pharma AG, Novartis Campus, 4002 Basel, Switzerland
| |
|
28
|
Ansari M, White AD. Learning Peptide Properties with Positive Examples Only. bioRxiv 2023:2023.06.01.543289. [PMID: 37333233 PMCID: PMC10274696 DOI: 10.1101/2023.06.01.543289] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/20/2023]
Abstract
Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties via positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive-negative (PN) classification approach, where there is access to both positive and negative examples.
Affiliation(s)
- Mehrad Ansari
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
| | - Andrew D. White
- Department of Chemical Engineering, University of Rochester, Rochester, NY, 14627, USA
| |
|
29
|
Wang H, Zhu G, Izu LT, Chen-Izu Y, Ono N, Altaf-Ul-Amin MD, Kanaya S, Huang M. On QSAR-based cardiotoxicity modeling with the expressiveness-enhanced graph learning model and dual-threshold scheme. Front Physiol 2023; 14:1156286. [PMID: 37228825 PMCID: PMC10203956 DOI: 10.3389/fphys.2023.1156286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2023] [Accepted: 04/05/2023] [Indexed: 05/27/2023]
Abstract
Introduction: Given the direct association with malignant ventricular arrhythmias, cardiotoxicity is a major concern in drug design. In the past decades, computational models based on the quantitative structure-activity relationship have been proposed to screen out cardiotoxic compounds and have shown promising results. The combination of molecular fingerprint and the machine learning model shows stable performance for a wide spectrum of problems; however, not long after the advent of the graph neural network (GNN) deep learning model and its variant (e.g., graph transformer), it has become the principal way of quantitative structure-activity relationship-based modeling for its high flexibility in feature extraction and decision rule generation. Despite all this progress, the expressiveness (the ability of a program to identify non-isomorphic graph structures) of the GNN model is bounded by the WL isomorphism test, and a suitable thresholding scheme that relates directly to the sensitivity and credibility of a model is still an open question. Methods: In this research, we further improved the expressiveness of the GNN model by introducing a substructure-aware bias via the graph subgraph transformer network model. Moreover, to propose the most appropriate thresholding scheme, a comprehensive comparison of the thresholding schemes was conducted. Results: Based on these improvements, the best model attains performance with 90.4% precision, 90.4% recall, and 90.5% F1-score with a dual-threshold scheme (active: < 1 μM; non-active: > 30 μM). The improved pipeline (graph subgraph transformer network model and thresholding scheme) also shows its advantages in terms of the activity cliff problem and model interpretability.
Affiliation(s)
- Huijia Wang
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Japan
| | - Guangxian Zhu
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Japan
| | - Leighton T. Izu
- Department of Pharmacology, University of California, Davis, CA, United States
| | - Ye Chen-Izu
- Department of Biomedical Engineering, University of California, Davis, CA, United States
| | - Naoaki Ono
- Data Science Center, Nara Institute of Science and Technology, Ikoma, Japan
| | - MD Altaf-Ul-Amin
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Japan
| | - Shigehiko Kanaya
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Japan
| | - Ming Huang
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma, Japan
| |
|
30
|
Almukadi H, Jadkarim GA, Mohammed A, Almansouri M, Sultana N, Shaik NA, Banaganapalli B. Combining machine learning and structure-based approaches to develop oncogene PIM kinase inhibitors. Front Chem 2023; 11:1137444. [PMID: 36970406 PMCID: PMC10036574 DOI: 10.3389/fchem.2023.1137444] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Accepted: 02/09/2023] [Indexed: 03/12/2023]
Abstract
Introduction: PIM kinases are targets for therapeutic intervention since they are associated with a number of malignancies by boosting cell survival and proliferation. Over the past years, the rate of discovery of new PIM inhibitors has increased significantly; however, a new generation of potent molecules with the right pharmacologic profiles is in demand, which could lead to the development of PIM kinase inhibitors that are effective against human cancer. Method: In the current study, machine learning and structure-based approaches were used to generate novel and effective chemical therapeutics for PIM-1 kinase. Four different machine learning methods, namely, support vector machine, random forest, k-nearest neighbour and XGBoost, were used for the development of models. In total, 54 descriptors were selected using the Boruta method. Results: SVM, random forest and XGBoost show better performance than k-NN. An ensemble approach was implemented and, finally, four potential molecules (CHEMBL303779, CHEMBL690270, MHC07198, and CHEMBL748285) were found to be effective for the modulation of PIM-1 activity. Molecular docking and molecular dynamics simulation corroborated the potential of the selected molecules. The molecular dynamics (MD) simulation study indicated the stability between protein and ligands. Discussion: Our findings suggest that the selected models are robust and can be potentially useful for facilitating the discovery of inhibitors against PIM kinase.
Affiliation(s)
- Haifa Almukadi
- Department of Pharmacology and Toxicology, Faculty of Pharmacy, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Gada Ali Jadkarim
- Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Arif Mohammed
- Department of Biology, College of Science, University of Jeddah, Jeddah, Saudi Arabia
| | - Majid Almansouri
- Department of Clinical Biochemistry, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Nasreen Sultana
- Department of Biotechnology, Acharya Nagarjuna University, Guntur, India
- *Correspondence: Noor Ahmad Shaik; Nasreen Sultana; Babajan Banaganapalli
| | - Noor Ahmad Shaik
- Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Princess Al-Jawhara Al-Brahim Center of Excellence in Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
- *Correspondence: Noor Ahmad Shaik; Nasreen Sultana; Babajan Banaganapalli
| | - Babajan Banaganapalli
- Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Princess Al-Jawhara Al-Brahim Center of Excellence in Research of Hereditary Disorders, King Abdulaziz University, Jeddah, Saudi Arabia
- *Correspondence: Noor Ahmad Shaik; Nasreen Sultana; Babajan Banaganapalli
| |
|
31
|
Behnoush AH, Khalaji A, Rezaee M, Momtahen S, Mansourian S, Bagheri J, Masoudkabir F, Hosseini K. Machine learning-based prediction of 1-year mortality in hypertensive patients undergoing coronary revascularization surgery. Clin Cardiol 2023; 46:269-278. [PMID: 36588391 PMCID: PMC10018097 DOI: 10.1002/clc.23963] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/17/2022] [Revised: 12/12/2022] [Accepted: 12/19/2022] [Indexed: 01/03/2023]
Abstract
BACKGROUND Machine learning (ML) has shown promising results in all fields of medicine, including preventive cardiology. Hypertensive patients are at higher risk of mortality after coronary artery bypass graft (CABG) surgery; thus, we aimed to design and evaluate five ML models to predict 1-year mortality among hypertensive patients who underwent CABG. HYPOTHESIS ML algorithms can significantly improve mortality prediction after CABG. METHODS Tehran Heart Center's CABG data registry was used to extract several baseline and peri-procedural characteristics and mortality data. The best features were chosen using the random forest (RF) feature selection algorithm. Five ML models were developed to predict 1-year mortality: logistic regression (LR), RF, artificial neural network (ANN), extreme gradient boosting (XGB), and naïve Bayes (NB). The area under the curve (AUC), sensitivity, and specificity were used to evaluate the models. RESULTS Among the 8,493 hypertensive patients who underwent CABG (mean age of 68.27 ± 9.27 years), 303 died in the first year. Eleven features were selected as the best predictors, among which total ventilation hours and ejection fraction were the leading ones. LR showed the best prediction ability with an AUC of 0.82, while the lowest AUC was for the NB model (0.79). Among the subgroups, the highest AUC for the LR model was for two age range groups (50-59 and 80-89 years), overweight, diabetic, and smoker subgroups of hypertensive patients. CONCLUSIONS All ML models had excellent performance in predicting 1-year mortality among hypertensive CABG patients, while LR was the best regarding AUC. These models can help clinicians assess the risk of mortality in specific subgroups at higher risk (such as hypertensive ones).
Affiliation(s)
- Amir Hossein Behnoush
  - Tehran Heart Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
  - Non‐Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Amirmohammad Khalaji
  - Tehran Heart Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - School of Medicine, Tehran University of Medical Sciences, Tehran, Iran
  - Non‐Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Malihe Rezaee
  - Tehran Heart Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Non‐Communicable Diseases Research Center, Endocrinology and Metabolism Population Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
  - School of Medicine, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Shahram Momtahen
  - Department of Surgery, Tehran Heart Center, Tehran University of Medical Sciences, Tehran, Iran
- Soheil Mansourian
  - Department of Surgery, Tehran Heart Center, Tehran University of Medical Sciences, Tehran, Iran
- Jamshid Bagheri
  - Department of Surgery, Tehran Heart Center, Tehran University of Medical Sciences, Tehran, Iran
- Farzad Masoudkabir
  - Tehran Heart Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
- Kaveh Hosseini
  - Tehran Heart Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
  - Cardiac Primary Prevention Research Center, Cardiovascular Diseases Research Institute, Tehran University of Medical Sciences, Tehran, Iran
|
32
|
Andrade KM, Silva BPM, de Oliveira LR, Cury PR. Automatic dental biofilm detection based on deep learning. J Clin Periodontol 2023; 50:571-581. [PMID: 36635042 DOI: 10.1111/jcpe.13774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2022] [Revised: 12/06/2022] [Accepted: 01/09/2023] [Indexed: 01/14/2023]
Abstract
AIM To estimate the automated biofilm detection capacity of the U-Net neural network on tooth images. MATERIALS AND METHODS Two datasets of intra-oral photographs taken in the frontal and lateral views of permanent and deciduous dentitions were employed. The first dataset consisted of 96 photographs taken before and after applying a disclosing agent and was used to validate the domain expert's biofilm annotation (intra-class correlation coefficient = .93). The second dataset comprised 480 photos, with or without orthodontic appliances, and without disclosing agents, and was used to train the neural network to segment the biofilm. Dental biofilm labelled by the dentist (without disclosing agents) was considered the ground truth. Segmentation performance was measured using accuracy, F1 score, sensitivity, and specificity. RESULTS The U-Net model achieved an accuracy of 91.8%, F1 score of 60.6%, specificity of 94.4%, and sensitivity of 67.2%. The accuracy was higher in the presence of orthodontic appliances (92.6%). CONCLUSIONS Visually segmenting dental biofilm employing a U-Net is feasible and can assist professionals and patients in identifying dental biofilm, thus improving oral hygiene and health.
Affiliation(s)
- Katia Montanha Andrade
  - Graduate Program in Dentistry and Health, School of Dentistry, Federal University of Bahia, Salvador, Brazil
- Patricia Ramos Cury
  - Division of Periodontics, School of Dentistry, Federal University of Bahia, Salvador, Brazil
|
33
|
Leventi-Peetz AM, Weber K. Probabilistic machine learning for breast cancer classification. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:624-655. [PMID: 36650782 DOI: 10.3934/mbe.2023029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
A probabilistic neural network has been implemented to predict the malignancy of breast cancer cells, based on a data set, the features of which are used for the formulation and training of a model for a binary classification problem. The focus is placed on considerations when building the model, in order to achieve not only accuracy but also a safe quantification of the expected uncertainty of the calculated network parameters and the medical prognosis. The source code is included to make the results reproducible, in accordance with the latest trend in machine learning research, known as Papers with Code. The various steps taken in developing the code are introduced in detail, and the results are visually displayed and critically analyzed, also in the sense of explainable artificial intelligence. In statistical-classification problems, the decision boundary is the region of the problem space in which the classification label of the classifier is ambiguous. Problem aspects and model parameters which influence the decision boundary are a special aspect of practical investigation considered in this work. Classification results issued by technically transparent machine learning software can inspire more confidence in their trustworthiness, which is very important, especially in the case of medical prognosis. Furthermore, transparency allows the user to adapt models and learning processes to the specific needs of a problem and has a boosting influence on the development of new methods in relevant machine learning fields (transfer learning).
Affiliation(s)
- Kai Weber
  - inducto Daten- und Informationssysteme GmbH, 84405 Dorfen, Germany
|
34
|
Cheminformatics analysis of chemicals that increase estrogen and progesterone synthesis for a breast cancer hazard assessment. Sci Rep 2022; 12:20647. [PMID: 36450809 PMCID: PMC9712655 DOI: 10.1038/s41598-022-24889-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/10/2022] [Accepted: 11/22/2022] [Indexed: 12/03/2022] Open
Abstract
Factors that increase estrogen or progesterone (P4) action are well-established as increasing breast cancer risk, and many first-line treatments to prevent breast cancer recurrence work by blocking estrogen synthesis or action. In previous work, using data from an in vitro steroidogenesis assay developed for the U.S. Environmental Protection Agency (EPA) ToxCast program, we identified 182 chemicals that increased estradiol (E2up) and 185 that increased progesterone (P4up) in human H295R adrenocortical carcinoma cells, an OECD validated assay for steroidogenesis. Chemicals known to induce mammary effects in vivo were very likely to increase E2 or P4 synthesis, further supporting the importance of these pathways for breast cancer. To identify additional chemical exposures that may increase breast cancer risk through E2 or P4 steroidogenesis, we developed a cheminformatics approach to identify structural features associated with these activities and to predict other E2 or P4 steroidogens from their chemical structures. First, we used molecular descriptors and physicochemical properties to cluster the 2,012 chemicals screened in the steroidogenesis assay using a self-organizing map (SOM). Structural features such as triazine, phenol, or more broadly benzene ramified with halide, amine or alcohol, are enriched for E2 or P4up chemicals. Among E2up chemicals, phenol and benzenone are found as significant substructures, along with nitrogen-containing biphenyls. For P4up chemicals, phenol and complex aromatic systems ramified with oxygen-based groups such as flavone or phenolphthalein are significant substructures. Chemicals that are active for both E2up and P4up are enriched with substructures such as dihydroxy phosphanedithione or are small chemicals that contain one benzene ramified with chlorine, alcohol, methyl or primary amine. These results are confirmed with a chemotype ToxPrint analysis. 
Then, we used machine learning and artificial intelligence algorithms to develop and validate predictive classification QSAR models for E2up and P4up chemicals. These models gave reasonable prediction performance on external validation (balanced accuracy ~ 0.8 and Matthews correlation coefficient ~ 0.5). The QSAR models were enriched by adding a confidence score that considers the chemical applicability domain and a ToxPrint assessment of the chemical. This profiling and these models may be useful to direct future testing and risk assessments for chemicals related to breast cancer and other hormonally-mediated outcomes.
|
35
|
Boldini D, Friedrich L, Kuhn D, Sieber SA. Tuning gradient boosting for imbalanced bioassay modelling with custom loss functions. J Cheminform 2022; 14:80. [DOI: 10.1186/s13321-022-00657-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Accepted: 10/30/2022] [Indexed: 11/12/2022] Open
Abstract
While in recent years there has been a dramatic increase in the number of available bioassay datasets, many of them suffer from an extremely imbalanced distribution between active and inactive compounds. Thus, there is an urgent need for novel approaches to tackle class imbalance in drug discovery. Inspired by recent advances in computer vision, we investigated a panel of alternative loss functions for imbalanced classification in the context of Gradient Boosting and benchmarked them on six datasets from public and proprietary sources, for a total of 42 tasks and 2 million compounds. Our findings show that with these modifications, we achieve statistically significant improvements over the conventional cross-entropy loss function on five out of six datasets. Furthermore, by employing these bespoke loss functions we are able to push Gradient Boosting to match or outperform a wide variety of previously reported classifiers and neural networks. We also investigate the impact of changing the loss function on training time and find that it increases convergence speed by up to 8 times. As such, these results show that tuning the loss function for Gradient Boosting is a straightforward and computationally efficient method to achieve state-of-the-art performance on imbalanced bioassay datasets without compromising on interpretability and scalability.
|
36
|
Thomas M, O’Boyle NM, Bender A, de Graaf C. Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. J Cheminform 2022; 14:68. [PMID: 36192789 PMCID: PMC9531503 DOI: 10.1186/s13321-022-00646-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/14/2022] [Accepted: 09/23/2022] [Indexed: 11/10/2022] Open
Abstract
A plethora of AI-based techniques now exists to conduct de novo molecule generation that can devise molecules conditioned towards a particular endpoint in the context of drug design. One popular approach is using reinforcement learning to update a recurrent neural network or language-based de novo molecule generator. However, reinforcement learning can be inefficient, sometimes requiring up to 10^5 molecules to be sampled to optimize more complex objectives, which poses a limitation when using computationally expensive scoring functions like docking or computer-aided synthesis planning models. In this work, we propose a reinforcement learning strategy called Augmented Hill-Climb based on a simple, hypothesis-driven hybrid between REINVENT and Hill-Climb that improves sample-efficiency by addressing the limitations of both currently used strategies. We compare its ability to optimize several docking tasks with REINVENT and benchmark this strategy against other commonly used reinforcement learning strategies including REINFORCE, REINVENT (version 1 and 2), Hill-Climb and best agent reminder. We find that optimization ability is improved ~ 1.5-fold and sample-efficiency is improved ~ 45-fold compared to REINVENT while still delivering appealing chemistry as output. Diversity filters were used, and their parameters were tuned to overcome observed failure modes that take advantage of certain diversity filter configurations. We find that Augmented Hill-Climb outperforms the other reinforcement learning strategies used on six tasks, especially in the early stages of training or for more difficult objectives. Lastly, we show improved performance not only on recurrent neural networks but also on a reinforcement learning stabilized transformer architecture. Overall, we show that Augmented Hill-Climb improves sample-efficiency for language-based de novo molecule generation conditioning via reinforcement learning, compared to the current state-of-the-art.
This makes more computationally expensive scoring functions, such as docking, more accessible on a relevant timescale.
Affiliation(s)
- Morgan Thomas
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
- Noel M. O’Boyle
  - Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
- Andreas Bender
  - Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
- Chris de Graaf
  - Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
|
37
|
Machine learning algorithms identify demographics, dietary features, and blood biomarkers associated with stroke records. J Neurol Sci 2022; 440:120335. [PMID: 35863116 DOI: 10.1016/j.jns.2022.120335] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 05/26/2022] [Accepted: 07/05/2022] [Indexed: 11/22/2022]
Abstract
OBJECTIVE We conducted a comprehensive evaluation of features associated with stroke records. METHODS We screened the dietary nutrients, blood biomarkers, and clinical information from the National Health and Nutrition Examination Survey (NHANES) 2015-16 database to assess a self-reported history of all strokes (136 strokes, n = 4381). We computed feature importance, built machine learning (ML) models, developed a nomogram, and validated the nomogram on NHANES 2007-08, 2017-18, and the baseline UK Biobank. We calculated the odds ratios with/without adjusting sampling weights (OR/ORw). RESULTS The clinical features have the best predictive power compared to dietary nutrients and blood biomarkers, with 22.8% increased average area under the receiver operating characteristic curves (AUROC) in ML models. We further modeled with ten most important clinical features without compromising the predictive performance. The key features positively associated with stroke include age, cigarette smoking, tobacco smoking, Caucasian or African American race, hypertension, diabetes mellitus, asthma history; the negatively associated feature is the family income. The nomogram based on these key features achieved good performances (AUROC between 0.753 and 0.822) on the test set, the NHANES 2007-08, 2017-18, and the UK Biobank. Key features from the nomogram model include age (OR = 1.05, ORw = 1.06), Caucasian/African American (OR = 2.68, ORw = 2.67), diabetes mellitus (OR = 2.30, ORw = 1.99), asthma (OR = 2.10, ORw = 2.41), hypertension (OR = 1.86, ORw = 2.10), and income (OR = 0.83, ORw = 0.81). CONCLUSIONS We identified clinical key features and built predictive models for assessing stroke records with high performance. A nomogram consisting of questionnaire-based variables would help identify stroke survivors and evaluate the potential risk of stroke.
|
38
|
Gaber A, Taher MF, Wahed MA, Shalaby NM, Gaber S. Classification of facial paralysis based on machine learning techniques. Biomed Eng Online 2022; 21:65. [PMID: 36071434 PMCID: PMC9449956 DOI: 10.1186/s12938-022-01036-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2022] [Accepted: 08/24/2022] [Indexed: 11/11/2022] Open
Abstract
Facial paralysis (FP) is an inability to move facial muscles voluntarily, affecting daily activities. There is a need for quantitative assessment and severity level classification of FP to evaluate the condition. None of the available tools are widely accepted. A comprehensive FP evaluation system has been developed by the authors. The system extracts real-time facial animation units (FAUs) using the Kinect V2 sensor and includes both FP assessment and classification. This paper describes the development and testing of the FP classification phase. A dataset of 375 records from 13 unilateral FP patients and 1650 records from 50 control subjects was compiled. Artificial Intelligence and Machine Learning methods are used to classify seven FP categories: the normal case and three severity levels: mild, moderate, and severe for the left and right sides. For better prediction results (Accuracy = 96.8%, Sensitivity = 88.9% and Specificity = 99%), an ensemble learning classifier was developed rather than one weak classifier. The ensemble approach based on SVMs was proposed for the high-dimensional data to gather the advantages of stacking and bagging. To address the problem of an imbalanced dataset, a hybrid strategy combining three separate techniques was used. Model robustness and stability was evaluated using fivefold cross-validation. The results showed that the classifier is robust, stable and performs well for different train and test samples. The study demonstrates that FAUs acquired by the Kinect sensor can be used in classifying FP. The developed FP assessment and classification system provides a detailed quantitative report and has significant advantages over existing grading scales.
Affiliation(s)
- Amira Gaber
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
- Mona F. Taher
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
- Manal Abdel Wahed
  - Systems and Biomedical Engineering Department, Faculty of Engineering, Cairo University, Giza, Egypt
- Sarah Gaber
  - Department of Neuromuscular Disorder and Its Surgery, Faculty of Physical Therapy, Cairo University, Giza, Egypt
|
39
|
Alsaui AA, Alghofaili YA, Alghadeer M, Alharbi FH. Resampling Techniques for Materials Informatics: Limitations in Crystal Point Groups Classification. J Chem Inf Model 2022; 62:3514-3523. [DOI: 10.1021/acs.jcim.2c00666] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Affiliation(s)
- Abdulmohsen A. Alsaui
  - Electrical Engineering Department, Indian Institute of Technology Madras, Chennai 600036, India
- Yousef A. Alghofaili
  - Research and Development Department, Xpedite Information Technology, Riyadh 13333, Saudi Arabia
- Mohammed Alghadeer
  - Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, United States
- Fahhad H. Alharbi
  - Electrical Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran 31261, Saudi Arabia
  - SDAIA-KFUPM Joint Research Center for Artificial Intelligence, Dhahran 31261, Saudi Arabia
|
40
|
Walter M, Allen LN, de la Vega de León A, Webb SJ, Gillet VJ. Analysis of the benefits of imputation models over traditional QSAR models for toxicity prediction. J Cheminform 2022; 14:32. [PMID: 35672779 PMCID: PMC9172131 DOI: 10.1186/s13321-022-00611-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 04/01/2022] [Accepted: 05/12/2022] [Indexed: 11/21/2022] Open
Abstract
Recently, imputation techniques have been adapted to predict activity values among sparse bioactivity matrices, showing improvements in predictive performance over traditional QSAR models. These models are able to use experimental activity values for auxiliary assays when predicting the activity of a test compound on a specific assay. In this study, we tested three different multi-task imputation techniques on three classification-based toxicity datasets: two of small scale (12 assays each) and one large scale with 417 assays. Moreover, we analyzed in detail the improvements shown by the imputation models. We found that test compounds that were dissimilar to training compounds, as well as test compounds with a large number of experimental values for other assays, showed the largest improvements. We also investigated the impact of sparsity on the improvements seen as well as the relatedness of the assays being considered. Our results show that even a small amount of additional information can provide imputation methods with a strong boost in predictive performance over traditional single task and multi-task predictive models.
|
41
|
Xu M, Yang H, Liu G, Tang Y, Li W. In Silico Prediction of Chemical Aquatic Toxicity by Multiple Machine Learning and Deep Learning Approaches. J Appl Toxicol 2022; 42:1766-1776. [PMID: 35653511 DOI: 10.1002/jat.4354] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 05/16/2022] [Accepted: 05/31/2022] [Indexed: 11/08/2022]
Abstract
Fish are among the model animals used to evaluate the adverse effects of chemicals released into the ecosystem. However, the low throughput and relatively high expense of fish testing make it impossible to test all newly manufactured chemicals. Hence, in silico models are widely used to prioritize compounds for testing in environmental risk assessment and drug discovery. In this study, we constructed local predictive models for four fish species, including bluegill sunfish, rainbow trout, fathead minnow, and sheepshead minnow, and global models with the data from all four fish. A total of 1874 unique compounds with their labels, i.e. toxic (LC50 < 10 ppm) or nontoxic, were collected from ECOTOX and the literature. Both conventional machine learning methods and a deep learning architecture, the graph convolutional network (GCN), were used to build predictive models. The classification accuracy of the best local model for each fish species was higher than 0.83. For the global models, two strategies, consistency prediction and probability threshold, were adopted to improve the predictive capability at the cost of limiting the applicability domain. For the 63% of compounds in domain, the accuracy was around 0.97. By comparing the deep learning and machine learning methods, we found that the single-task GCN showed specific advantages in performance, while the multi-task GCN showed no advantages over the conventional machine learning methods. The data and models are available on GitHub (https://github.com/ChemPredict/ChemicalAquaticToxicity).
Affiliation(s)
- Minjie Xu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
- Hongbin Yang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
- Guixia Liu
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
- Yun Tang
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
- Weihua Li
  - Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai, China
|
42
|
H Attia M, H Attia M, Tarek Farghaly Y, Ahmed El-Sayed Abulnoor B, Curate F. Performance of the supervised learning algorithms in sex estimation of the proximal femur: A comparative study in contemporary Egyptian and Turkish samples. Sci Justice 2022; 62:288-309. [PMID: 35598923 DOI: 10.1016/j.scijus.2022.03.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/26/2021] [Revised: 12/27/2021] [Accepted: 03/06/2022] [Indexed: 10/18/2022]
Abstract
Sex estimation standards are population specific; however, we argue that machine learning (ML) techniques may enhance biological sex determination in trans-population application. Linear discriminant analysis (LDA) versus nine ML including quadratic discriminant analysis (QDA), support vector machine (SVM), Decision Tree (DT), Gaussian process (GPC), Naïve Bayesian (NBC), K-Nearest Neighbor (KNN), Random Forest (RFM) and Adaptive boosting (Adaboost) were compared. The experiments involve two contemporary populations: Turkish (n = 300) and Egyptian populations (n = 100) for training and validation, respectively. Base models were calibrated using isotonic and sigmoid calibration schemes. Results were analyzed at posterior probability (pp) thresholds >0.95 and >0.80. At pp = 0.5, ML algorithms yielded comparable accuracies in the training (90% to 97%) and test sets (81% to 88%), which were not modified after employing the calibration techniques. At pp >0.95, the raw RFM, LDA, QDA, and SVM models showed the best performance; however, calibration techniques improved the performance of various classifiers, especially NBC and Adaboost. By contrast, the performance of the GPC, KNN, and QDA models worsened with calibration. RFM showed the best performance among all models at both thresholds, whereas LDA benefited the most from using both calibration methods at pp >0.80. Complex ML models do not necessarily achieve better performance metrics. LDA and QDA remain the fastest and simplest classifiers. We demonstrated the capability of enhancing sex estimation using ML on an independent population sample; however, differences in the underlying probability distribution generated by the models were detected, which warrants more cautious application by forensic practitioners.
Affiliation(s)
- MennattAllah H Attia
  - Forensic Medicine and Clinical Toxicology, Faculty of Medicine, Alexandria University, Alexandria, Egypt.
- Mohamed H Attia
  - Biomedical Engineering, Medical Research Institute, Alexandria University, Egypt; Institute for Intelligent Systems Research and Innovation, Deakin University, Australia.
- Francisco Curate
  - University of Coimbra, Research Centre for Anthropology and Health, Department of Life Sciences, Coimbra, Portugal; University of Coimbra, Laboratory of Forensic Anthropology, Department of Life Sciences, Coimbra, Portugal.
|
43
|
García-Cebollada H, López A, Sancho J. Protposer: the web server that readily proposes protein stabilizing mutations with high PPV. Comput Struct Biotechnol J 2022; 20:2415-2433. [PMID: 35664235 PMCID: PMC9133766 DOI: 10.1016/j.csbj.2022.05.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/16/2022] [Revised: 05/05/2022] [Accepted: 05/05/2022] [Indexed: 01/23/2023] Open
Abstract
Protein stability is a requisite for most biotechnological and medical applications of proteins. As natural proteins tend to suffer from a low conformational stability ex vivo, great efforts have been devoted toward increasing their stability through rational design and engineering of appropriate mutations. Unfortunately, even the best currently used predictors fail to compute the stability of protein variants with sufficient accuracy and their usefulness as tools to guide the rational stabilisation of proteins is limited. We present here Protposer, a protein stabilising tool based on a different approach. Instead of quantifying changes in stability, Protposer uses structure- and sequence-based screening modules to nominate candidate mutations for subsequent evaluation by a logistic regression model, carefully trained to avoid overfitting. Thus, Protposer analyses PDB files in search for stabilization opportunities and provides a ranked list of promising mutations with their estimated success rates (eSR), their probabilities of being stabilising by at least 0.5 kcal/mol. The agreement between eSRs and actual positive predictive values (PPV) on external datasets of mutations is excellent. When Protposer is used with its Optimal kappa selection threshold, its PPV is above 0.7. Even with less stringent thresholds, Protposer largely outperforms FoldX, Rosetta and PoPMusiC. Indicating the PDB file of the protein suffices to obtain a ranked list of mutations, their eSRs and hints on the likely source of the stabilization expected. Protposer is a distinct, straightforward and highly successful tool to design protein stabilising mutations, and it is freely available for academic use at http://webapps.bifi.es/the-protposer.
|
44
|
Deep learning model calibration for improving performance in class-imbalanced medical image classification tasks. PLoS One 2022; 17:e0262838. [PMID: 35085334 PMCID: PMC8794113 DOI: 10.1371/journal.pone.0262838] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 10/11/2021] [Accepted: 01/05/2022] [Indexed: 11/19/2022] Open
Abstract
In medical image classification tasks, it is common to find that the number of normal samples far exceeds the number of abnormal samples. In such class-imbalanced situations, reliable training of deep neural networks continues to be a major challenge, therefore biasing the predicted class probabilities toward the majority class. Calibration has been proposed to alleviate some of these effects. However, there is insufficient analysis explaining whether and when calibrating a model would be beneficial. In this study, we perform a systematic analysis of the effect of model calibration on its performance on two medical image modalities, namely, chest X-rays and fundus images, using various deep learning classifier backbones. For this, we study the following variations: (i) the degree of imbalances in the dataset used for training; (ii) calibration methods; and (iii) two classification thresholds, namely, default threshold of 0.5, and optimal threshold from precision-recall (PR) curves. Our results indicate that at the default classification threshold of 0.5, the performance achieved through calibration is significantly superior (p < 0.05) to using uncalibrated probabilities. However, at the PR-guided threshold, these gains are not significantly different (p > 0.05). This observation holds for both image modalities and at varying degrees of imbalance. The code is available at https://github.com/sivaramakrishnan-rajaraman/Model_calibration.
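The abstract contrasts the default 0.5 cut-off with an "optimal threshold from precision-recall (PR) curves" but does not say how that optimum is defined. One common choice, assumed here purely for illustration, is the threshold that maximizes the F1 score along the PR curve. The labels and scores below are hypothetical, and the helper names `f1_at` and `best_f1_threshold` are invented for this sketch, not taken from the cited repository.

```python
def f1_at(y_true, scores, threshold):
    """F1 score when samples with score >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined), so F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(y_true, scores):
    """Sweep every distinct score as a candidate threshold and keep the F1-maximizing one."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at(y_true, scores, t))


# Hypothetical imbalanced example: 3 positives among 10 samples, with scores
# skewed toward the majority (negative) class.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.60]

t_opt = best_f1_threshold(y_true, scores)
print("F1 at default 0.5:", round(f1_at(y_true, scores, 0.5), 3))
print("F1 at PR-guided threshold", t_opt, ":", round(f1_at(y_true, scores, t_opt), 3))
```

In practice the threshold would be chosen on a validation split and then applied unchanged to the test set; selecting it on the test data itself would overstate performance.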
|
45
|
Bhat HS, Reeves ME, Goldman‐Mellor S. Equity‐Weighted Bootstrapping: Examples and Analysis. Stat (Int Stat Inst) 2022. [DOI: 10.1002/sta4.456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/07/2022]
Affiliation(s)
- Harish S. Bhat
- Applied Mathematics, University of California, Merced, CA, USA
|
46
|
Jeong W, Gaggioli CA, Gagliardi L. Active Learning Configuration Interaction for Excited-State Calculations of Polycyclic Aromatic Hydrocarbons. J Chem Theory Comput 2021; 17:7518-7530. [PMID: 34787422 PMCID: PMC8675132 DOI: 10.1021/acs.jctc.1c00769] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/31/2021] [Indexed: 11/30/2022]
Abstract
We present the active learning configuration interaction (ALCI) method for multiconfigurational calculations based on large active spaces. ALCI leverages the use of an active learning procedure to find important electronic configurations among the full configurational space generated within an active space. We tested it for the calculation of singlet-singlet excited states of acenes and pyrene using different machine learning algorithms. The ALCI method yields excitation energies within 0.2-0.3 eV from those obtained by traditional complete active-space configuration interaction (CASCI) calculations (affordable for active spaces up to 16 electrons in 16 orbitals) by including only a small fraction of the CASCI configuration space in the calculations. For larger active spaces (we tested up to 26 electrons in 26 orbitals), not affordable with traditional CI methods, ALCI captures the trends of experimental excitation energies. Overall, ALCI provides satisfactory approximations to large active-space wave functions with up to 10 orders of magnitude fewer determinants for the systems presented here. These ALCI wave functions are promising and affordable starting points for the subsequent second-order perturbation theory or pair-density functional theory calculations.
Affiliation(s)
- WooSeok Jeong
- Department of Chemistry, Nanoporous Materials Genome Center, Chemical Theory Center, and Minnesota Supercomputing Institute, University of Minnesota, Minneapolis, Minnesota 55455, United States
- Carlo Alberto Gaggioli
- Department of Chemistry, Pritzker School of Molecular Engineering, James Franck Institute, Chicago Center for Theoretical Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Laura Gagliardi
- Department of Chemistry, Pritzker School of Molecular Engineering, James Franck Institute, Chicago Center for Theoretical Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Argonne National Laboratory, Lemont, Illinois 60439, United States
|
47
|
Walters WP. Comparing classification models-a practical tutorial. J Comput Aided Mol Des 2021; 36:381-389. [PMID: 34549368 DOI: 10.1007/s10822-021-00417-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 06/27/2021] [Accepted: 08/18/2021] [Indexed: 01/17/2023]
Abstract
While machine learning models have become a mainstay in cheminformatics, the field has yet to agree on standards for model evaluation and comparison. In many cases, authors compare methods by performing multiple folds of cross-validation and reporting the mean value for an evaluation metric such as the area under the receiver operating characteristic (ROC) curve. These comparisons of mean values often lack statistical rigor and can lead to inaccurate conclusions. In the interest of encouraging best practices, this tutorial provides an example of how multiple methods can be compared in a statistically rigorous fashion.
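To make the contrast with a bare comparison of means concrete, here is a minimal sketch of one paired test on per-fold scores. The abstract does not say which statistical procedures the tutorial uses, so the exact paired sign-flip permutation test below is an assumption chosen for illustration only; the function name `paired_sign_flip_pvalue` and the per-fold AUC values are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip permutation test on paired differences.

    Under the null hypothesis that both models perform equally, each
    per-fold difference is equally likely to be positive or negative,
    so every combination of sign flips is enumerated.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=n):
        mean = abs(sum(s * d for s, d in zip(signs, diffs)) / n)
        if mean >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical AUC values for two models on the same 5 cross-validation folds.
auc_a = [0.81, 0.79, 0.84, 0.80, 0.83]
auc_b = [0.78, 0.77, 0.80, 0.79, 0.80]

p_value = paired_sign_flip_pvalue(auc_a, auc_b)
print(f"mean difference = {sum(auc_a) / 5 - sum(auc_b) / 5:.3f}, p = {p_value:.4f}")
```

Model A wins on every fold here, yet with only five folds the smallest two-sided p-value this test can produce is 2/32 = 0.0625. A consistent difference in means is therefore not enough by itself to claim significance at the 0.05 level.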
|
48
|
Amendola G, Cosconati S. PyRMD: A New Fully Automated AI-Powered Ligand-Based Virtual Screening Tool. J Chem Inf Model 2021; 61:3835-3845. [PMID: 34270903 DOI: 10.1021/acs.jcim.1c00653] [Citation(s) in RCA: 27] [Impact Index Per Article: 6.8] [Indexed: 11/28/2022]
Abstract
Artificial intelligence (AI) algorithms are dramatically redefining the current drug discovery landscape by boosting the efficiency of its various steps. Still, their implementation often requires a certain level of expertise in AI paradigms and coding. This often prevents the use of these powerful methodologies by non-expert users involved in the design of new biologically active compounds. Here, the random matrix discriminant (RMD) algorithm, a high-performance AI method specifically tailored for the identification of new ligands, was implemented in a new fully automated tool, PyRMD. This ligand-based virtual screening tool can be trained using target bioactivity data directly downloaded from the ChEMBL repository without manual intervention. The software automatically splits the available training compounds into active and inactive sets and learns the distinctive chemical features responsible for the compounds' activity/inactivity. PyRMD was designed to easily screen millions of compounds in hours through an automated workflow and intuitive input files, allowing fine tuning of each parameter of the calculation. Additionally, PyRMD features a wealth of benchmark metrics, to accurately probe the model performance, which were used here to gauge its predictive potential and limitations. PyRMD is freely available on GitHub (https://github.com/cosconatilab/PyRMD) as an open-source tool.
Affiliation(s)
- Giorgio Amendola
- DiSTABiF, University of Campania Luigi Vanvitelli, Via Vivaldi 43, 81100 Caserta, Italy
- Sandro Cosconati
- DiSTABiF, University of Campania Luigi Vanvitelli, Via Vivaldi 43, 81100 Caserta, Italy
|