1. Peteani G, Huynh MTD, Gerebtzoff G, Rodríguez-Pérez R. Application of machine learning models for property prediction to targeted protein degraders. Nat Commun 2024; 15:5764. [PMID: 38982061 PMCID: PMC11233499 DOI: 10.1038/s41467-024-49979-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Accepted: 06/21/2024] [Indexed: 07/11/2024]
Abstract
Machine learning (ML) systems can model quantitative structure-property relationships (QSPR) using existing experimental data and make property predictions for new molecules. With the advent of modalities such as targeted protein degraders (TPD), the applicability of QSPR models is questioned and ML usage in TPD-centric projects remains limited. Herein, ML models are developed and evaluated for TPDs' property predictions, including passive permeability, metabolic clearance, cytochrome P450 inhibition, plasma protein binding, and lipophilicity. Interestingly, performance on TPDs is comparable to that of other modalities. Predictions for glues and heterobifunctionals often yield lower and higher errors, respectively. For permeability, CYP3A4 inhibition, and human and rat microsomal clearance, misclassification errors into high and low risk categories are lower than 4% for glues and 15% for heterobifunctionals. For all modalities, misclassification errors range from 0.8% to 8.1%. Investigated transfer learning strategies improve predictions for heterobifunctionals. This is the first comprehensive evaluation of ML for the prediction of absorption, distribution, metabolism, and excretion (ADME) and physicochemical properties of TPD molecules, including heterobifunctional and molecular glue sub-modalities. Taken together, our investigations show that ML-based QSPR models are applicable to TPDs and support ML usage for TPDs' design, to potentially accelerate drug discovery.
Affiliation(s)
- Giulia Peteani
  - Novartis Biomedical Research, Novartis Campus, 4002, Basel, Switzerland
2. Silberstein J, Wellbrook M, Hannigan M. Utilization of a Low-Cost Sensor Array for Mobile Methane Monitoring. SENSORS (BASEL, SWITZERLAND) 2024; 24:519. [PMID: 38257613 PMCID: PMC10820073 DOI: 10.3390/s24020519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 01/05/2024] [Accepted: 01/10/2024] [Indexed: 01/24/2024]
Abstract
The use of low-cost sensors (LCSs) for the mobile monitoring of oil and gas emissions is an understudied application of low-cost air quality monitoring devices. To assess the efficacy of low-cost sensors as a screening tool for the mobile monitoring of fugitive methane emissions stemming from well sites in eastern Colorado, we colocated an array of low-cost sensors (XPOD) with a reference grade methane monitor (Aeris Ultra) on a mobile monitoring vehicle from 15 August through 27 September 2023. Fitting our low-cost sensor data with a bootstrap and aggregated random forest model, we found a high correlation between the reference and XPOD CH4 concentrations (r = 0.719) and a low experimental error (RMSD = 0.3673 ppm). Other calibration models, including multilinear regression and artificial neural networks (ANN), were either unable to distinguish individual methane spikes above baseline or had a significantly elevated error (RMSD_ANN = 0.4669 ppm) when compared to the random forest model. Using out-of-bag predictor permutations, we found that sensors that showed the highest correlation with methane displayed the greatest significance in our random forest model. As we reduced the percentage of colocation data employed in the random forest model, errors did not significantly increase until a specific threshold (50 percent of total calibration data). Using a peakfinding algorithm, we found that our model was able to predict 80 percent of methane spikes above 2.5 ppm throughout the duration of our field campaign, with a false response rate of 35 percent.
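The two agreement metrics quoted in this abstract, the Pearson correlation r and the root-mean-square deviation (RMSD), can be illustrated with a minimal sketch. The function names and the sample readings below are hypothetical and are not taken from the study; the sketch only shows how such metrics are commonly computed for paired reference and calibrated-sensor concentrations.

```python
import math

def pearson_r(reference, predicted):
    """Pearson correlation between paired reference and predicted values."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    cov = sum((a - mean_ref) * (b - mean_pred) for a, b in zip(reference, predicted))
    var_ref = sum((a - mean_ref) ** 2 for a in reference)
    var_pred = sum((b - mean_pred) ** 2 for b in predicted)
    return cov / math.sqrt(var_ref * var_pred)

def rmsd(reference, predicted):
    """Root-mean-square deviation, in the same units as the inputs (e.g. ppm)."""
    squared = [(a - b) ** 2 for a, b in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical paired CH4 readings in ppm (illustrative values only).
reference_ppm = [2.0, 2.1, 2.0, 3.5, 2.2]
predicted_ppm = [2.1, 2.0, 2.2, 3.0, 2.3]
print(pearson_r(reference_ppm, predicted_ppm), rmsd(reference_ppm, predicted_ppm))
```

RMSD is reported in concentration units, which is why the abstract can compare models directly in ppm, whereas r is unitless and insensitive to a constant offset between sensor and reference.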
Affiliation(s)
- Jonathan Silberstein
  - Department of Mechanical Engineering, University of Colorado at Boulder, 1111 Engineering Drive, Boulder, CO 80309, USA
- Matthew Wellbrook
  - Urban Labs, University of Chicago, 33 North LaSalle Street Suite 1600, Chicago, IL 60602, USA
- Michael Hannigan
  - Department of Mechanical Engineering, University of Colorado at Boulder, 1111 Engineering Drive, Boulder, CO 80309, USA
3. Cavasotto CN, Scardino V. Machine Learning Toxicity Prediction: Latest Advances by Toxicity End Point. ACS OMEGA 2022; 7:47536-47546. [PMID: 36591139 PMCID: PMC9798519 DOI: 10.1021/acsomega.2c05693] [Citation(s) in RCA: 53] [Impact Index Per Article: 17.7] [Received: 09/02/2022] [Accepted: 11/28/2022] [Indexed: 05/29/2023]
Abstract
Machine learning (ML) models to predict the toxicity of small molecules have garnered great attention and have become widely used in recent years. Computational toxicity prediction is particularly advantageous in the early stages of drug discovery in order to filter out molecules with high probability of failing in clinical trials. This has been helped by the increase in the number of large toxicology databases available. However, being an area of recent application, a greater understanding of the scope and applicability of ML methods is still necessary. There are various kinds of toxic end points that have been predicted in silico. Acute oral toxicity, hepatotoxicity, cardiotoxicity, mutagenicity, and the 12 Tox21 data end points are among the most commonly investigated. Machine learning methods exhibit different performances on different data sets due to dissimilar complexity, class distributions, or chemical space covered, which makes it hard to compare the performance of algorithms over different toxic end points. The general pipeline to predict toxicity using ML has already been analyzed in various reviews. In this contribution, we focus on the recent progress in the area and the outstanding challenges, making a detailed description of the state-of-the-art models implemented for each toxic end point. The type of molecular representation, the algorithm, and the evaluation metric used in each research work are explained and analyzed. A detailed description of end points that are usually predicted, their clinical relevance, the available databases, and the challenges they bring to the field are also highlighted.
Affiliation(s)
- Claudio N. Cavasotto
  - Computational Drug Design and Biomedical Informatics Laboratory, Instituto de Investigaciones en Medicina Traslacional (IIMT), CONICET-Universidad Austral, Pilar, B1629AHJ Buenos Aires, Argentina
  - Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, B1629AHJ Buenos Aires, Argentina
  - Facultad de Ciencias Biomédicas, Facultad de Ingeniería, Universidad Austral, Pilar, B1630FHB Buenos Aires, Argentina
- Valeria Scardino
  - Austral Institute for Applied Artificial Intelligence, Universidad Austral, Pilar, B1629AHJ Buenos Aires, Argentina
  - Meton AI, Inc., Wilmington, Delaware 19801, United States
4. Sandeep Ganesh G, Kolusu AS, Prasad K, Samudrala PK, Nemmani KV. Advancing health care via artificial intelligence: From concept to clinic. Eur J Pharmacol 2022; 934:175320. [DOI: 10.1016/j.ejphar.2022.175320] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2022] [Revised: 09/30/2022] [Accepted: 10/04/2022] [Indexed: 11/26/2022]
5. Yang S, Zeng L, Jin X, Lin H, Song J. Feature Genes in Neuroblastoma Distinguishing High-Risk and Non-high-Risk Neuroblastoma Patients: Development and Validation Combining Random Forest With Artificial Neural Network. Front Med (Lausanne) 2022; 9:882348. [PMID: 35911385 PMCID: PMC9336509 DOI: 10.3389/fmed.2022.882348] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/23/2022] [Accepted: 06/13/2022] [Indexed: 11/13/2022]
Abstract
There is a significant difference in prognosis among different risk groups. Therefore, it is of great significance to correctly identify the risk grouping of children. Using the genomic data of neuroblastoma samples in public databases, we used GSE49710 as the training set data to calculate the feature genes of the high-risk group and non-high-risk group samples based on the random forest (RF) algorithm and artificial neural network (ANN) algorithm. The screening results of RF showed that EPS8L1, PLCD4, CHD5, NTRK1, and SLC22A4 were the feature differentially expressed genes (DEGs) of high-risk neuroblastoma. The prediction model based on gene expression data in this study showed high overall accuracy and precision in both the training set and the test set (AUC = 0.998 in GSE49710 and AUC = 0.858 in GSE73517). Kaplan–Meier plotter showed that the overall survival and progression-free survival of patients in the low-risk subgroup were significantly better than those in the high-risk subgroup [HR: 3.86 (95% CI: 2.44–6.10) and HR: 3.03 (95% CI: 2.03–4.52), respectively]. Our ANN-based model has better classification performance than the SVM-based model and XGboost-based model. Nevertheless, more convincing data sets and machine learning algorithms will be needed to build diagnostic models for individual organization types in the future.
Affiliation(s)
- Sha Yang
  - Department of Surgery, Children’s Hospital of Chongqing Medical University, Chongqing, China
  - Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing, China
  - National Clinical Research Center for Child Health and Disorders, Chongqing, China
  - China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing, China
  - Chongqing Key Laboratory of Pediatrics, Chongqing, China
  - Chongqing Engineering Research Center of Stem Cell Therapy, Chongqing, China
  - Children’s Hospital of Chongqing Medical University, Chongqing, China
- Lingfeng Zeng
  - Department of Nephrology, The Second Xiangya Hospital of Central South University, Changsha, China
- Xin Jin
  - Ministry of Education Key Laboratory of Child Development and Disorders, Chongqing, China
  - National Clinical Research Center for Child Health and Disorders, Chongqing, China
  - China International Science and Technology Cooperation Base of Child Development and Critical Disorders, Chongqing, China
  - Chongqing Key Laboratory of Pediatrics, Chongqing, China
  - Chongqing Engineering Research Center of Stem Cell Therapy, Chongqing, China
  - Children’s Hospital of Chongqing Medical University, Chongqing, China
  - Department of Cardiacthoracic, Children’s Hospital of Chongqing Medical University, Chongqing, China
- Huapeng Lin
  - Department of Intensive Care Unit, Affiliated Hangzhou First People’s Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Jianning Song
  - Department of General Surgery, Guiqian International General Hospital, Guiyang, China
  - *Correspondence: Jianning Song
6. Hamzic S, Lewis R, Desrayaud S, Soylu C, Fortunato M, Gerebtzoff G, Rodríguez-Pérez R. Predicting In Vivo Compound Brain Penetration Using Multi-task Graph Neural Networks. J Chem Inf Model 2022; 62:3180-3190. [PMID: 35738004 DOI: 10.1021/acs.jcim.2c00412] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Indexed: 11/30/2022]
Abstract
Assessing whether compounds penetrate the brain can become critical in drug discovery, either to prevent adverse events or to reach the biological target. Generally, pre-clinical in vivo studies measuring the ratio of brain and blood concentrations (Kp) are required to estimate the brain penetration potential of a new drug entity. In this work, we developed machine learning models to predict in vivo compound brain penetration (as LogKp) from chemical structure. Our results show the benefit of including in vitro experimental data as auxiliary tasks in multi-task graph neural network (MT-GNN) models. MT-GNNs outperformed single-task (ST) models solely trained on in vivo brain penetration data. The best-performing MT-GNN regression model achieved a coefficient of determination of 0.42 and a mean absolute error of 0.39 (2.5-fold) on a prospective validation set and outperformed all tested ST models. To facilitate decision-making, compounds were classified into brain-penetrant or non-penetrant, achieving a Matthews correlation coefficient of 0.66. Taken together, our findings indicate that the inclusion of in vitro assay data as MT-GNN auxiliary tasks improves in vivo brain penetration predictions and prospective compound prioritization.
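The pairing of "0.39 (2.5-fold)" in this abstract is consistent with reading the mean absolute error on a base-10 logarithmic scale, since 10^0.39 is roughly 2.45. The sketch below makes that conversion explicit and adds the standard confusion-matrix formula for the Matthews correlation coefficient. The base-10 assumption and both function names are illustrative assumptions, not details confirmed by the abstract.

```python
import math

def fold_error(mae_log10):
    """Convert a mean absolute error in log10 units into a fold error.

    Assumes the modeled quantity (here LogKp) is a base-10 logarithm.
    """
    return 10 ** mae_log10

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(round(fold_error(0.39), 2))  # fold error implied by an MAE of 0.39 log units
```

Because the coefficient uses all four confusion-matrix cells, it remains informative when penetrant and non-penetrant compounds are unevenly represented, which is one reason it is often preferred over plain accuracy.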
Affiliation(s)
- Seid Hamzic
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Richard Lewis
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Sandrine Desrayaud
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Cihan Soylu
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Mike Fortunato
  - Novartis Institutes for BioMedical Research Inc., 181 Massachusetts Avenue, Cambridge, Massachusetts 02139, United States
- Grégori Gerebtzoff
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
- Raquel Rodríguez-Pérez
  - Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002 Basel, Switzerland
7. Rodríguez-Pérez R, Miljković F, Bajorath J. Machine Learning in Chemoinformatics and Medicinal Chemistry. Annu Rev Biomed Data Sci 2022; 5:43-65. [PMID: 35440144 DOI: 10.1146/annurev-biodatasci-122120-124216] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Indexed: 11/09/2022]
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
Affiliation(s)
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
  - Current affiliation: Novartis Institutes for Biomedical Research, Novartis Campus, Basel, Switzerland
- Filip Miljković
  - Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
  - Current affiliation: Data Science and AI, Imaging and Data Analytics, Clinical Pharmacology and Safety Sciences, R&D AstraZeneca, Gothenburg, Sweden
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT (Bonn-Aachen International Center for Information Technology), Chemical Biology and Medicinal Chemistry Program Unit, LIMES (Life and Medical Sciences Institute), Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany
8. Born J, Huynh T, Stroobants A, Cornell WD, Manica M. Active Site Sequence Representations of Human Kinases Outperform Full Sequence Representations for Affinity Prediction and Inhibitor Generation: 3D Effects in a 1D Model. J Chem Inf Model 2021; 62:240-257. [PMID: 34905358 DOI: 10.1021/acs.jcim.1c00889] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Indexed: 12/11/2022]
Abstract
Recent advances in deep learning have enabled the development of large-scale multimodal models for virtual screening and de novo molecular design. The human kinome with its abundant sequence and inhibitor data presents an attractive opportunity to develop proteochemometric models that exploit the size and internal diversity of this family of targets. Here, we challenge a standard practice in sequence-based affinity prediction models: instead of leveraging the full primary structure of proteins, each target is represented by a sequence of 29 discontiguous residues defining the ATP binding site. In kinase-ligand binding affinity prediction, our results show that the reduced active site sequence representation is not only computationally more efficient but consistently yields significantly higher performance than the full primary structure. This trend persists across different models, data sets, and performance metrics and holds true when predicting pIC50 for both unseen ligands and kinases. Our interpretability analysis reveals a potential explanation for the superiority of the active site models: whereas only mild statistical effects about the extraction of three-dimensional (3D) interaction sites take place in the full sequence models, the active site models are equipped with an implicit but strong inductive bias about the 3D structure stemming from the discontiguity of the active sites. Moreover, in direct comparisons, our models perform similarly or better than previous state-of-the-art approaches in affinity prediction. We then investigate a de novo molecular design task and find that the active site provides benefits in the computational efficiency, but otherwise, both kinase representations yield similar optimized affinities (for both SMILES- and SELFIES-based molecular generators). Our work challenges the assumption that the full primary structure is indispensable for modeling human kinases.
Affiliation(s)
- Jannis Born
  - IBM Research Europe, 8804 Rüschlikon, Switzerland
  - Department of Biosystems Science and Engineering, ETH Zurich, 4058 Basel, Switzerland
- Tien Huynh
  - IBM Research, Yorktown Heights, New York 10598, United States
- Astrid Stroobants
  - Department of Chemistry, Imperial College London, SW7 2AZ London, United Kingdom
- Wendy D Cornell
  - IBM Research, Yorktown Heights, New York 10598, United States
9. Recent Advances in In Silico Target Fishing. Molecules 2021; 26:molecules26175124. [PMID: 34500568 PMCID: PMC8433825 DOI: 10.3390/molecules26175124] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 07/30/2021] [Revised: 08/14/2021] [Accepted: 08/18/2021] [Indexed: 12/24/2022]
Abstract
In silico target fishing, whose aim is to identify possible protein targets for a query molecule, is an emerging approach used in drug discovery due to its wide variety of applications. This strategy allows the clarification of mechanism of action and biological activities of compounds whose target is still unknown. Moreover, target fishing can be employed for the identification of off targets of drug candidates, thus recognizing and preventing their possible adverse effects. For these reasons, target fishing has increasingly become a key approach for polypharmacology, drug repurposing, and the identification of new drug targets. While experimental target fishing can be lengthy and difficult to implement, due to the plethora of interactions that may occur for a single small-molecule with different protein targets, an in silico approach can be quicker, less expensive, more efficient for specific protein structures, and thus easier to employ. Moreover, the possibility to use it in combination with docking and virtual screening studies, as well as the increasing number of web-based tools that have been recently developed, make target fishing a more appealing method for drug discovery. It is especially worth underlining the increasing implementation of machine learning in this field, both as a main target fishing approach and as a further development of already applied strategies. This review reports on the main in silico target fishing strategies, belonging to both ligand-based and receptor-based approaches, developed and applied in the last years, with particular attention to the different web tools freely accessible by the scientific community for performing target fishing studies.
10. Jiménez-Luna J, Grisoni F, Weskamp N, Schneider G. Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opin Drug Discov 2021; 16:949-959. [PMID: 33779453 DOI: 10.1080/17460441.2021.1909567] [Citation(s) in RCA: 137] [Impact Index Per Article: 34.3] [Indexed: 02/06/2023]
Abstract
Introduction: Artificial intelligence (AI) has inspired computer-aided drug discovery. The widespread adoption of machine learning, in particular deep learning, in multiple scientific disciplines, and the advances in computing hardware and software, among other factors, continue to fuel this development. Much of the initial skepticism regarding applications of AI in pharmaceutical discovery has started to vanish, consequently benefitting medicinal chemistry. Areas covered: The current status of AI in chemoinformatics is reviewed. The topics discussed herein include quantitative structure-activity/property relationship and structure-based modeling, de novo molecular design, and chemical synthesis prediction. Advantages and limitations of current deep learning applications are highlighted, together with a perspective on next-generation AI for drug discovery. Expert opinion: Deep learning-based approaches have only begun to address some fundamental problems in drug discovery. Certain methodological advances, such as message-passing models, spatial-symmetry-preserving networks, hybrid de novo design, and other innovative machine learning paradigms, will likely become commonplace and help address some of the most challenging questions. Open data sharing and model development will play a central role in the advancement of drug discovery with AI.
Affiliation(s)
- José Jiménez-Luna
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
- Francesca Grisoni
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
- Nils Weskamp
  - Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an Der Riss, Germany
- Gisbert Schneider
  - Department of Chemistry and Applied Biosciences, ETH Zurich, Zurich, Switzerland
11. Evaluation of multi-target deep neural network models for compound potency prediction under increasingly challenging test conditions. J Comput Aided Mol Des 2021; 35:285-295. [PMID: 33598870 PMCID: PMC7982389 DOI: 10.1007/s10822-021-00376-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2020] [Accepted: 02/03/2021] [Indexed: 11/25/2022]
Abstract
Machine learning (ML) enables modeling of quantitative structure–activity relationships (QSAR) and compound potency predictions. Recently, multi-target QSAR models have been gaining increasing attention. Simultaneous compound potency predictions for multiple targets can be carried out using ensembles of independently derived target-based QSAR models or in a more integrated and advanced manner using multi-target deep neural networks (MT-DNNs). Herein, single-target and multi-target ML models were systematically compared on a large scale in compound potency value predictions for 270 human targets. By design, this large-magnitude evaluation has been a special feature of our study. To these ends, MT-DNN, single-target DNN (ST-DNN), support vector regression (SVR), and random forest regression (RFR) models were implemented. Different test systems were defined to benchmark these ML methods under conditions of varying complexity. Source compounds were divided into training and test sets in a compound- or analog series-based manner taking target information into account. Data partitioning approaches used for model training and evaluation were shown to influence the relative performance of ML methods, especially for the most challenging compound data sets. For example, the performance of MT-DNNs with per-target models yielded superior performance compared to single-target models. For a test compound or its analogs, the availability of potency measurements for multiple targets affected model performance, revealing the influence of ML synergies.
12. Brown J. Practical Chemogenomic Modeling and Molecule Discovery Strategies Unveiled by Active Learning. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11533-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
13. Norinder U, Spjuth O, Svensson F. Using Predicted Bioactivity Profiles to Improve Predictive Modeling. J Chem Inf Model 2020; 60:2830-2837. [PMID: 32374618 DOI: 10.1021/acs.jcim.0c00250] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/28/2022]
Abstract
Predictive modeling is a cornerstone in early drug development. Using information for multiple domains or across prediction tasks has the potential to improve the performance of predictive modeling. However, aggregating data often leads to incomplete data matrices that might be limiting for modeling. In line with previous studies, we show that by generating predicted bioactivity profiles, and using these as additional features, prediction accuracy of biological endpoints can be improved. Using conformal prediction, a type of confidence predictor, we present a robust framework for the calculation of these profiles and the evaluation of their impact. We report on the outcomes from several approaches to generate the predicted profiles on 16 datasets in cytotoxicity and bioactivity and show that efficiency is improved the most when including the p-values from conformal prediction as bioactivity profiles.
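For readers unfamiliar with the p-values mentioned here, the following is a minimal sketch of how an inductive conformal predictor commonly derives one for a candidate label: the test compound's nonconformity score is ranked against scores from a held-out calibration set. The function names, the example scores, and the choice of nonconformity measure are assumptions for illustration; the abstract does not specify the exact variant the authors used.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of calibration examples at least as nonconforming as the test example.

    The +1 terms count the test example itself, so the result is never zero.
    """
    at_least_as_strange = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least_as_strange + 1) / (len(calibration_scores) + 1)

def prediction_set(p_values_by_label, significance):
    """Labels whose p-value exceeds the chosen significance level."""
    return [label for label, p in p_values_by_label.items() if p > significance]

# Hypothetical nonconformity scores for one class (higher = less typical).
calibration = [0.1, 0.2, 0.3, 0.4]
p_active = conformal_p_value(calibration, 0.25)
print(p_active, prediction_set({"active": p_active, "inactive": 0.05}, 0.2))
```

In the setting the abstract describes, per-label p-values of this kind would be collected across many auxiliary models and appended to a compound's feature vector as its predicted bioactivity profile.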
Affiliation(s)
- Ulf Norinder
  - Department of Computer and Systems Sciences, Stockholm University, Box 7003, SE-164 07 Kista, Sweden
  - Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124 Uppsala, Sweden
  - MTM Research Centre, School of Science and Technology, Örebro University, SE-70182 Örebro, Sweden
- Ola Spjuth
  - Department of Pharmaceutical Biosciences, Uppsala University, Box 591, SE-75124 Uppsala, Sweden
  - Science for Life Laboratory, Uppsala University, Box 591, SE-75124 Uppsala, Sweden
- Fredrik Svensson
  - The Alzheimer's Research UK University College London Drug Discovery Institute, The Cruciform Building, Gower Street, WC1E 6BT London, U.K.
14. Rodríguez-Pérez R, Bajorath J. Interpretation of Compound Activity Predictions from Complex Machine Learning Models Using Local Approximations and Shapley Values. J Med Chem 2019; 63:8761-8777. [PMID: 31512867 DOI: 10.1021/acs.jmedchem.9b01101] [Citation(s) in RCA: 177] [Impact Index Per Article: 29.5] [Indexed: 01/27/2023]
Abstract
In qualitative or quantitative studies of structure-activity relationships (SARs), machine learning (ML) models are trained to recognize structural patterns that differentiate between active and inactive compounds. Understanding model decisions is challenging but of critical importance to guide compound design. Moreover, the interpretation of ML results provides an additional level of model validation based on expert knowledge. A number of complex ML approaches, especially deep learning (DL) architectures, have distinctive black-box character. Herein, a locally interpretable explanatory method termed Shapley additive explanations (SHAP) is introduced for rationalizing activity predictions of any ML algorithm, regardless of its complexity. Models resulting from random forest (RF), nonlinear support vector machine (SVM), and deep neural network (DNN) learning are interpreted, and structural patterns determining the predicted probability of activity are identified and mapped onto test compounds. The results indicate that SHAP has high potential for rationalizing predictions of complex ML models.
Affiliation(s)
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
  - Department of Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Birkendorfer Straße 65, 88397 Biberach an der Riß, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
15. Liu Z, Singh SB, Zheng Y, Lindblom P, Tice C, Dong C, Zhuang L, Zhao Y, Kruk BA, Lala D, Claremon DA, McGeehan GM, Gregg RD, Cain R. Discovery of Potent Inhibitors of 11β-Hydroxysteroid Dehydrogenase Type 1 Using a Novel Growth-Based Protocol of in Silico Screening and Optimization in CONTOUR. J Chem Inf Model 2019; 59:3422-3436. [PMID: 31355641 DOI: 10.1021/acs.jcim.9b00198] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 02/06/2023]
Affiliation(s)
- Zhijie Liu
  - Allergan Plc, 2525 Dupont Drive, Irvine, California 92612, United States
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Suresh B. Singh
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Yajun Zheng
  - Allergan Plc, 2525 Dupont Drive, Irvine, California 92612, United States
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Peter Lindblom
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Colin Tice
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Chengguo Dong
  - Allergan Plc, 2525 Dupont Drive, Irvine, California 92612, United States
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Linghang Zhuang
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Yi Zhao
  - Allergan Plc, 2525 Dupont Drive, Irvine, California 92612, United States
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Barbara A. Kruk
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Deepak Lala
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- David A. Claremon
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Gerard M. McGeehan
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Richard D. Gregg
  - Vitae Pharmaceuticals, Inc., 502 West Office Center Drive, Fort Washington, Pennsylvania 19034, United States
- Robert Cain
  - Allergan Plc, 2525 Dupont Drive, Irvine, California 92612, United States
|
16
|
Applicability Domain of Active Learning in Chemical Probe Identification: Convergence in Learning from Non-Specific Compounds and Decision Rule Clarification. Molecules 2019; 24:molecules24152716. [PMID: 31357419 PMCID: PMC6696588 DOI: 10.3390/molecules24152716] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 06/30/2019] [Revised: 07/19/2019] [Accepted: 07/24/2019] [Indexed: 12/27/2022] Open
Abstract
Efficient identification of chemical probes for the manipulation and understanding of biological systems demands specificity for target proteins. Computational means to optimize candidate compound selection for experimental selectivity evaluation are being sought. The active learning virtual screening method has demonstrated the ability to efficiently converge on predictive models with reduced datasets, though its applicability domain to probe identification has yet to be determined. In this article, we challenge active learning’s ability to predict inhibitory bioactivity profiles of selective compounds when learning from chemogenomic features found in non-selective ligand-target pairs. Comparison of controls versus multiple molecule representations de-convolutes factors contributing to predictive capability. Experiments using the matrix metalloproteinase family demonstrate maximum probe bioactivity prediction achieved from only approximately 20% of non-probe bioactivity; this data volume is consistent with prior chemogenomic active learning studies despite the increased difficulty from chemical biology experimental settings used here. Feature weight analyses are combined with a custom visualization to unambiguously detail how active learning arrives at classification decisions, yielding clarified expectations for chemogenomic modeling. The results influence tactical decisions for computational probe design and discovery.
|
17
|
Bhhatarai B, Walters WP, Hop CECA, Lanza G, Ekins S. Opportunities and challenges using artificial intelligence in ADME/Tox. Nat Mater 2019; 18:418-422. [PMID: 31000801 PMCID: PMC6594826 DOI: 10.1038/s41563-019-0332-5] [Citation(s) in RCA: 58] [Impact Index Per Article: 9.7] [Indexed: 05/17/2023]
Abstract
A recent conference organized a panel of scientists representing small and big pharma companies, who work at the interface of machine learning (ML) and absorption, distribution, metabolism, excretion, and toxicology (ADME/Tox). With the recent rebirth of AI related to pharma, it is timely to present this collaborative commentary to capture the diverging opinions on the past, present and future role of AI for ADME/Tox and how it can be applied in newer areas such as nanomaterials.
Affiliation(s)
- Barun Bhhatarai
  - Novartis Institutes for Biomedical Research, Cambridge, MA, USA
- Sean Ekins
  - Collaborations Pharmaceuticals Inc., Raleigh, NC, USA
|
18
|
Zorn KM, Lane TR, Russo DP, Clark AM, Makarov V, Ekins S. Multiple Machine Learning Comparisons of HIV Cell-based and Reverse Transcriptase Data Sets. Mol Pharm 2019; 16:1620-1632. [PMID: 30779585 DOI: 10.1021/acs.molpharmaceut.8b01297] [Citation(s) in RCA: 44] [Impact Index Per Article: 7.3] [Indexed: 02/06/2023]
Abstract
The human immunodeficiency virus (HIV) causes over a million deaths every year and has a huge economic impact in many countries. The first class of drugs approved were nucleoside reverse transcriptase inhibitors. A newer generation of reverse transcriptase inhibitors have become susceptible to drug resistant strains of HIV, and hence, alternatives are urgently needed. We have recently pioneered the use of Bayesian machine learning to generate models with public data to identify new compounds for testing against different disease targets. The current study has used the NIAID ChemDB HIV, Opportunistic Infection and Tuberculosis Therapeutics Database for machine learning studies. We curated and cleaned data from HIV-1 wild-type cell-based and reverse transcriptase (RT) DNA polymerase inhibition assays. Compounds from this database with ≤1 μM HIV-1 RT DNA polymerase activity inhibition and cell-based HIV-1 inhibition are correlated (Pearson r = 0.44, n = 1137, p < 0.0001). Models were trained using multiple machine learning approaches (Bernoulli Naive Bayes, AdaBoost Decision Tree, Random Forest, support vector classification, k-Nearest Neighbors, and deep neural networks as well as consensus approaches) and then their predictive abilities were compared. Our comparison of different machine learning methods demonstrated that support vector classification, deep learning, and a consensus were generally comparable and not significantly different from each other using 5-fold cross validation and using 24 training and test set combinations. This study demonstrates findings in line with our previous studies for various targets that training and testing with multiple data sets does not demonstrate a significant difference between support vector machine and deep neural networks.
Affiliation(s)
- Kimberley M Zorn
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Thomas R Lane
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
- Daniel P Russo
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
  - The Rutgers Center for Computational and Integrative Biology, Camden, New Jersey 08102, United States
- Alex M Clark
  - Molecular Materials Informatics, Inc., 2234 Duvernay Street, Montreal, Quebec H3J2Y3, Canada
- Vadim Makarov
  - Bach Institute of Biochemistry, Research Center of Biotechnology of the Russian Academy of Sciences, Leninsky Prospekt 33-2, Moscow 119071, Russia
- Sean Ekins
  - Collaborations Pharmaceuticals, Inc., Main Campus Drive, Lab 3510, Raleigh, North Carolina 27606, United States
|