1
Baidya AT, Dante D, Das B, Wang L, Darreh-Shori T, Kumar R. Discovery and characterization of novel pyridone and furan substituted ligands of choline acetyltransferase. Eur J Pharmacol 2025; 998:177638. [PMID: 40252901 DOI: 10.1016/j.ejphar.2025.177638] [Received: 01/17/2025] [Revised: 04/16/2025] [Accepted: 04/16/2025] [Indexed: 04/21/2025]
Abstract
The key to managing two devastating diseases, Alzheimer's disease (AD) and amyotrophic lateral sclerosis (ALS), lies in early diagnosis, which is difficult because of their multifactorial nature. However, a common hallmark of AD and ALS is degeneration of the cholinergic system. Choline acetyltransferase (ChAT) has been proposed as a potential target for the development of a cholinergic-specific biomarker. However, the lack of selective, potent, brain-permeable molecular probes of ChAT hinders the development of ChAT biomarkers. In this study, we used a structure-based virtual screening approach and identified two ChAT inhibitors from a database of 1.4 million compounds. The compounds were then subjected to rigorous in vitro characterization. Compound V6 showed a Ki value of 11 μM and an IC50 value of 21.73 μM, while V15 showed Ki and IC50 values of 4.5 and 9.42 μM, respectively, for the ChAT enzyme. V6 and V15 showed good solubility of 0.21 mg/mL and 0.17 mg/mL, respectively, and cytotoxicity analysis indicated no toxicity. We also performed a 200 ns molecular dynamics simulation, which revealed the intricate interaction dynamics of V6 and V15 with the ChAT binding pocket. Moreover, the Tanimoto similarity analysis indicated the novelty and structural diversity of the hits. In conclusion, these validated hits provide a platform to develop potent, selective, blood-brain barrier permeable small molecules as chemical probes of ChAT or as Positron Emission Tomography tracers for early diagnosis and/or in vivo monitoring of the effect of new therapeutic candidates in the spectrum of neurodegenerative disorders in which a cholinergic deficit is one of the hallmarks.
Affiliation(s)
- Anurag Tk Baidya
  - Department of Pharmaceutical Engineering & Technology, Indian Institute of Technology (B.H.U.), Varanasi, 221005, U.P., India
- Davide Dante
  - Division of Clinical Geriatrics, Centre for Alzheimer Research, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, 141 52, Stockholm, Sweden
- Bhanuranjan Das
  - Department of Pharmaceutical Engineering & Technology, Indian Institute of Technology (B.H.U.), Varanasi, 221005, U.P., India
- Lisha Wang
  - Department of Neurobiology, Care Sciences and Society, Division of Neurogeriatrics, Karolinska Institutet, 17164, Solna, Sweden
- Taher Darreh-Shori
  - Division of Clinical Geriatrics, Centre for Alzheimer Research, Department of Neurobiology, Care Sciences and Society, Karolinska Institutet, 141 52, Stockholm, Sweden
- Rajnish Kumar
  - Department of Pharmaceutical Engineering & Technology, Indian Institute of Technology (B.H.U.), Varanasi, 221005, U.P., India
2
Lamens A, Bajorath J. Comparing Explanations of Molecular Machine Learning Models Generated with Different Methods for the Calculation of Shapley Values. Mol Inform 2025; 44:e202500067. [PMID: 40112199 PMCID: PMC11925390 DOI: 10.1002/minf.202500067] [Received: 03/04/2025] [Revised: 03/04/2025] [Accepted: 03/06/2025] [Indexed: 03/22/2025]
Abstract
Feature attribution methods from explainable artificial intelligence (XAI) provide explanations of machine learning models by quantifying feature importance for predictions of test instances. While features determining individual predictions have frequently been identified in machine learning applications, the consistency of feature importance-based explanations of machine learning models using different attribution methods has not been thoroughly investigated. We have systematically compared model explanations in molecular machine learning. To this end, a test system of highly accurate compound activity predictions for different targets using different machine learning methods was generated. For these predictions, explanations were computed using methodological variants of the Shapley value formalism, a popular feature attribution approach in machine learning adapted from game theory. Predictions of each model were assessed using a model-agnostic and a model-specific Shapley value-based method. The resulting feature importance distributions were characterized and compared by a global statistical analysis using diverse measures. Unexpectedly, methodological variants for Shapley value calculations yielded distinct feature importance distributions for highly accurate predictions. There was only limited agreement between alternative model explanations. Our findings suggest that feature importance-based explanations of machine learning predictions should include an assessment of consistency using alternative methods.
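For readers unfamiliar with the Shapley value formalism mentioned in this abstract, the following is a minimal sketch of the exact game-theoretic definition: each feature's value is its marginal contribution averaged over all feature orderings. It is not the model-agnostic or model-specific approximation compared in the paper. The feature names and the `toy_value` function are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def exact_shapley(features, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = frozenset()
        for f in order:
            with_f = present | {f}
            # Marginal contribution of f given the features already present.
            totals[f] += value(with_f) - value(present)
            present = with_f
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_value(present):
    """Hypothetical stand-in for a model output given a set of present features."""
    score = 0.0
    if "ring" in present:
        score += 1.0
    if "amide" in present:
        score += 2.0
    if "ring" in present and "halogen" in present:
        score += 3.0  # interaction term, shared between both features
    return score


phi = exact_shapley(["ring", "amide", "halogen"], toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.2f}")
```

The values sum to the difference between the output for all features and the output for none, which is the efficiency property of Shapley values. Exact enumeration scales factorially with the number of features, which is why practical methods rely on approximations that need not agree with one another.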
Affiliation(s)
- Alec Lamens
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
3
Xerxa E, Vogt M, Bajorath J. Influence of Data Curation and Confidence Levels on Compound Predictions Using Machine Learning Models. J Chem Inf Model 2024; 64:9341-9349. [PMID: 39656869 DOI: 10.1021/acs.jcim.4c01573] [Indexed: 12/17/2024]
Abstract
While data curation principles and practices are a major topic in data science, they are often not explicitly considered in machine learning (ML) applications in chemistry. We have been interested in evaluating the potential effects of data curation on the performance of molecular ML models. Therefore, a sequential curation scheme was developed for compounds and activity data, and different ML classification models were generated at increasing data confidence levels and evaluated. Sequential data curation was found to systematically increase classification performance in an incremental manner due to cumulative effects of individual data curation criteria. The analysis of chemical space distributions of compound subsets at different data confidence levels revealed that the separation of compounds with different class labels in chemical space generally increased during sequential activity data curation, which was mostly due to subsequent elimination of singletons rather than compounds from analogue series. These findings provided a rationale for increasing the classification performance of ML models as a consequence of increasingly stringent data curation. Taken together, the results reported herein suggest that further attention should be paid to varying data curation and confidence levels when deriving and assessing ML models for chemical applications.
Affiliation(s)
- Elena Xerxa
  - B-IT, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
- Martin Vogt
  - B-IT, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
- Jürgen Bajorath
  - B-IT, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
  - LIMES Institute, Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, Bonn D-53115, Germany
4
Jia L, Brémond É, Zaida L, Gaüzère B, Tognetti V, Joubert L. Predicting redox potentials by graph-based machine learning methods. J Comput Chem 2024; 45:2383-2396. [PMID: 38923574 DOI: 10.1002/jcc.27380] [Received: 11/08/2023] [Revised: 03/25/2024] [Accepted: 04/19/2024] [Indexed: 06/28/2024]
Abstract
The evaluation of oxidation and reduction potentials is a pivotal task in various chemical fields. However, their accurate prediction by theoretical computations, which is a complementary task and sometimes the only alternative to experimental measurement, is often resource-intensive and time-consuming. This paper addresses this challenge through the application of machine learning techniques, with a particular focus on graph-based methods (such as graph edit distances, graph kernels, and graph neural networks) that are reviewed to highlight their deep links with theoretical chemistry. To this aim, we establish the ORedOx159 database, a comprehensive, homogeneous (with reference values stemming from density functional theory calculations), and reliable resource containing 318 one-electron reduction and oxidation reactions and featuring 159 large organic compounds. Subsequently, we provide an instructive overview of good practice in machine learning and of commonly utilized machine learning models. We then assess their predictive performances on the ORedOx159 dataset through extensive analyses. Our simulations using descriptors that are computed in an almost instantaneous way result in a notable improvement in prediction accuracy, with mean absolute error (MAE) values equal to 5.6 kcal mol⁻¹ for reduction and 7.2 kcal mol⁻¹ for oxidation potentials, which paves the way toward efficient in silico design of new electrochemical systems.
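The mean absolute error (MAE) quoted in this abstract is the average magnitude of the difference between predicted and reference values. A minimal sketch follows; the numbers are invented for illustration and are not taken from the ORedOx159 dataset.

```python
def mean_absolute_error(reference, predicted):
    """Average absolute deviation between paired reference and predicted values."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


# Illustrative values only (units would be kcal/mol in the paper's setting).
reference_values = [10.0, 20.0, 30.0]
predicted_values = [12.0, 17.0, 31.0]
mae = mean_absolute_error(reference_values, predicted_values)
print(f"MAE = {mae:.2f}")
```

Because the MAE is expressed in the same units as the predicted quantity, it can be compared directly with the accuracy expected of the reference calculations.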
Affiliation(s)
- Linlin Jia
  - The PRG Group, Institute of Computer Science, University of Bern, Bern, Switzerland
- Éric Brémond
  - Université Paris Cité, ITODYS, CNRS, Paris, France
- Benoit Gaüzère
  - LITIS, Univ Rouen Normandie, INSA Rouen Normandie, Université Le Havre Normandie, Normandie Univ, Rouen, France
- Vincent Tognetti
  - Normandy Univ., COBRA UMR 6014 & FR 3038, Université de Rouen, INSA Rouen, CNRS, Mont St Aignan Cedex, France
- Laurent Joubert
  - Normandy Univ., COBRA UMR 6014 & FR 3038, Université de Rouen, INSA Rouen, CNRS, Mont St Aignan Cedex, France
5
Yao Y, Oberhofer H. Designing building blocks of covalent organic frameworks through on-the-fly batch-based Bayesian optimization. J Chem Phys 2024; 161:074102. [PMID: 39145552 DOI: 10.1063/5.0223540] [Received: 06/15/2024] [Accepted: 07/30/2024] [Indexed: 08/16/2024] Open
Abstract
In this work, we use a Bayesian optimization (BO) algorithm to sample the space of covalent organic framework (COF) components aimed at the design of COFs with a high hole conductivity. COFs are crystalline, often porous coordination polymers, where organic molecular units, called building blocks (BBs), are connected by covalent bonds. Even though we limit ourselves here to a space of three-fold symmetric BBs forming two-dimensional COF sheets, their design space is still much too large to be sampled by traditional means through evaluating the properties of each element in this space from first principles. In order to ensure valid BBs, we use a molecular generation algorithm that, by construction, leads to rigid three-fold symmetric molecules. The BO approach then trains two distinct surrogate models for two conductivity properties, level alignment vs a reference electrode and reorganization free energy, which are combined in a fitness function as the objective that evaluates BBs' conductivities. These continuously improving surrogates allow the prediction of a material's properties at a low computational cost. This allows us to select promising candidates which, together with candidates that are very different from the molecules already sampled, form the updated training sets of the surrogate models. In the course of 20 such training steps, we find a number of promising candidates, some being only variations on already known motifs and others being completely novel. Finally, we subject the six best such candidates to a computational reverse synthesis analysis to gauge their real-world synthesizability.
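The batch update described in this abstract combines candidates the surrogate rates highly with candidates that differ strongly from molecules already sampled. The sketch below illustrates that general idea only and is not the authors' algorithm: the feature vectors, the `predict` surrogate, and the max-min Euclidean distance used as a novelty measure are all assumptions made for the example.

```python
import math


def select_batch(candidates, sampled, predict, n_exploit, n_explore):
    """Pick top-scoring candidates plus the ones farthest from sampled points.

    candidates: dict mapping a name to a feature vector (tuple of floats)
    sampled:    non-empty list of feature vectors already evaluated
    predict:    surrogate returning a fitness estimate for a feature vector
    """
    ranked = sorted(candidates, key=lambda name: predict(candidates[name]), reverse=True)
    exploit = ranked[:n_exploit]
    remaining = [name for name in ranked if name not in exploit]

    def novelty(name):
        # Distance to the closest already-sampled point (larger means more novel).
        return min(math.dist(candidates[name], s) for s in sampled)

    explore = sorted(remaining, key=novelty, reverse=True)[:n_explore]
    return exploit, explore


# Hypothetical two-dimensional descriptors and a trivial surrogate.
candidates = {"A": (0.9, 0.1), "B": (0.8, 0.2), "C": (0.1, 5.0), "D": (0.2, 0.3)}
sampled = [(1.0, 0.0), (0.5, 0.5)]
exploit, explore = select_batch(candidates, sampled, lambda x: x[0], n_exploit=2, n_explore=1)
print("exploit:", exploit, "explore:", explore)
```

In a full loop, the selected batch would be evaluated with the expensive reference method, added to the training set, and the surrogate retrained before the next selection.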
Affiliation(s)
- Yuxuan Yao
  - Department of Chemistry, TUM School of Natural Sciences, Technical University Munich, Lichtenbergstr. 4, 85748 Garching b. München, Germany
  - Chair for Theoretical Physics VII and Bavarian Center for Battery Technology, University of Bayreuth, Universitätsstr. 30, D-95447 Bayreuth, Germany
- Harald Oberhofer
  - Chair for Theoretical Physics VII and Bavarian Center for Battery Technology, University of Bayreuth, Universitätsstr. 30, D-95447 Bayreuth, Germany
6
Venkatraman V, Gaiser J, Demekas D, Roy A, Xiong R, Wheeler TJ. Do Molecular Fingerprints Identify Diverse Active Drugs in Large-Scale Virtual Screening? (No). Pharmaceuticals (Basel) 2024; 17:992. [PMID: 39204097 PMCID: PMC11356940 DOI: 10.3390/ph17080992] [Received: 06/08/2024] [Revised: 07/18/2024] [Accepted: 07/23/2024] [Indexed: 09/03/2024] Open
Abstract
Computational approaches for small-molecule drug discovery now regularly scale to the consideration of libraries containing billions of candidate small molecules. One promising approach to increase the speed of evaluating billion-molecule libraries is to develop succinct representations of each molecule that enable the rapid identification of molecules with similar properties. Molecular fingerprints are thought to provide a mechanism for producing such representations. Here, we explore the utility of commonly used fingerprints in the context of predicting similar molecular activity. We show that fingerprint similarity provides little discriminative power between active and inactive molecules for a target protein based on a known active: while they may sometimes provide some enrichment for active molecules in a drug screen, a screened data set will still be dominated by inactive molecules. We also demonstrate that high-similarity actives appear to share a scaffold with the query active, meaning that they could more easily be identified by structural enumeration. Furthermore, even when limited to only active molecules, fingerprint similarity values do not correlate with compound potency. In sum, these results highlight the need for a new wave of molecular representations that will improve the capacity to detect biologically active molecules based on their similarity to other such molecules.
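Fingerprint similarity of the kind evaluated in this abstract is commonly measured with the Tanimoto coefficient, the ratio of shared "on" bits to the total number of distinct "on" bits in two fingerprints. The sketch below represents each fingerprint as a set of bit indices; the library entries and bit values are hypothetical, and a real workflow would derive fingerprints with a cheminformatics toolkit.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def rank_by_similarity(query, library):
    """Return (similarity, name) pairs sorted from most to least similar."""
    return sorted(((tanimoto(query, fp), name) for name, fp in library.items()), reverse=True)


# Hypothetical fingerprints; indices stand for hashed substructure bits.
query = {1, 2, 3, 4}
library = {"x": {1, 2, 3, 5}, "y": {7, 8}, "z": {1, 2, 3, 4}}
ranking = rank_by_similarity(query, library)
print(ranking)
```

A high rank in such a list reflects structural overlap with the query only; as the abstract argues, it does not by itself indicate activity or potency.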
Affiliation(s)
- Vishwesh Venkatraman
  - Department of Chemistry, Norwegian University of Science and Technology, 7034 Trondheim, Norway
- Jeremiah Gaiser
  - School of Information, University of Arizona, Tucson, AZ 85721, USA
- Daphne Demekas
  - R. Ken Coit College Pharmacy, University of Arizona, Tucson, AZ 85721, USA
- Amitava Roy
  - Rocky Mountain Laboratories, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT 59840, USA
  - Department of Biomedical and Pharmaceutical Sciences, University of Montana, Missoula, MT 59812, USA
- Rui Xiong
  - Department of Pharmacology & Toxicology, University of Arizona, Tucson, AZ 85721, USA
- Travis J. Wheeler
  - R. Ken Coit College Pharmacy, University of Arizona, Tucson, AZ 85721, USA
7
Maeda I, Tamura S, Ogura Y, Serizawa T, Shimada T, Kunimoto R, Miyao T. Scaffold-Hopped Compound Identification by Ligand-Based Approaches with a Prospective Affinity Test. J Chem Inf Model 2024; 64:5557-5569. [PMID: 38950192 PMCID: PMC11267578 DOI: 10.1021/acs.jcim.4c00342] [Received: 02/28/2024] [Revised: 06/05/2024] [Accepted: 06/18/2024] [Indexed: 07/03/2024]
Abstract
Scaffold-hopped (SH) compounds are bioactive compounds structurally different from known active compounds. Identifying SH compounds in the ligand-based approaches has been a central issue in medicinal chemistry, and various molecular representations of scaffold hopping have been proposed. However, appropriate representations for SH compound identification remain unclear. Herein, the ability of SH compound identification among several representations was fairly evaluated based on retrospective validation and prospective demonstration. In the retrospective validation, the combinations of two screening algorithms and four two- and three-dimensional molecular representations were compared using controlled data sets for the early identification of SH compounds. We found that the combination of the support vector machine and extended connectivity fingerprint with bond diameter 4 (SVM-ECFP4) and SVM and the rapid overlay of chemical structures (SVM-ROCS) showed a relatively high performance. The compounds that were highly ranked by SVM-ROCS did not share substructures with the active training compounds, while those ranked by SVM-ECFP4 were mostly recombinant. In the prospective demonstration, 93 SH compounds were prepared by screening the Namiki database using SVM-ROCS, targeting ABL1 inhibitors. The primary screening using surface plasmon resonance suggested five active compounds; however, in the competitive binding assays with adenosine triphosphate, no hits were found.
Affiliation(s)
- Itsuki Maeda
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Shunsuke Tamura
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Yoshihiro Ogura
  - Medicinal Chemistry Research Laboratories, R&D Division, Daiichi Sankyo Co., Ltd., 1-2-58 Hiromachi, Shinagawa-ku, Tokyo 140-8710, Japan
- Takayuki Serizawa
  - Medicinal Chemistry Research Laboratories, R&D Division, Daiichi Sankyo Co., Ltd., 1-2-58 Hiromachi, Shinagawa-ku, Tokyo 140-8710, Japan
- Takashi Shimada
  - Structure-Based Drug Design Group, Organic & Biomolecular Chemistry Department, Daiichi Sankyo RD Novare Co., Ltd., 1-16-13 Kitakasai, Edogawa-ku, Tokyo 134-8630, Japan
- Ryo Kunimoto
  - Discovery Intelligence Research Laboratories, R&D Division, Daiichi Sankyo Co., Ltd., 1-2-58 Hiromachi, Shinagawa-ku, Tokyo 140-8710, Japan
- Tomoyuki Miyao
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
8
Li J, Wang J, Wu L, Wang X, Luo X, Xu Y. AMHGCN: Adaptive multi-level hypergraph convolution network for human motion prediction. Neural Netw 2024; 172:106153. [PMID: 38306784 DOI: 10.1016/j.neunet.2024.106153] [Received: 06/21/2023] [Revised: 11/20/2023] [Accepted: 01/28/2024] [Indexed: 02/04/2024]
Abstract
Human motion prediction is the key technology for many real-life applications, e.g., self-driving and human-robot interaction. The recent approaches adopt the unrestricted full-connection graph representation to capture the relationships inside the human skeleton. However, there are two issues to be solved: (i) these unrestricted full-connection graph representation methods neglect the inherent dependencies across the joints of the human body; (ii) these methods represent human motions using the features extracted from a single level and thus can neither fully exploit the various connection relationships among the human body nor guarantee the human motion prediction results to be reasonable. To tackle the above issues, we propose an adaptive multi-level hypergraph convolution network (AMHGCN), which uses the adaptive multi-level hypergraph representation to capture various dependencies among the human body. Our method has four different levels of hypergraph representations, including (i) the joint-level hypergraph representation to capture inherent kinetic dependencies in the human body, (ii) the part-level hypergraph representation to exploit the kinetic characteristics at a higher level (in comparison to the joint-level) by viewing some part of the human body as an entirety, (iii) the component-level hypergraph representation to model the semantic information, and (iv) the global-level hypergraph representation to extract long-distance dependencies in the human body. In addition, to take full advantage of the knowledge carried in the training data, we propose a reverse loss (i.e., adopting the future human poses to predict the historical poses reversely) to realize data augmentation. Extensive experiments show that our proposed AMHGCN can achieve state-of-the-art performance on three benchmarks, i.e., Human3.6M, CMU-Mocap, and 3DPW.
Affiliation(s)
- Jinkai Li
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China
- Jinghua Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China
- Lian Wu
  - School of Mathematics and Big Data, GuiZhou Education University, Guiyang 550018, China
- Xin Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China
- Xiaoling Luo
  - College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
- Yong Xu
  - School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Shenzhen 518055, China; Peng Cheng Laboratory, Shenzhen 518055, China
9
Boldini D, Ballabio D, Consonni V, Todeschini R, Grisoni F, Sieber SA. Effectiveness of molecular fingerprints for exploring the chemical space of natural products. J Cheminform 2024; 16:35. [PMID: 38528548 DOI: 10.1186/s13321-024-00830-3] [Received: 12/20/2023] [Accepted: 03/17/2024] [Indexed: 03/27/2024] Open
Abstract
Natural products are a diverse class of compounds with promising biological properties, such as high potency and excellent selectivity. However, they have different structural motifs than typical drug-like compounds, e.g., a wider range of molecular weight, multiple stereocenters and a higher fraction of sp3-hybridized carbons. This makes the encoding of natural products via molecular fingerprints difficult, thus restricting their use in cheminformatics studies. To tackle this issue, we explored over 30 years of research to systematically evaluate which molecular fingerprint provides the best performance on the natural product chemical space. We considered 20 molecular fingerprints from four different sources, which we then benchmarked on over 100,000 unique natural products from the COCONUT (COlleCtion of Open Natural prodUcTs) and CMNPD (Comprehensive Marine Natural Products Database) databases. Our analysis focused on the correlation between different fingerprints and their classification performance on 12 bioactivity prediction datasets. Our results show that different encodings can provide fundamentally different views of the natural product chemical space, leading to substantial differences in pairwise similarity and performance. While Extended Connectivity Fingerprints are the de facto option for encoding drug-like compounds, other fingerprints were found to match or outperform them for bioactivity prediction of natural products. These results highlight the need to evaluate multiple fingerprinting algorithms for optimal performance and suggest new areas of research. Finally, we provide an open-source Python package for computing all molecular fingerprints considered in the study, as well as data and scripts necessary to reproduce the results, at https://github.com/dahvida/NP_Fingerprints.
Affiliation(s)
- Davide Boldini
  - TUM School of Natural Sciences, Department of Bioscience, Technical University of Munich, Center for Functional Protein Assemblies (CPA), 85748, Garching bei München, Germany
- Davide Ballabio
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Viviana Consonni
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Roberto Todeschini
  - Milano Chemometrics and QSAR Research Group, Department of Earth and Environmental Sciences, University of Milano-Bicocca, P.zza Della Scienza, 1, 20126, Milan, Italy
- Francesca Grisoni
  - Institute for Complex Molecular Systems and Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, Netherlands
  - Centre for Living Technologies, Alliance TU/e, WUR, UU, UMC Utrecht, Utrecht, Netherlands
- Stephan A Sieber
  - TUM School of Natural Sciences, Department of Bioscience, Technical University of Munich, Center for Functional Protein Assemblies (CPA), 85748, Garching bei München, Germany
10
Lamens A, Bajorath J. Generation of Molecular Counterfactuals for Explainable Machine Learning Based on Core-Substituent Recombination. ChemMedChem 2024; 19:e202300586. [PMID: 37983655 DOI: 10.1002/cmdc.202300586] [Received: 10/27/2023] [Revised: 11/20/2023] [Accepted: 11/20/2023] [Indexed: 11/22/2023]
Abstract
The use of black box machine learning models whose decisions cannot be understood limits the acceptance of predictions in interdisciplinary research and camouflages artificial learning characteristics leading to predictions for other than anticipated reasons. Consequently, there is increasing interest in explainable artificial intelligence to rationalize predictions and uncover potential pitfalls. Among others, relevant approaches include feature attribution methods to identify molecular structures determining predictions and counterfactuals (CFs) or contrastive explanations. CFs are defined as variants of test instances with minimal modifications leading to opposing predictions. In medicinal chemistry, CFs have thus far received little attention, although they are particularly intuitive from a chemical perspective. We introduce a new methodology for the systematic generation of CFs that is centered on well-defined structural analogues of test compounds. The approach is transparent, computationally straightforward, and shown to provide a wealth of CFs for test sets. The method is made freely available.
Affiliation(s)
- Alec Lamens
  - Department of Life Science Informatics and Data Science B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
11
Colliandre L, Muller C. Bayesian Optimization in Drug Discovery. Methods Mol Biol 2024; 2716:101-136. [PMID: 37702937 DOI: 10.1007/978-1-0716-3449-3_5] [Indexed: 09/14/2023]
Abstract
Drug discovery deals with the search for initial hits and their optimization toward a targeted clinical profile. Throughout the discovery pipeline, the candidate profile will evolve, but the optimization will mainly remain a trial-and-error approach. Many in silico methods have been developed to improve and speed up this pipeline. Bayesian optimization (BO) is a well-known method for the determination of the global optimum of a function. In the last decade, BO has gained popularity in the early drug design phase. This chapter starts with the concept of black box optimization applied to drug design and presents some approaches to tackle it. Then it focuses on BO and explains its principle and all the algorithmic building blocks needed to implement it. This explanation aims to be accessible to people involved in drug discovery projects. Strong emphasis is placed on the solutions to deal with the specific constraints of drug discovery. Finally, a large set of practical applications of BO is highlighted.
12
Galati S, Di Stefano M, Bertini S, Granchi C, Giordano A, Gado F, Macchia M, Tuccinardi T, Poli G. Identification of New GSK3β Inhibitors through a Consensus Machine Learning-Based Virtual Screening. Int J Mol Sci 2023; 24:17233. [PMID: 38139062 PMCID: PMC10743990 DOI: 10.3390/ijms242417233] [Received: 11/13/2023] [Revised: 12/05/2023] [Accepted: 12/06/2023] [Indexed: 12/24/2023] Open
Abstract
Glycogen synthase kinase-3 beta (GSK3β) is a serine/threonine kinase that plays key roles in glycogen metabolism, Wnt/β-catenin signaling cascade, synaptic modulation, and multiple autophagy-related signaling pathways. GSK3β is an attractive target for drug discovery since its aberrant activity is involved in the development of neurodegenerative diseases such as Alzheimer's and Parkinson's disease. In the present study, multiple machine learning models aimed at identifying novel GSK3β inhibitors were developed and evaluated for their predictive reliability. The most powerful models were combined in a consensus approach, which was used to screen about 2 million commercial compounds. Our consensus machine learning-based virtual screening led to the identification of compounds G1 and G4, which showed inhibitory activity against GSK3β in the low-micromolar and sub-micromolar range, respectively. These results demonstrated the reliability of our virtual screening approach. Moreover, docking and molecular dynamics simulation studies were employed for predicting reliable binding modes for G1 and G4, which represent two valuable starting points for future hit-to-lead and lead optimization studies.
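The abstract does not specify how the models were combined, so the following is only one possible reading of a consensus approach: a compound is kept when a minimum number of independent classifiers call it active. The three lambda "models", their descriptor names, and their thresholds are arbitrary placeholders rather than trained GSK3β classifiers.

```python
def consensus_active(models, descriptors, min_votes):
    """Return True when at least `min_votes` models call the compound active."""
    votes = sum(1 for model in models if model(descriptors))
    return votes >= min_votes


# Hypothetical stand-ins for trained classifiers (each returns True or False).
models = [
    lambda d: d["logp"] < 5.0,
    lambda d: d["mol_weight"] < 500.0,
    lambda d: d["h_bond_donors"] <= 5,
]

# Hypothetical screening library with made-up descriptor values.
library = {
    "cmpd_1": {"logp": 2.1, "mol_weight": 320.0, "h_bond_donors": 2},
    "cmpd_2": {"logp": 6.3, "mol_weight": 610.0, "h_bond_donors": 1},
}

hits = [name for name, d in library.items() if consensus_active(models, d, min_votes=2)]
print(hits)
```

Raising `min_votes` trades recall for precision, which matters when millions of compounds are screened and only a few can be purchased and tested.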
Affiliation(s)
- Salvatore Galati
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Miriana Di Stefano
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
  - Department of Life Sciences, University of Siena, 53100 Siena, Italy
- Simone Bertini
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Carlotta Granchi
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Antonio Giordano
  - Sbarro Institute for Cancer Research and Molecular Medicine Center for Biotechnology, College of Science and Technology, Temple University, Philadelphia, PA 19122, USA
  - Department of Medical Biotechnologies, University of Siena, 53100 Siena, Italy
- Francesca Gado
  - Department of Pharmaceutical Sciences, University of Milan, 20133 Milan, Italy
- Marco Macchia
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Tiziano Tuccinardi
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Giulio Poli
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
|
13
|
Janela T, Bajorath J. Anatomy of Potency Predictions Focusing on Structural Analogues with Increasing Potency Differences Including Activity Cliffs. J Chem Inf Model 2023; 63:7032-7044. [PMID: 37943257 DOI: 10.1021/acs.jcim.3c01530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2023]
Abstract
Potency predictions are popular in compound design and optimization but are complicated by intrinsic limitations. Moreover, even for nonlinear methods, activity cliffs (ACs, formed by structural analogues with large potency differences) represent challenging test cases for compound potency predictions. We have devised a new test system for potency predictions, including AC compounds, that is based on partitioned matched molecular pairs (MMP) and makes it possible to monitor prediction accuracy at the level of analogue pairs with increasing potency differences. The results of systematic predictions using different machine learning and control methods on MMP-based data sets revealed increasing prediction errors when potency differences between corresponding training and test compounds increased, including large prediction errors for AC compounds. At the global level, these prediction errors were not apparent due to the statistical dominance of analogue pairs with small potency differences. Test compounds from such pairs were accurately predicted and determined the observed global prediction accuracy. Shapley value analysis, an explainable artificial intelligence approach, was applied to identify structural features determining potency predictions using different methods. The analysis revealed that numerical predictions of different regression models were determined by features that were shared by MMP partner compounds or absent in these compounds, with opposing effects. These findings provided another rationale for accurate predictions of similar potency values for structural analogues and failures in predicting the potency of AC compounds.
Affiliation(s)
- Tiago Janela
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
  - Lamarr Institute for Machine Learning and Artificial Intelligence, Rheinische Friedrich-Wilhelms-Universität Bonn, Friedrich-Hirzebruch-Allee 5/6, D-53115 Bonn, Germany
|
14
|
Mastropietro A, Feldmann C, Bajorath J. Calculation of exact Shapley values for explaining support vector machine models using the radial basis function kernel. Sci Rep 2023; 13:19561. [PMID: 37949930 PMCID: PMC10638308 DOI: 10.1038/s41598-023-46930-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 06/01/2023] [Accepted: 11/07/2023] [Indexed: 11/12/2023] Open
Abstract
Machine learning (ML) algorithms are extensively used in pharmaceutical research. Most ML models have black-box character, thus preventing the interpretation of predictions. However, rationalizing model decisions is of critical importance if predictions should aid in experimental design. Accordingly, in interdisciplinary research, there is growing interest in explaining ML models. Methods devised for this purpose are a part of the explainable artificial intelligence (XAI) spectrum of approaches. In XAI, the Shapley value concept originating from cooperative game theory has become popular for identifying features determining predictions. The Shapley value concept has been adapted as a model-agnostic approach for explaining predictions. Since the computational time required for Shapley value calculations scales exponentially with the number of features used, local approximations such as Shapley additive explanations (SHAP) are usually required in ML. The support vector machine (SVM) algorithm is one of the most popular ML methods in pharmaceutical research and beyond. SVM models are often explained using SHAP. However, there is only limited correlation between SHAP and exact Shapley values, as previously demonstrated for SVM calculations using the Tanimoto kernel, which limits SVM model explanation. Since the Tanimoto kernel is a special kernel function mostly applied for assessing chemical similarity, we have developed the Shapley value-expressed radial basis function (SVERAD), a computationally efficient approach for the calculation of exact Shapley values for SVM models based upon radial basis function kernels that are widely applied in different areas. SVERAD is shown to produce meaningful explanations of SVM predictions.
Affiliation(s)
- Andrea Mastropietro
  - Department of Computer, Control and Management Engineering "Antonio Ruberti", Sapienza University of Rome, 00185, Rome, Italy
- Christian Feldmann
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
|
15
|
Janela T, Bajorath J. Rationalizing general limitations in assessing and comparing methods for compound potency prediction. Sci Rep 2023; 13:17816. [PMID: 37857835 PMCID: PMC10587074 DOI: 10.1038/s41598-023-45086-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 10/16/2023] [Indexed: 10/21/2023] Open
Abstract
Compound potency predictions play a major role in computational drug discovery. Predictive methods are typically evaluated and compared in benchmark calculations that are widely applied. Previous studies have revealed intrinsic limitations of potency prediction benchmarks including very similar performance of increasingly complex machine learning methods and simple controls and narrow error margins separating machine learning from randomized predictions. However, origins of these limitations are currently unknown. We have carried out an in-depth analysis of potential reasons leading to artificial outcomes of potency predictions using different methods. Potency predictions on activity classes typically used in benchmark settings were found to be determined by compounds with intermediate potency close to median values of the compound data sets. The potency of these compounds was consistently predicted with high accuracy, without the need for learning, which dominated the results of benchmark calculations, regardless of the activity classes used. Taken together, our findings provide a clear rationale for general limitations of compound potency benchmark predictions and a basis for the design of alternative test systems for methodological comparisons.
Affiliation(s)
- Tiago Janela
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
|
16
|
Siemers FM, Bajorath J. Differences in learning characteristics between support vector machine and random forest models for compound classification revealed by Shapley value analysis. Sci Rep 2023; 13:5983. [PMID: 37045972 PMCID: PMC10097675 DOI: 10.1038/s41598-023-33215-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Accepted: 04/09/2023] [Indexed: 04/14/2023] Open
Abstract
The random forest (RF) and support vector machine (SVM) methods are mainstays in molecular machine learning (ML) and compound property prediction. We have explored in detail how binary classification models derived using these algorithms arrive at their predictions. To these ends, approaches from explainable artificial intelligence (XAI) are applicable such as the Shapley value concept originating from game theory that we adapted and further extended for our analysis. In large-scale activity-based compound classification using models derived from training sets of increasing size, RF and SVM with the Tanimoto kernel produced very similar predictions that could hardly be distinguished. However, Shapley value analysis revealed that their learning characteristics systematically differed and that chemically intuitive explanations of accurate RF and SVM predictions had different origins.
Affiliation(s)
- Friederike Maite Siemers
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
- Jürgen Bajorath
  - B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Department of Life Science Informatics and Data Science, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115, Bonn, Germany
|
17
|
Janela T, Bajorath J. Large-Scale Predictions of Compound Potency with Original and Modified Activity Classes Reveal General Prediction Characteristics and Intrinsic Limitations of Conventional Benchmarking Calculations. Pharmaceuticals (Basel) 2023; 16:ph16040530. [PMID: 37111287 PMCID: PMC10143224 DOI: 10.3390/ph16040530] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/17/2023] [Revised: 03/27/2023] [Accepted: 03/31/2023] [Indexed: 04/05/2023] Open
Abstract
Predicting compound potency is a major task in computational medicinal chemistry, for which machine learning is often applied. This study systematically predicted compound potency values for 367 target-based compound activity classes from medicinal chemistry using a preferred machine learning approach and simple control methods. The predictions produced unexpectedly similar results for different classes and comparably high accuracy for machine learning and simple control models. Based on these findings, the influence of different data set modifications on relative prediction accuracies was explored, including potency range balancing, removal of nearest neighbors, and analog series-based compound partitioning. The predictions were surprisingly resistant to these modifications, leading to only small error margin increases. These findings also show that conventional benchmark settings are unsuitable for directly comparing potency prediction methods.
Affiliation(s)
- Tiago Janela
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics and Data Science, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
|
18
|
Josephs N, Lin L, Rosenberg S, Kolaczyk ED. Bayesian classification, anomaly detection, and survival analysis using network inputs with application to the microbiome. Ann Appl Stat 2023. [DOI: 10.1214/22-aoas1623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Affiliation(s)
- Lizhen Lin
  - Department of Applied and Computational Mathematics and Statistics, The University of Notre Dame
|
19
|
Predicting Potent Compounds Using a Conditional Variational Autoencoder Based upon a New Structure-Potency Fingerprint. Biomolecules 2023; 13:biom13020393. [PMID: 36830761 PMCID: PMC9953226 DOI: 10.3390/biom13020393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 02/07/2023] [Accepted: 02/16/2023] [Indexed: 02/22/2023] Open
Abstract
Prediction of the potency of bioactive compounds generally relies on linear or nonlinear quantitative structure-activity relationship (QSAR) models. Nonlinear models are generated using machine learning methods. We introduce a novel approach for potency prediction that depends on a newly designed molecular fingerprint (FP) representation. This structure-potency fingerprint (SPFP) combines different modules accounting for the structural features of active compounds and their potency values in a single bit string, hence unifying structure and potency representation. This encoding enables the derivation of a conditional variational autoencoder (CVAE) using SPFPs of training compounds and the application of the model to predict the SPFP potency module of test compounds using only their structure module as input. The SPFP-CVAE approach correctly predicts the potency values of compounds belonging to different activity classes with an accuracy comparable to support vector regression (SVR), representing the state-of-the-art in the field. In addition, highly potent compounds are predicted with accuracy very similar to that of SVR and deep neural networks.
|
20
|
Lungu CN, Mangalagiu V, Mangalagiu II, Mehedinti MC. Benzoquinoline Chemical Space: A Helpful Approach in Antibacterial and Anticancer Drug Design. Molecules 2023; 28:molecules28031069. [PMID: 36770739 PMCID: PMC9921191 DOI: 10.3390/molecules28031069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 01/09/2023] [Accepted: 01/16/2023] [Indexed: 01/24/2023] Open
Abstract
Benzoquinolines are used in many drug design projects as starting molecules subject to derivatization. This computational study aims to characterize the benzoquinone drug space to ease future drug design processes based on these molecules. The drug space is composed of all benzoquinones, which are active on topoisomerase II and ATP synthase. Topological, chemical, and bioactivity spaces are explored using computational methodologies based on virtual screening and scaffold hopping and molecular docking, respectively. Topological space is a geometrical space in which the elements composing it can be defined as a set of neighbors (which satisfy a particular axiom). In such space, a chemical space can be defined as the property space spanned by all possible molecules and chemical compounds adhering to a given set of construction principles and boundary conditions. In this chemical space, the potentially pharmacologically active molecules form the bioactivity space. Results show a poly-morphological chemical space that suggests distinct characteristics. The chemical space is correlated with properties such as steric energy, the number of hydrogen bonds, the presence of halogen atoms, and membrane permeability-related properties. Lastly, novel chemical compounds (such as oxadiazole methylbenzamide and fluoro methylcyclohexane diene) with drug-like potential, active on TOPO II and ATP synthase, have been identified.
Affiliation(s)
- Claudiu N. Lungu
  - Department of Surgery, Emergency Country Clinical Hospital, 800010 Galati, Romania
  - Faculty of Chemistry, Alexandru Ioan Cuza University of Iasi, 11 Carol 1st Bvd, 700506 Iasi, Romania
  - Department of Morphological and Functional Science, University of Medicine and Pharmacy, Dunarea de Jos, 800017 Galati, Romania
  - Correspondence: (C.N.L.); (I.I.M.)
- Violeta Mangalagiu
  - Faculty of Chemistry, Alexandru Ioan Cuza University of Iasi, 11 Carol 1st Bvd, 700506 Iasi, Romania
  - Faculty of Food Engineering, Stefan cel Mare University of Suceava, 13 Universitatii Str., 720229 Suceava, Romania
- Ionel I. Mangalagiu
  - Faculty of Chemistry, Alexandru Ioan Cuza University of Iasi, 11 Carol 1st Bvd, 700506 Iasi, Romania
  - Institute of Interdisciplinary Research-CERNESIM Centre, Alexandru Ioan Cuza University of Iasi, 11 Carol I, 700506 Iasi, Romania
  - Correspondence: (C.N.L.); (I.I.M.)
- Mihaela C. Mehedinti
  - Faculty of Chemistry, Alexandru Ioan Cuza University of Iasi, 11 Carol 1st Bvd, 700506 Iasi, Romania
  - Department of Morphological and Functional Science, University of Medicine and Pharmacy, Dunarea de Jos, 800017 Galati, Romania
|
21
|
Tamura S, Miyao T, Bajorath J. Large-scale prediction of activity cliffs using machine and deep learning methods of increasing complexity. J Cheminform 2023; 15:4. [PMID: 36611204 PMCID: PMC9825040 DOI: 10.1186/s13321-022-00676-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2022] [Accepted: 12/23/2022] [Indexed: 01/09/2023] Open
Abstract
Activity cliffs (AC) are formed by pairs of structural analogues that are active against the same target but have a large difference in potency. While much of our knowledge about ACs has originated from the analysis and comparison of compounds and activity data, several studies have reported AC predictions over the past decade. Different from typical compound classification tasks, AC predictions must be carried out at the level of compound pairs representing ACs or nonACs. Most AC predictions reported so far have focused on individual methods or comparisons of two or three approaches and only investigated a few compound activity classes (from 2 to 10). Although promising prediction accuracy has been reported in most cases, different system set-ups, AC definitions, methods, and calculation conditions were used, precluding direct comparisons of these studies. Therefore, we have carried out a large-scale AC prediction campaign across 100 activity classes comparing machine learning methods of greatly varying complexity, ranging from pair-based nearest neighbor classifiers and decision tree or kernel methods to deep neural networks. The results of our systematic predictions revealed the level of accuracy that can be expected for AC predictions across many different compound classes. In addition, prediction accuracy did not scale with methodological complexity but was significantly influenced by memorization of compounds shared by different ACs or nonACs. In many instances, limited training data were sufficient for building accurate models using different methods and there was no detectable advantage of deep learning over simpler approaches for AC prediction. On a global scale, support vector machine models performed best, by only small margins compared to others including simple nearest neighbor classifiers.
Affiliation(s)
- Shunsuke Tamura
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192 Japan
- Tomoyuki Miyao
  - Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192 Japan
  - Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192 Japan
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn, Germany
|
22
|
Detecting the modality of a medical image using visual and textual features. Biomed Signal Process Control 2023. [DOI: 10.1016/j.bspc.2022.104035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2022]
|
23
|
Sundin I, Voronov A, Xiao H, Papadopoulos K, Bjerrum EJ, Heinonen M, Patronov A, Kaski S, Engkvist O. Human-in-the-loop assisted de novo molecular design. J Cheminform 2022; 14:86. [PMID: 36578043 PMCID: PMC9795720 DOI: 10.1186/s13321-022-00667-8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/30/2022] [Accepted: 12/03/2022] [Indexed: 12/29/2022] Open
Abstract
A de novo molecular design workflow can be used together with technologies such as reinforcement learning to navigate the chemical space. A bottleneck in the workflow that remains to be solved is how to integrate human feedback in the exploration of the chemical space to optimize molecules. A human drug designer still needs to design the goal, expressed as a scoring function for the molecules that captures the designer's implicit knowledge about the optimization task. Little support for this task exists and, consequently, a chemist usually resorts to iteratively building the objective function of multi-parameter optimization (MPO) in de novo design. We propose a principled approach to use human-in-the-loop machine learning to help the chemist to adapt the MPO scoring function to better match their goal. An advantage is that the method can learn the scoring function directly from the user's feedback while they browse the output of the molecule generator, instead of the current manual tuning of the scoring function with trial and error. The proposed method uses a probabilistic model that captures the user's idea and uncertainty about the scoring function, and it uses active learning to interact with the user. We present two case studies for this: In the first use-case, the parameters of an MPO are learned, and in the second use-case a non-parametric component of the scoring function to capture human domain knowledge is developed. The results show the effectiveness of the methods in two simulated example cases with an oracle, achieving significant improvement in less than 200 feedback queries, for the goals of a high QED score and identifying potent molecules for the DRD2 receptor, respectively. We further demonstrate the performance gains with a medicinal chemist interacting with the system.
Affiliation(s)
- Iiris Sundin
  - Department of Computer Science, Aalto University, Espoo, Finland
- Alexey Voronov
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
- Haoping Xiao
  - Department of Computer Science, Aalto University, Espoo, Finland
- Kostas Papadopoulos
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
  - Present Address: Odyssey Therapeutics, Cambridge, MA USA
- Esben Jannik Bjerrum
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
  - Present Address: Odyssey Therapeutics, Cambridge, MA USA
- Markus Heinonen
  - Department of Computer Science, Aalto University, Espoo, Finland
- Atanas Patronov
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
  - Present Address: Odyssey Therapeutics, Cambridge, MA USA
- Samuel Kaski
  - Department of Computer Science, Aalto University, Espoo, Finland
  - Department of Computer Science, University of Manchester, Manchester, UK
- Ola Engkvist
  - Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden
  - Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden
|
24
|
Joint structural annotation of small molecules using liquid chromatography retention order and tandem mass spectrometry data. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00577-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/24/2022]
Abstract
Structural annotation of small molecules in biological samples remains a key bottleneck in untargeted metabolomics, despite rapid progress in predictive methods and tools during the past decade. Liquid chromatography–tandem mass spectrometry, one of the most widely used analysis platforms, can detect thousands of molecules in a sample, the vast majority of which remain unidentified even with best-of-class methods. Here we present LC-MS2Struct, a machine learning framework for structural annotation of small-molecule data arising from liquid chromatography–tandem mass spectrometry (LC-MS2) measurements. LC-MS2Struct jointly predicts the annotations for a set of mass spectrometry features in a sample, using a novel structured prediction model trained to optimally combine the output of state-of-the-art MS2 scorers and observed retention orders. We evaluate our method on a dataset covering all publicly available reversed-phase LC-MS2 data in the MassBank reference database, including 4,327 molecules measured using 18 different LC conditions from 16 contributors, greatly expanding the chemical analytical space covered in previous multi-MS2 scorer evaluations. LC-MS2Struct obtains significantly higher annotation accuracy than earlier methods and improves the annotation accuracy of state-of-the-art MS2 scorers by up to 106%. The use of stereochemistry-aware molecular fingerprints improves prediction performance, which highlights limitations in existing approaches and has strong implications for future computational LC-MS2 developments.
|
25
|
Simple nearest-neighbour analysis meets the accuracy of compound potency predictions using complex machine learning models. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00581-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/15/2022]
|
26
|
Griffiths RR, Greenfield JL, Thawani AR, Jamasb AR, Moss HB, Bourached A, Jones P, McCorkindale W, Aldrick AA, Fuchter MJ, Lee AA. Data-driven discovery of molecular photoswitches with multioutput Gaussian processes. Chem Sci 2022; 13:13541-13551. [PMID: 36507171 PMCID: PMC9682911 DOI: 10.1039/d2sc04306h] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/12/2022] [Accepted: 09/16/2022] [Indexed: 11/11/2022] Open
Abstract
Photoswitchable molecules display two or more isomeric forms that may be accessed using light. Separating the electronic absorption bands of these isomers is key to selectively addressing a specific isomer and achieving high photostationary states, whilst overall red-shifting of the absorption bands serves to limit material damage due to UV exposure and increases penetration depth in photopharmacological applications. Engineering these properties into a system through synthetic design, however, remains a challenge. Here, we present a data-driven discovery pipeline for molecular photoswitches underpinned by dataset curation and multitask learning with Gaussian processes. In the prediction of electronic transition wavelengths, we demonstrate that a multioutput Gaussian process (MOGP) trained using labels from four photoswitch transition wavelengths yields the strongest predictive performance relative to single-task models as well as operationally outperforming time-dependent density functional theory (TD-DFT) in terms of the wall-clock time for prediction. We validate our proposed approach experimentally by screening a library of commercially available photoswitchable molecules. Through this screen, we identified several motifs that displayed separated electronic absorption bands of their isomers, exhibited red-shifted absorptions, and are suited for information transfer and photopharmacological applications. Our curated dataset, code, as well as all models are made available at https://github.com/Ryan-Rhys/The-Photoswitch-Dataset.
Affiliation(s)
- Ryan-Rhys Griffiths
  - The Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK
- Jake L Greenfield
  - Molecular Sciences Research Hub, Department of Chemistry, Imperial College London, London W12 0BZ, UK
  - Center for Nanosystems Chemistry (CNC), Institut für Organische Chemie, Universität Würzburg, Würzburg 97074, Germany
- Aditya R Thawani
  - Molecular Sciences Research Hub, Department of Chemistry, Imperial College London, London W12 0BZ, UK
- Arian R Jamasb
  - The Computer Laboratory, University of Cambridge, Cambridge CB3 0FD, UK
- Anthony Bourached
  - The Institute of Neurology, Department of Neurology, University College London, London WC1N 3BG, UK
- Penelope Jones
  - The Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK
- William McCorkindale
  - The Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK
- Alexander A Aldrick
  - The Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK
- Matthew J Fuchter
  - Molecular Sciences Research Hub, Department of Chemistry, Imperial College London, London W12 0BZ, UK
- Alpha A Lee
  - The Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, UK
| |
|
27
|
Machine Learning-Based Virtual Screening for the Identification of Cdk5 Inhibitors. Int J Mol Sci 2022; 23:10653. [PMID: 36142566 PMCID: PMC9502400 DOI: 10.3390/ijms231810653] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/26/2022] [Revised: 09/07/2022] [Accepted: 09/09/2022] [Indexed: 12/04/2022]
Abstract
Cyclin-dependent kinase 5 (Cdk5) is an atypical proline-directed serine/threonine protein kinase well-characterized for its role in the central nervous system rather than in the cell cycle. Indeed, its dysregulation has been strongly implicated in the progression of synaptic dysfunction and neurodegenerative diseases, such as Alzheimer’s disease (AD) and Parkinson’s disease (PD), and also in the development and progression of a variety of cancers. For this reason, Cdk5 is considered as a promising target for drug design, and the discovery of novel small-molecule Cdk5 inhibitors is of great interest in the medicinal chemistry field. In this context, we employed a machine learning-based virtual screening protocol with subsequent molecular docking, molecular dynamics simulations and binding free energy evaluations. Our virtual screening studies resulted in the identification of two novel Cdk5 inhibitors, highlighting an experimental hit rate of 50% and thus validating the reliability of the in silico workflow. Both identified ligands, compounds CPD1 and CPD4, showed a promising enzyme inhibitory activity and CPD1 also demonstrated a remarkable antiproliferative activity in ovarian and colon cancer cells. These ligands represent a valuable starting point for structure-based hit-optimization studies aimed at identifying new potent Cdk5 inhibitors.
|
28
|
Baranwal M, Magner A, Saldinger J, Turali-Emre ES, Elvati P, Kozarekar S, VanEpps JS, Kotov NA, Violi A, Hero AO. Struct2Graph: a graph attention network for structure based predictions of protein-protein interactions. BMC Bioinformatics 2022; 23:370. [PMID: 36088285 PMCID: PMC9464414 DOI: 10.1186/s12859-022-04910-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/09/2022] [Accepted: 08/26/2022] [Indexed: 12/03/2022]
Abstract
BACKGROUND: Development of new methods for analysis of protein-protein interactions (PPIs) at molecular and nanometer scales gives insights into intracellular signaling pathways and will improve understanding of protein functions, as well as other nanoscale structures of biological and abiological origins. Recent advances in computational tools, particularly the ones involving modern deep learning algorithms, have been shown to complement experimental approaches for describing and rationalizing PPIs. However, most of the existing works on PPI predictions use protein-sequence information, and thus have difficulties in accounting for the three-dimensional organization of the protein chains.
RESULTS: In this study, we address this problem and describe a PPI analysis based on a graph attention network, named Struct2Graph, for identifying PPIs directly from the structural data of folded protein globules. Our method is capable of predicting the PPI with an accuracy of 98.89% on the balanced set consisting of an equal number of positive and negative pairs. On the unbalanced set with the ratio of 1:10 between positive and negative pairs, Struct2Graph achieves a fivefold cross validation average accuracy of 99.42%. Moreover, Struct2Graph can potentially identify residues that likely contribute to the formation of the protein-protein complex. The identification of important residues is tested for two different interaction types: (a) proteins with multiple ligands competing for the same binding area, (b) dynamic protein-protein adhesion interaction. Struct2Graph identifies interacting residues with 30% sensitivity, 89% specificity, and 87% accuracy.
CONCLUSIONS: In this manuscript, we address the problem of prediction of PPIs using a first-of-its-kind, 3D-structure-based graph attention network (code available at https://github.com/baranwa2/Struct2Graph). Furthermore, the novel mutual attention mechanism provides insights into likely interaction sites through its unsupervised knowledge selection process. This study demonstrates that a relatively low-dimensional feature embedding learned from graph structures of individual proteins outperforms other modern machine learning classifiers based on global protein features. In addition, through the analysis of single amino acid variations, the attention mechanism shows preference for disease-causing residue variations over benign polymorphisms, demonstrating that it is not limited to interface residues.
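The three residue-level figures quoted in this abstract (30% sensitivity, 89% specificity, 87% accuracy) can be related to one another with a short calculation. The following is only a sketch: it assumes all three values were measured on the same set of residues and ignores rounding, and the function names are illustrative rather than taken from the Struct2Graph code.

```python
def confusion_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    positives = tp + fn
    negatives = tn + fp
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "accuracy": (tp + tn) / (positives + negatives),
    }


def implied_positive_fraction(sensitivity: float, specificity: float,
                              accuracy: float) -> float:
    """Fraction p of positive examples implied by the three metrics.

    Uses accuracy = p * sensitivity + (1 - p) * specificity, solved for p.
    Assumption: all three metrics refer to the same evaluation set.
    """
    if sensitivity == specificity:
        raise ValueError("p is undetermined when sensitivity == specificity")
    return (specificity - accuracy) / (specificity - sensitivity)


# Values as quoted in the abstract (rounded percentages).
p = implied_positive_fraction(0.30, 0.89, 0.87)
print(f"implied fraction of interacting residues: {p:.3f}")
```

Under these assumptions the implied fraction of interacting residues is small (about 0.034), which would explain how accuracy can stay close to specificity even when sensitivity is low.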
Affiliation(s)
- Mayank Baranwal
- Division of Data and Decision Sciences, Tata Consultancy Services Research, Mumbai, India
- Systems and Control Engineering Group, Indian Institute of Technology, Bombay, India
| | - Abram Magner
- Department of Computer Science, University of Albany, SUNY, Albany, USA
| | - Jacob Saldinger
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
| | | | - Paolo Elvati
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, USA
| | - Shivani Kozarekar
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
| | - J. Scott VanEpps
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Department of Emergency Medicine, University of Michigan, Ann Arbor, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, USA
| | - Nicholas A. Kotov
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Biointerfaces Institute, University of Michigan, Ann Arbor, USA
- Department of Materials Science and Engineering, University of Michigan, Ann Arbor, USA
| | - Angela Violi
- Department of Chemical Engineering, University of Michigan, Ann Arbor, USA
- Department of Mechanical Engineering, University of Michigan, Ann Arbor, USA
- Biophysics Program, University of Michigan, Ann Arbor, USA
| | - Alfred O. Hero
- Department of Biomedical Engineering, University of Michigan, Ann Arbor, USA
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
- Department of Statistics, University of Michigan, Ann Arbor, USA
- Program in Applied Interdisciplinary Mathematics, University of Michigan, Ann Arbor, USA
- Program in Bioinformatics, University of Michigan, Ann Arbor, USA
| |
|
29
|
Wang Z, Cao Q, Shen H, Xu B, Cen K, Cheng X. Location-aware convolutional neural networks for graph classification. Neural Netw 2022; 155:74-83. [PMID: 36041282 DOI: 10.1016/j.neunet.2022.07.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2022] [Revised: 06/06/2022] [Accepted: 07/30/2022] [Indexed: 11/25/2022]
Abstract
Graph patterns play a critical role in various graph classification tasks, e.g., chemical patterns often determine the properties of molecular graphs. Researchers devote themselves to adapting Convolutional Neural Networks (CNNs) to graph classification due to their powerful capability in pattern learning. The varying numbers of neighbor nodes and the lack of canonical order of nodes on graphs pose challenges in constructing receptive fields for CNNs. Existing methods generally follow a heuristic ranking-based framework, which constructs receptive fields by selecting a fixed number of nodes and dropping the others according to predetermined rules. However, such methods may lose important structure information through dropping nodes, and they also cannot learn task-oriented graph patterns. In this paper, we propose a Location learning-based Convolutional Neural Network (LCNN) for graph classification. LCNN constructs receptive fields by learning the location of each node according to its embedding, which contains structure and feature information; standard CNNs are then applied to capture graph patterns. Such a location learning mechanism not only retains the information of all nodes, but also provides the ability for task-oriented pattern learning. Experimental results show the effectiveness of the proposed LCNN, and visualization results further illustrate the valid pattern learning ability of our method for graph classification.
Affiliation(s)
- Zhaohui Wang
- Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China.
| | - Qi Cao
- Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China.
| | - Huawei Shen
- Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Beijing Academy of Artificial Intelligence, China.
| | - Bingbing Xu
- Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China.
| | - Keting Cen
- Data Intelligence System Research Center, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China.
| | - Xueqi Cheng
- CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China.
| |
|
30
|
García-Ortegón M, Simm GNC, Tripp AJ, Hernández-Lobato JM, Bender A, Bacallado S. DOCKSTRING: Easy Molecular Docking Yields Better Benchmarks for Ligand Design. J Chem Inf Model 2022; 62:3486-3502. [PMID: 35849793 PMCID: PMC9364321 DOI: 10.1021/acs.jcim.1c01334] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 11/03/2021] [Indexed: 01/05/2023]
Abstract
The field of machine learning for drug discovery is witnessing an explosion of novel methods. These methods are often benchmarked on simple physicochemical properties such as solubility or general druglikeness, which can be readily computed. However, these properties are poor representatives of objective functions in drug design, mainly because they do not depend on the candidate compound's interaction with the target. By contrast, molecular docking is a widely applied method in drug discovery to estimate binding affinities. However, docking studies require a significant amount of domain knowledge to set up correctly, which hampers adoption. Here, we present dockstring, a bundle for meaningful and robust comparison of ML models using docking scores. dockstring consists of three components: (1) an open-source Python package for straightforward computation of docking scores, (2) an extensive dataset of docking scores and poses of more than 260,000 molecules for 58 medically relevant targets, and (3) a set of pharmaceutically relevant benchmark tasks such as virtual screening or de novo design of selective kinase inhibitors. The Python package implements a robust ligand and target preparation protocol that allows nonexperts to obtain meaningful docking scores. Our dataset is the first to include docking poses, as well as the first of its size that is a full matrix, thus facilitating experiments in multiobjective optimization and transfer learning. Overall, our results indicate that docking scores are a more realistic evaluation objective than simple physicochemical properties, yielding benchmark tasks that are more challenging and more closely related to real problems in drug discovery.
Affiliation(s)
- Miguel García-Ortegón
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
| | - Gregor N. C. Simm
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | - Austin J. Tripp
- Department of Engineering, University of Cambridge, Trumpington St., Cambridge CB2 1PZ, United Kingdom
| | | | - Andreas Bender
- Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield Rd., Cambridge CB2 1EW, United Kingdom
| | - Sergio Bacallado
- Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge, Wilberforce Rd., Cambridge CB3 0WB, United Kingdom
| |
|
31
|
Asahara R, Miyao T. Extended Connectivity Fingerprints as a Chemical Reaction Representation for Enantioselective Organophosphorus-Catalyzed Asymmetric Reaction Prediction. ACS Omega 2022; 7:26952-26964. [PMID: 35936487 PMCID: PMC9352214 DOI: 10.1021/acsomega.2c03812] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/18/2022] [Accepted: 07/07/2022] [Indexed: 06/15/2023]
Abstract
Predicting the outcomes of organic reactions using data-driven approaches aids in the acceleration of research. In laboratory-scale experiments, only a small number of reaction data can be accessed for machine learning model construction, where reaction representations play a pivotal role in the success of model construction. Nevertheless, representation comparison for a small data set is not adequate. Herein, focusing on the enantioselectivity of phosphoric-acid-catalyzed reactions, various two-dimensional and three-dimensional reaction representations (descriptors) were compared. Overall, the concatenated form of the extended connectivity fingerprints showed the best predictive capability for the two types of data sets: high-throughput experimental data and manually collected literature data sets. Furthermore, highlighting the substructure contribution to the prediction outcome was shown to be informative for guiding catalyst development.
Affiliation(s)
- Ryosuke Asahara
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| | - Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
| |
|
32
|
Feldmann C, Bajorath J. Calculation of Exact Shapley Values for Support Vector Machines with Tanimoto Kernel Enables Model Interpretation. iScience 2022; 25:105023. [PMID: 36105596 PMCID: PMC9464958 DOI: 10.1016/j.isci.2022.105023] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/03/2022] [Revised: 08/09/2022] [Accepted: 08/20/2022] [Indexed: 11/24/2022]
Abstract
The support vector machine (SVM) algorithm is popular in chemistry and drug discovery. SVM models have black-box character. Their predictions can be interpreted through feature weighting or the model-agnostic Shapley additive explanations (SHAP) formalism that locally approximates Shapley values (SVs) originating from game theory. We introduce an algorithm termed SV-expressed Tanimoto similarity (SVETA) for the exact calculation of SVs to explain SVM models employing the Tanimoto kernel, the gold standard for the assessment of molecular similarity. For a model system, the exact calculation of SVs is demonstrated. In an SVM-based compound classification task from drug discovery, only a limited correlation between exact SV and SHAP values is observed, prohibiting the use of approximate values for rationalizing predictions. For exemplary test compounds, atom-based mapping of prioritized features delineates coherent substructures that closely resemble those obtained by analyzing independently derived random forest models, thus providing consistent explanations.
- SVETA: new methodology for explaining support vector machine (SVM) predictions
- Tanimoto similarity-based SVM models are popular in chemistry
- SVETA enables the calculation of exact Shapley values for rationalizing SVM models
- SVETA-based feature mapping provides intuitive explanations of SVM decisions
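For orientation, the Tanimoto kernel that these SVM models rely on can be sketched in a few lines. This is a minimal illustration only, not the SVETA algorithm: fingerprints are represented here as sets of "on" feature indices, the function names are hypothetical, and returning 1.0 for two empty fingerprints is a convention chosen for this sketch.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity: shared features / features present in either."""
    if not a and not b:
        return 1.0  # assumption: two empty fingerprints count as identical
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def gram_matrix(fps: Sequence[FrozenSet[int]]) -> List[List[float]]:
    """Pairwise Tanimoto kernel (Gram) matrix for a list of fingerprints."""
    return [[tanimoto(x, y) for y in fps] for x in fps]


# Toy fingerprints; the indices are arbitrary and carry no chemical meaning.
fps = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({7})]
K = gram_matrix(fps)
for row in K:
    print(row)
```

A matrix of this kind could in principle be handed to any SVM implementation that accepts a precomputed kernel; computing exact Shapley values on top of it is the contribution of the paper and is not attempted here.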
|
33
|
Yang P, Henle EA, Fern XZ, Simon CM. Classifying the toxicity of pesticides to honey bees via support vector machines with random walk graph kernels. J Chem Phys 2022; 157:034102. [DOI: 10.1063/5.0090573] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/22/2023]
Abstract
Pesticides benefit agriculture by increasing crop yield, quality, and security. However, pesticides may inadvertently harm bees, which are valuable as pollinators. Thus, candidate pesticides in development pipelines must be assessed for toxicity to bees. Leveraging a dataset of 382 molecules with toxicity labels from honey bee exposure experiments, we train a support vector machine (SVM) to predict the toxicity of pesticides to honey bees. We compare two representations of the pesticide molecules: (i) a random walk feature vector listing counts of length-L walks on the molecular graph with each vertex- and edge-label sequence and (ii) the Molecular ACCess System (MACCS) structural key fingerprint (FP), a bit vector indicating the presence/absence of a list of pre-defined subgraph patterns in the molecular graph. We explicitly construct the MACCS FPs but rely on the fixed-length-L random walk graph kernel (RWGK) in place of the dot product for the random walk representation. The L-RWGK-SVM achieves an accuracy, precision, recall, and F1 score (mean over 2000 runs) of 0.81, 0.68, 0.71, and 0.69, respectively, on the test data set, with L = 4 being the mode optimal walk length. The MACCS-FP-SVM performs on par with or marginally better than the L-RWGK-SVM and lends more interpretability, but varies more in performance. We interpret the MACCS-FP-SVM by illuminating which subgraph patterns in the molecules tend to strongly push them toward the toxic/non-toxic side of the separating hyperplane.
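The random walk representation described in this abstract can be illustrated with a small sketch. It builds the count vector explicitly by enumerating walks, which is feasible only for tiny graphs; the paper instead evaluates the kernel without constructing the vector. The graph encoding (a list of vertex labels plus a dictionary of labeled undirected edges) and the function names are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Graph = Tuple[List[str], Dict[Tuple[int, int], int]]


def walk_label_counts(vertex_labels: List[str],
                      edges: Dict[Tuple[int, int], int],
                      length: int) -> Counter:
    """Count walks with `length` edges, keyed by their label sequence.

    A key alternates vertex and edge labels, e.g. ("C", 1, "O") for L = 1.
    Walks are directed and may revisit vertices.
    """
    neighbours = {i: [] for i in range(len(vertex_labels))}
    for (i, j), bond in edges.items():
        neighbours[i].append((j, bond))
        neighbours[j].append((i, bond))

    counts: Counter = Counter()

    def extend(v: int, seq: tuple, remaining: int) -> None:
        if remaining == 0:
            counts[seq] += 1
            return
        for w, bond in neighbours[v]:
            extend(w, seq + (bond, vertex_labels[w]), remaining - 1)

    for start in neighbours:
        extend(start, (vertex_labels[start],), length)
    return counts


def walk_kernel(g1: Graph, g2: Graph, length: int) -> int:
    """Dot product of the two walk-count vectors (explicit feature map)."""
    c1 = walk_label_counts(g1[0], g1[1], length)
    c2 = walk_label_counts(g2[0], g2[1], length)
    return sum(c1[s] * c2[s] for s in c1.keys() & c2.keys())


# Heavy-atom skeletons with bond order as the edge label (toy inputs).
ethanol: Graph = (["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ethane: Graph = (["C", "C"], {(0, 1): 1})
print(walk_kernel(ethanol, ethanol, 1), walk_kernel(ethanol, ethane, 1))
```

Because each undirected bond is traversed in both directions, the single C-C bond contributes a count of 2 to the ("C", 1, "C") entry in both toy molecules.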
Affiliation(s)
- Ping Yang
- School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon 97331, USA
| | - E. Adrian Henle
- School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon 97331, USA
| | - Xiaoli Z. Fern
- School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon 97331, USA
| | - Cory M. Simon
- School of Chemical, Biological, and Environmental Engineering, Oregon State University, Corvallis, Oregon 97331, USA
| |
|
34
|
Multi-task convolutional neural networks for predicting in vitro clearance endpoints from molecular images. J Comput Aided Mol Des 2022; 36:443-457. [PMID: 35618861 DOI: 10.1007/s10822-022-00458-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2022] [Accepted: 05/04/2022] [Indexed: 10/18/2022]
Abstract
Optimization of compound metabolic stability is a highly topical issue in pharmaceutical research. Accordingly, application of predictive in silico models can potentially reduce the number of design-make-test-analyze iterations and consequently speed up the progression of novel candidate molecules. Herein, we have investigated whether multiple in vitro clearance endpoints could be accurately predicted from image-based molecular representations. To this end, compound measurements for four commonly investigated clearance endpoints were curated from AstraZeneca internal sources, providing a sound basis for building multi-task convolutional neural network models. Application of several increasingly challenging data splitting strategies confirmed that convolutional neural network models were successful at capturing implicit chemical relationships contained in training and test data, similar to what is commonly observed for structural fingerprints. Furthermore, model benchmarking against state-of-the-art machine learning methods, including deep neural networks and graph convolutional neural networks, trained with structure- and graph-based representations, respectively, revealed on-par or increased accuracy of convolutional neural networks, with a clear benefit of multi-task learning across all clearance endpoints. Our findings indicate that image-based molecular representations can be applied to predict multiple clearance endpoints, suggesting a potential follow-up to investigate model interpretability from molecular images.
|
35
|
Janela T, Takeuchi K, Bajorath J. Introducing a Chemically Intuitive Core-Substituent Fingerprint Designed to Explore Structural Requirements for Effective Similarity Searching and Machine Learning. Molecules 2022; 27:2331. [PMID: 35408730 PMCID: PMC9000322 DOI: 10.3390/molecules27072331] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/14/2022] [Revised: 03/29/2022] [Accepted: 04/01/2022] [Indexed: 11/16/2022]
Abstract
Fingerprint (FP) representations of chemical structure continue to be one of the most widely used types of molecular descriptors in chemoinformatics and computational medicinal chemistry. One often distinguishes between two- and three-dimensional (2D and 3D) FPs depending on whether they are derived from molecular graphs or conformations, respectively. Primary application areas for FPs include similarity searching and compound classification via machine learning, especially for hit identification. For these applications, 2D FPs are particularly popular, given their robustness and for the most part comparable (or better) performance to 3D FPs. While a variety of FP prototypes has been designed and evaluated during earlier times of chemoinformatics research, new developments have been rare over the past decade. At least in part, this has been due to the situation that topological (atom environment) FPs derived from molecular graphs have evolved as a gold standard in the field. We were interested in exploring the question of whether the amount of structural information captured by state-of-the-art 2D FPs is indeed required for effective similarity searching and compound classification or whether accounting for fewer structural features might be sufficient. Therefore, pursuing a "structural minimalist" approach, we designed and implemented a new 2D FP based upon ring and substituent fragments obtained by systematically decomposing large numbers of compounds from medicinal chemistry. The resulting FP termed core-substituent FP (CSFP) captures much smaller numbers of structural features than state-of-the-art 2D FPs. However, CSFP achieves high performance in similarity searching and machine learning, demonstrating that less structural information is required for establishing molecular similarity relationships than is often believed. Given its high performance and chemical tangibility, CSFP is also relevant for practical applications in medicinal chemistry.
|
36
|
Ligand-based approaches to activity prediction for the early stage of structure–activity–relationship progression. J Comput Aided Mol Des 2022; 36:237-252. [DOI: 10.1007/s10822-022-00449-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2021] [Accepted: 03/07/2022] [Indexed: 11/27/2022]
|
37
|
Rodríguez-Pérez R, Bajorath J. Evolution of Support Vector Machine and Regression Modeling in Chemoinformatics and Drug Discovery. J Comput Aided Mol Des 2022; 36:355-362. [PMID: 35304657 PMCID: PMC9325859 DOI: 10.1007/s10822-022-00442-9] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 02/08/2022] [Accepted: 02/15/2022] [Indexed: 11/05/2022]
Abstract
The support vector machine (SVM) algorithm is one of the most widely used machine learning (ML) methods for predicting active compounds and molecular properties. In chemoinformatics and drug discovery, SVM has been a state-of-the-art ML approach for more than a decade. A unique attribute of SVM is that it operates in feature spaces of increasing dimensionality. Hence, SVM conceptually departs from the paradigm of low dimensionality that applies to many other methods for chemical space navigation. The SVM approach is applicable to compound classification and ranking, multi-class predictions, and, in algorithmically modified form, regression modeling. In the emerging era of deep learning (DL), SVM retains its relevance as one of the premier ML methods in chemoinformatics, for reasons discussed herein. We describe the SVM methodology including strengths and weaknesses and discuss selected applications that have contributed to the evolution of SVM as a premier approach for compound classification, property predictions, and virtual compound screening.
Affiliation(s)
- Raquel Rodríguez-Pérez
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany; Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
| | - Jürgen Bajorath
- Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115, Bonn, Germany; Novartis Institutes for Biomedical Research, Novartis Campus, CH-4002, Basel, Switzerland
| |
|
38
|
Capecchi A, Reymond JL. Classifying natural products from plants, fungi or bacteria using the COCONUT database and machine learning. J Cheminform 2021; 13:82. [PMID: 34663470 PMCID: PMC8524952 DOI: 10.1186/s13321-021-00559-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 06/25/2021] [Accepted: 10/02/2021] [Indexed: 01/13/2023]
Abstract
Natural products (NPs) represent one of the most important resources for discovering new drugs. Here we asked whether NP origin can be assigned from their molecular structure in a subset of 60,171 NPs in the recently reported Collection of Open Natural Products (COCONUT) database assigned to plants, fungi, or bacteria. Visualizing this subset in an interactive tree-map (TMAP) calculated using MAP4 (MinHashed atom pair fingerprint) clustered NPs according to their assigned origin (https://tm.gdb.tools/map4/coconut_tmap/), and a support vector machine (SVM) trained with MAP4 correctly assigned the origin for 94% of plant, 89% of fungal, and 89% of bacterial NPs in this subset. An online tool based on an SVM trained with the entire subset correctly assigned the origin of further NPs with similar performance (https://np-svm-map4.gdb.tools/). Origin information might be useful when searching for biosynthetic genes of NPs isolated from plants but produced by endophytic microorganisms.
Affiliation(s)
- Alice Capecchi
- Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
| | - Jean-Louis Reymond
- Department of Chemistry, Biochemistry and Pharmaceutical Sciences, University of Bern, Freiestrasse 3, 3012, Bern, Switzerland
| |
|
39
|
Tamura S, Jasial S, Miyao T, Funatsu K. Interpretation of Ligand-Based Activity Cliff Prediction Models Using the Matched Molecular Pair Kernel. Molecules 2021; 26:4916. [PMID: 34443503 PMCID: PMC8401777 DOI: 10.3390/molecules26164916] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 07/16/2021] [Revised: 08/09/2021] [Accepted: 08/10/2021] [Indexed: 11/16/2022]
Abstract
Activity cliffs (ACs) are formed by two structurally similar compounds with a large difference in potency. Accurate AC prediction is expected to help researchers' decisions in the early stages of drug discovery. Previously, predictive models based on matched molecular pair (MMP) cliffs have been proposed. However, the proposed methods face a challenge of interpretability due to the black-box character of the predictive models. In this study, we developed interpretable MMP fingerprints and modified a model-specific interpretation approach for models based on a support vector machine (SVM) and MMP kernel. We compared important features highlighted by this SVM-based interpretation approach and the SHapley Additive exPlanations (SHAP) as a major model-independent approach. The model-specific approach could capture the difference between AC and non-AC, while SHAP assigned high weights to the features not present in the test instances. For specific MMPs, the feature weights mapped by the SVM-based interpretation method were in agreement with the previously confirmed binding knowledge from X-ray co-crystal structures, indicating that this method is able to interpret the AC prediction model in a chemically intuitive manner.
Affiliation(s)
- Shunsuke Tamura
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
| | - Swarit Jasial
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
| | - Tomoyuki Miyao
- Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
| | - Kimito Funatsu
- Data Science Center, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma 630-0192, Japan
- Correspondence: ; Tel.: +81-354-400-396; Fax: +81-743-726-037
| |
|
40
|
Bach E, Rogers S, Williamson J, Rousu J. Probabilistic framework for integration of mass spectrum and retention time information in small molecule identification. Bioinformatics 2021; 37:1724-1731. [PMID: 33244585 PMCID: PMC8289373 DOI: 10.1093/bioinformatics/btaa998] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/18/2020] [Revised: 10/27/2020] [Accepted: 11/17/2020] [Indexed: 11/14/2022]
Abstract
Motivation: Identification of small molecules in a biological sample remains a major bottleneck in molecular biology, despite a decade of rapid development of computational approaches for predicting molecular structures using mass spectrometry (MS) data. Recently, there has been increasing interest in utilizing other information sources, such as liquid chromatography (LC) retention time (RT), to improve identifications solely based on MS information, such as precursor mass-per-charge and tandem mass spectrometry (MS2). Results: We put forward a probabilistic modelling framework to integrate MS and RT data of multiple features in an LC-MS experiment. We model the MS measurements and all pairwise retention order information as a Markov random field and use efficient approximate inference for scoring and ranking potential molecular structures. Our experiments show improved identification accuracy by combining MS2 data and retention orders using our approach, thereby outperforming state-of-the-art methods. Furthermore, we demonstrate the benefit of our model when only a subset of LC-MS features has MS2 measurements available besides MS1. Availability and implementation: Software and data are freely available at https://github.com/aalto-ics-kepaco/msms_rt_score_integration. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Eric Bach
  - Department of Computer Science, School of Science, Aalto University, Espoo, Finland
- Simon Rogers
  - School of Computing Science, University of Glasgow, Glasgow, UK
- John Williamson
  - School of Computing Science, University of Glasgow, Glasgow, UK
- Juho Rousu
  - Department of Computer Science, School of Science, Aalto University, Espoo, Finland
|
41
|
Safizadeh H, Simpkins SW, Nelson J, Li SC, Piotrowski JS, Yoshimura M, Yashiroda Y, Hirano H, Osada H, Yoshida M, Boone C, Myers CL. Improving Measures of Chemical Structural Similarity Using Machine Learning on Chemical-Genetic Interactions. J Chem Inf Model 2021; 61:4156-4172. [PMID: 34318674 PMCID: PMC8479812 DOI: 10.1021/acs.jcim.0c00993] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 12/23/2022]
Abstract
A common strategy for identifying molecules likely to possess a desired biological activity is to search large databases of compounds for high structural similarity to a query molecule that demonstrates this activity, under the assumption that structural similarity is predictive of similar biological activity. However, efforts to systematically benchmark the diverse array of available molecular fingerprints and similarity coefficients have been limited by a lack of large-scale datasets that reflect biological similarities of compounds. To elucidate the relative performance of these alternatives, we systematically benchmarked 11 different molecular fingerprint encodings, each combined with 13 different similarity coefficients, using a large set of chemical–genetic interaction data from the yeast Saccharomyces cerevisiae as a systematic proxy for biological activity. We found that the performance of different molecular fingerprints and similarity coefficients varied substantially and that the all-shortest path fingerprints paired with the Braun-Blanquet similarity coefficient provided superior performance that was robust across several compound collections. We further proposed a machine learning pipeline based on support vector machines that offered a fivefold improvement relative to the best unsupervised approach. Our results generally suggest that using high-dimensional chemical–genetic data as a basis for refining molecular fingerprints can be a powerful approach for improving prediction of biological functions from chemical structures.
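To make the difference between similarity coefficients concrete, here is a minimal sketch comparing the Braun-Blanquet coefficient named in the abstract with the more widely used Tanimoto coefficient. The fingerprints are represented as sets of on-bit indices, and the bit values below are invented for illustration; this does not reproduce the all-shortest path fingerprints or the authors' pipeline.

```python
from typing import AbstractSet


def tanimoto(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def braun_blanquet(a: AbstractSet[int], b: AbstractSet[int]) -> float:
    """Braun-Blanquet coefficient: |A ∩ B| / max(|A|, |B|)."""
    largest = max(len(a), len(b))
    return len(a & b) / largest if largest else 0.0


# Hypothetical on-bit sets for a query and a database compound.
query = {1, 4, 7, 9}
candidate = {1, 4, 7, 12, 15, 20}

print(f"Tanimoto:       {tanimoto(query, candidate):.3f}")
print(f"Braun-Blanquet: {braun_blanquet(query, candidate):.3f}")
```

Both coefficients share the same numerator (the number of common on-bits) and differ only in the normalization, which is why they can rank the same database differently for a given query.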
Affiliation(s)
- Hamid Safizadeh
  - Department of Electrical and Computer Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
  - Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
- Scott W Simpkins
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
- Justin Nelson
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
- Sheena C Li
  - The Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Jeff S Piotrowski
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Mami Yoshimura
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Yoko Yashiroda
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Hiroyuki Hirano
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Hiroyuki Osada
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Minoru Yoshida
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
  - Department of Biotechnology and Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, Bunkyo City, Tokyo 113-8654, Japan
- Charles Boone
  - The Donnelly Centre, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - RIKEN Center for Sustainable Resource Science (CSRS), Wako, Saitama 351-0198, Japan
- Chad L Myers
  - Department of Computer Science and Engineering, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
  - Bioinformatics and Computational Biology Graduate Program, University of Minnesota-Twin Cities, Minneapolis, Minnesota 55455, United States
|
42
|
Dash T, Srinivasan A, Vig L. Incorporating symbolic domain knowledge into graph neural networks. Mach Learn 2021. [DOI: 10.1007/s10994-021-05966-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/29/2022]
|
43
|
Casier B, Chagas da Silva M, Badawi M, Pascale F, Bučko T, Lebègue S, Rocca D. Hybrid localized graph kernel for machine learning energy-related properties of molecules and solids. J Comput Chem 2021; 42:1390-1401. [PMID: 34009668 DOI: 10.1002/jcc.26550] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/15/2021] [Revised: 04/07/2021] [Accepted: 04/21/2021] [Indexed: 11/10/2022]
Abstract
Nowadays, the coupling of electronic structure and machine learning techniques serves as a powerful tool to predict chemical and physical properties of a broad range of systems. With the aim of improving the accuracy of predictions, a large number of representations for molecules and solids for machine learning applications has been developed. In this work we propose a novel descriptor based on the notion of molecular graph. While graphs are largely employed in classification problems in cheminformatics or bioinformatics, they are not often used in regression problems, especially of energy-related properties. Our method is based on a local decomposition of atomic environments and on the hybridization of two kernel functions: a graph kernel contribution that describes the chemical pattern and a Coulomb label contribution that encodes finer details of the local geometry. The accuracy of this new kernel method in energy predictions of molecular and condensed phase systems is demonstrated by considering the popular QM7 and BA10 datasets. These examples show that the hybrid localized graph kernel outperforms traditional approaches such as, for example, the smooth overlap of atomic positions and the Coulomb matrices.
Affiliation(s)
- Bastien Casier
  - Université de Lorraine and CNRS, LPCT, UMR 7019, F-54000 Nancy, France
- Michael Badawi
  - Université de Lorraine and CNRS, LPCT, UMR 7019, F-54000 Nancy, France
- Tomáš Bučko
  - Department of Physical and Theoretical Chemistry, Faculty of Natural Sciences, Comenius University in Bratislava, Bratislava, Slovakia
  - Institute of Inorganic Chemistry, Slovak Academy of Sciences, Bratislava, Slovakia
- Sébastien Lebègue
  - Université de Lorraine and CNRS, LPCT, UMR 7019, F-54000 Nancy, France
- Dario Rocca
  - Université de Lorraine and CNRS, LPCT, UMR 7019, F-54000 Nancy, France
|
44
|
Errica F, Giulini M, Bacciu D, Menichetti R, Micheli A, Potestio R. A Deep Graph Network-Enhanced Sampling Approach to Efficiently Explore the Space of Reduced Representations of Proteins. Front Mol Biosci 2021; 8:637396. [PMID: 33996896 PMCID: PMC8116519 DOI: 10.3389/fmolb.2021.637396] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 12/03/2020] [Accepted: 02/17/2021] [Indexed: 12/12/2022] Open
Abstract
The limits of molecular dynamics (MD) simulations of macromolecules are steadily pushed forward by the relentless development of computer architectures and algorithms. The consequent explosion in the number and extent of MD trajectories induces the need for automated methods to rationalize the raw data and make quantitative sense of them. Recently, an algorithmic approach was introduced by some of us to identify the subset of a protein's atoms, or mapping, that enables the most informative description of the system. This method relies on the computation, for a given reduced representation, of the associated mapping entropy, that is, a measure of the information loss due to such simplification; albeit relatively straightforward, this calculation can be time-consuming. Here, we describe the implementation of a deep learning approach aimed at accelerating the calculation of the mapping entropy. We rely on Deep Graph Networks, which provide extreme flexibility in handling structured input data and whose predictions prove to be accurate and remarkably efficient. The trained network produces a speedup factor as large as 10^5 with respect to the algorithmic computation of the mapping entropy, enabling the reconstruction of its landscape by means of the Wang-Landau sampling scheme. Applications of this method reach much further than this, as the proposed pipeline is easily transferable to the computation of arbitrary properties of a molecular structure.
Affiliation(s)
- Federico Errica
  - Department of Computer Science, University of Pisa, Pisa, Italy
- Marco Giulini
  - Physics Department, University of Trento, Trento, Italy
  - INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, Trento, Italy
- Davide Bacciu
  - Department of Computer Science, University of Pisa, Pisa, Italy
- Roberto Menichetti
  - Physics Department, University of Trento, Trento, Italy
  - INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, Trento, Italy
- Alessio Micheli
  - Department of Computer Science, University of Pisa, Pisa, Italy
- Raffaello Potestio
  - Physics Department, University of Trento, Trento, Italy
  - INFN-TIFPA, Trento Institute for Fundamental Physics and Applications, Trento, Italy
|
45
|
Kunkel C, Margraf JT, Chen K, Oberhofer H, Reuter K. Active discovery of organic semiconductors. Nat Commun 2021; 12:2422. [PMID: 33893287 PMCID: PMC8065160 DOI: 10.1038/s41467-021-22611-4] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 09/29/2020] [Accepted: 03/15/2021] [Indexed: 01/16/2023] Open
Abstract
The versatility of organic molecules generates a rich design space for organic semiconductors (OSCs) considered for electronics applications. Offering unparalleled promise for materials discovery, the vastness of this design space also dictates efficient search strategies. Here, we present an active machine learning (AML) approach that explores an unlimited search space through consecutive application of molecular morphing operations. Evaluating the suitability of OSC candidates on the basis of charge injection and mobility descriptors, the approach successively queries predictive-quality first-principles calculations to build a refining surrogate model. The AML approach is optimized in a truncated test space, providing deep methodological insight by visualizing it as a chemical space network. Significantly outperforming a conventional computational funnel, the optimized AML approach rapidly identifies well-known and hitherto unknown molecular OSC candidates with superior charge conduction properties. Most importantly, it constantly finds further candidates with highest efficiency while continuing its exploration of the endless design space.
Affiliation(s)
- Christian Kunkel
  - Chair for Theoretical Chemistry and Catalysis Research Center, Technische Universität München, Garching, Germany
- Johannes T Margraf
  - Chair for Theoretical Chemistry and Catalysis Research Center, Technische Universität München, Garching, Germany
- Ke Chen
  - Chair for Theoretical Chemistry and Catalysis Research Center, Technische Universität München, Garching, Germany
- Harald Oberhofer
  - Chair for Theoretical Chemistry and Catalysis Research Center, Technische Universität München, Garching, Germany
- Karsten Reuter
  - Chair for Theoretical Chemistry and Catalysis Research Center, Technische Universität München, Garching, Germany
  - Fritz-Haber-Institut der Max-Planck-Gesellschaft, Berlin, Germany
|
46
|
Jia L, Gaüzère B, Honeine P. graphkit-learn: A Python library for graph kernels based on linear patterns. Pattern Recognit Lett 2021. [DOI: 10.1016/j.patrec.2021.01.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/16/2022]
|
47
|
Galati S, Yonchev D, Rodríguez-Pérez R, Vogt M, Tuccinardi T, Bajorath J. Predicting Isoform-Selective Carbonic Anhydrase Inhibitors via Machine Learning and Rationalizing Structural Features Important for Selectivity. ACS Omega 2021; 6:4080-4089. [PMID: 33585783 PMCID: PMC7876851 DOI: 10.1021/acsomega.0c06153] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/17/2020] [Accepted: 01/14/2021] [Indexed: 05/03/2023]
Abstract
Carbonic anhydrases (CAs) catalyze the physiological hydration of carbon dioxide and are among the most intensely studied pharmaceutical target enzymes. A hallmark of CA inhibition is the complexation of the catalytic zinc cation in the active site. Human (h) CA isoforms belonging to different families are implicated in a wide range of diseases and of very high interest for therapeutic intervention. Given the conserved catalytic mechanisms and high similarity of many hCA isoforms, a major challenge for CA-based therapy is achieving inhibitor selectivity for hCA isoforms that are associated with specific pathologies over other widely distributed isoforms such as hCA I or hCA II that are of critical relevance for the integrity of many physiological processes. To address this challenge, we have attempted to predict compounds that are selective for isoform hCA IX, which is a tumor-associated protein and implicated in metastasis, over hCA II on the basis of a carefully curated data set of selective and nonselective inhibitors. Machine learning achieved surprisingly high accuracy in predicting hCA IX-selective inhibitors. The results were further investigated, and compound features determining successful predictions were identified. These features were then studied on the basis of X-ray structures of hCA isoform-inhibitor complexes and found to include substructures that explain compound selectivity. Our findings lend credence to selectivity predictions and indicate that the machine learning models derived herein have considerable potential to aid in the identification of new hCA IX-selective compounds.
Affiliation(s)
- Salvatore Galati
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
- Dimitar Yonchev
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Raquel Rodríguez-Pérez
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Martin Vogt
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
- Tiziano Tuccinardi
  - Department of Pharmacy, University of Pisa, 56126 Pisa, Italy
  - Phone: 39-050-2219595
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Friedrich-Hirzebruch-Allee 6, D-53115 Bonn, Germany
  - Phone: 49-228-7369-100
|
48
|
Shibayama S, Funatsu K. Industrial Case Study: Identification of Important Substructures and Exploration of Monomers for the Rapid Design of Novel Network Polymers with Distributed Representation. Bulletin of the Chemical Society of Japan 2021. [DOI: 10.1246/bcsj.20200220] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/12/2022]
Affiliation(s)
- Shojiro Shibayama
  - Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
- Kimito Funatsu
  - Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
|
49
|
Blaschke T, Feldmann C, Bajorath J. Prediction of Promiscuity Cliffs Using Machine Learning. Mol Inform 2021; 40:e2000196. [PMID: 32881355 PMCID: PMC7816223 DOI: 10.1002/minf.202000196] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/06/2020] [Accepted: 09/03/2020] [Indexed: 12/22/2022]
Abstract
Compounds with the ability to interact with multiple targets, also called promiscuous compounds, provide the basis for polypharmacological drug discovery. In recent years, a plethora of structural analogs with different promiscuity has been identified. Nevertheless, the molecular origins of promiscuity remain to be elucidated. In this study, we systematically extracted different structural analogs with varying promiscuity using the matched molecular pair (MMP) formalism from public biological screening and medicinal chemistry data. Care was taken to eliminate all compounds with potential false-positive activity annotations from the analysis. Promiscuity predictions were then attempted at the level of compound pairs representing promiscuity cliffs (PCs; formed by analogs with large promiscuity differences) and corresponding non-PC MMPs (analog pairs without significant promiscuity differences). To address this prediction task, different machine learning models were generated and the results were compared with single compound predictions. PCs encoding promiscuity differences were found to contain more structure-promiscuity relationship information than sets of individual promiscuous compounds. In addition, feature analysis was carried out revealing key contributions to the correct prediction of PCs and non-PC MMPs via machine learning.
Affiliation(s)
- Thomas Blaschke
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
- Christian Feldmann
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, D-53115 Bonn, Germany
|
50
|
Yonchev D, Bajorath J. DeepCOMO: from structure-activity relationship diagnostics to generative molecular design using the compound optimization monitor methodology. J Comput Aided Mol Des 2020; 34:1207-1218. [PMID: 33015739 PMCID: PMC7595974 DOI: 10.1007/s10822-020-00349-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 07/31/2020] [Accepted: 09/29/2020] [Indexed: 11/26/2022]
Abstract
The compound optimization monitor (COMO) approach was originally developed as a diagnostic approach to aid in evaluating development stages of analog series and progress made during lead optimization. COMO uses virtual analog populations for the assessment of chemical saturation of analog series and has been further developed to bridge between optimization diagnostics and compound design. Herein, we discuss key methodological features of COMO in its scientific context and present a deep learning extension of COMO for generative molecular design, leading to the introduction of DeepCOMO. Applications on exemplary analog series are reported to illustrate the entire DeepCOMO repertoire, ranging from chemical saturation and structure-activity relationship progression diagnostics to the evaluation of different analog design strategies and prioritization of virtual candidates for optimization efforts, taking into account the development stage of individual analog series.
Affiliation(s)
- Dimitar Yonchev
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, 53115, Bonn, Germany
- Jürgen Bajorath
  - Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Endenicher Allee 19c, 53115, Bonn, Germany
|