1
|
Pikalyova K, Orlov A, Lin A, Tarasova O, Marcou M, Horvath D, Poroikov V, Varnek A. HIV-1 drug resistance profiling using amino acid sequence space cartography. Bioinformatics 2022; 38:2307-2314. [PMID: 35157024 DOI: 10.1093/bioinformatics/btac090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2021] [Revised: 01/03/2022] [Accepted: 02/08/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Human immunodeficiency virus (HIV) drug resistance is a global healthcare issue. The emergence of drug resistance influenced the efficacy of treatment regimens, thus stressing the importance of treatment adaptation. Computational methods predicting the drug resistance profile from genomic data of HIV isolates are advantageous for monitoring drug resistance in patients. However, existing computational methods for drug resistance prediction are either not suitable for emerging HIV strains with complex mutational patterns or lack interpretability, which is of paramount importance in clinical practice. The approach reported here overcomes these limitations and combines high accuracy of predictions and interpretability of the models. RESULTS In this work, a new methodology based on generative topographic mapping (GTM) for biological sequence space representation and quantitative genotype-phenotype relationships prediction purposes was introduced. The GTM-based resistance landscapes allowed us to predict the resistance of HIV strains based on sequencing and drug resistance data for three viral proteins [integrase (IN), protease (PR) and reverse transcriptase (RT)] from Stanford HIV drug resistance database. The average balanced accuracy for PR inhibitors was 0.89 ± 0.01, for IN inhibitors 0.85 ± 0.01, for non-nucleoside RT inhibitors 0.73 ± 0.01 and for nucleoside RT inhibitors 0.84 ± 0.01. We have demonstrated in several case studies that GTM-based resistance landscapes are useful for visualization and analysis of sequence space as well as for treatment optimization purposes. Here, GTMs were applied for the in-depth analysis of the relationships between mutation pattern and drug resistance using mutation landscapes. This allowed us to predict retrospectively the importance of the presence of particular mutations (e.g. V32I, L10F and L33F in HIV PR) for the resistance development. This study highlights some perspectives of GTM applications in clinical informatics and particularly in the field of sequence space exploration. AVAILABILITY AND IMPLEMENTATION https://github.com/karinapikalyova/ISIDASeq. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Karina Pikalyova
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| | - Alexey Orlov
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| | - Arkadii Lin
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| | - Olga Tarasova
- Institute of Biomedical Chemistry, Moscow 119121, Russia
| | - MarcouGilles Marcou
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| | - Dragos Horvath
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| | | | - Alexandre Varnek
- Laboratoire de Chémoinformatique, UMR 7140, Université de Strasbourg, Strasbourg 67000, France
| |
Collapse
|
2
|
Bort W, Baskin II, Gimadiev T, Mukanov A, Nugmanov R, Sidorov P, Marcou G, Horvath D, Klimchuk O, Madzhidov T, Varnek A. Discovery of novel chemical reactions by deep generative recurrent neural network. Sci Rep 2021; 11:3178. [PMID: 33542271 PMCID: PMC7862614 DOI: 10.1038/s41598-021-81889-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2020] [Accepted: 01/06/2021] [Indexed: 12/18/2022] Open
Abstract
The "creativity" of Artificial Intelligence (AI) in terms of generating de novo molecular structures opened a novel paradigm in compound design, weaknesses (stability & feasibility issues of such structures) notwithstanding. Here we show that "creative" AI may be as successfully taught to enumerate novel chemical reactions that are stoichiometrically coherent. Furthermore, when coupled to reaction space cartography, de novo reaction design may be focused on the desired reaction class. A sequence-to-sequence autoencoder with bidirectional Long Short-Term Memory layers was trained on on-purpose developed "SMILES/CGR" strings, encoding reactions of the USPTO database. The autoencoder latent space was visualized on a generative topographic map. Novel latent space points were sampled around a map area populated by Suzuki reactions and decoded to corresponding reactions. These can be critically analyzed by the expert, cleaned of irrelevant functional groups and eventually experimentally attempted, herewith enlarging the synthetic purpose of popular synthetic pathways.
Collapse
Affiliation(s)
- William Bort
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France
| | - Igor I Baskin
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France
- Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, 420008, Kazan, Russia
- Department of Materials Science and Engineering, Technion - Israel Institute of Technology, 3200003, Haifa, Israel
| | - Timur Gimadiev
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, Sapporo, 001-0021, Japan
| | - Artem Mukanov
- Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, 420008, Kazan, Russia
| | - Ramil Nugmanov
- Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, 420008, Kazan, Russia
| | - Pavel Sidorov
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, Sapporo, 001-0021, Japan
| | - Gilles Marcou
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France
| | - Dragos Horvath
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France
| | - Olga Klimchuk
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France
| | - Timur Madzhidov
- Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institute of Chemistry, Kazan Federal University, Kremlyovskaya str. 18, 420008, Kazan, Russia
| | - Alexandre Varnek
- Laboratory of Chemoinformatics, UMR 7140 CNRS, University of Strasbourg, 1, rue Blaise Pascal, 67000, Strasbourg, France.
- Institute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Kita 21 Nishi 10, Kita-ku, Sapporo, 001-0021, Japan.
| |
Collapse
|
3
|
Shibayama S, Funatsu K. Industrial Case Study: Identification of Important Substructures and Exploration of Monomers for the Rapid Design of Novel Network Polymers with Distributed Representation. BULLETIN OF THE CHEMICAL SOCIETY OF JAPAN 2021. [DOI: 10.1246/bcsj.20200220] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/12/2022]
Affiliation(s)
- Shojiro Shibayama
- Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
| | - Kimito Funatsu
- Department of Chemical System Engineering, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
| |
Collapse
|
4
|
Horvath D, Orlov A, Osolodkin DI, Ishmukhametov AA, Marcou G, Varnek A. A Chemographic Audit of anti-Coronavirus Structure-activity Information from Public Databases (ChEMBL). Mol Inform 2020; 39:e2000080. [PMID: 32363750 PMCID: PMC7267182 DOI: 10.1002/minf.202000080] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2020] [Accepted: 04/26/2020] [Indexed: 01/30/2023]
Abstract
Discovery of drugs against newly emerged pathogenic agents like the SARS-CoV-2 coronavirus (CoV) must be based on previous research against related species. Scientists need to get acquainted with and develop a global oversight over so-far tested molecules. Chemography (herein used Generative Topographic Mapping, in particular) places structures on a human-readable 2D map (obtained by dimensionality reduction of the chemical space of molecular descriptors) and is thus well suited for such an audit. The goal is to map medicinal chemistry efforts so far targeted against CoVs. This includes comparing libraries tested against various virus species/genera, predicting their polypharmacological profiles and highlighting often encountered chemotypes. Maps are challenged to provide predictive activity landscapes against viral proteins. Definition of "anti-CoV" map zones led to selection of therein residing 380 potential anti-CoV agents, out of a vast pool of 800 M organic compounds.
Collapse
Affiliation(s)
- Dragos Horvath
- Chemoinformatics LaboratoryUMR 7140 CNRS/University of Strasbourg4, rue Blaise Pascal67000Strasbourg
| | - Alexey Orlov
- Chemoinformatics LaboratoryUMR 7140 CNRS/University of Strasbourg4, rue Blaise Pascal67000Strasbourg
- FSBSI “Chumakov FSC R&D IBP RAS”Poselok Instituta Poliomielita 8 bd. 1Poselenie MoskovskyMoscow108819Russia
| | - Dmitry I. Osolodkin
- FSBSI “Chumakov FSC R&D IBP RAS”Poselok Instituta Poliomielita 8 bd. 1Poselenie MoskovskyMoscow108819Russia
- Institute of Translational Medicine and BiotechnologySechenov First Moscow State Medical UniversityTrubetskaya ul. 8Moscow119991Russia
| | - Aydar A. Ishmukhametov
- FSBSI “Chumakov FSC R&D IBP RAS”Poselok Instituta Poliomielita 8 bd. 1Poselenie MoskovskyMoscow108819Russia
- Institute of Translational Medicine and BiotechnologySechenov First Moscow State Medical UniversityTrubetskaya ul. 8Moscow119991Russia
| | - Gilles Marcou
- Chemoinformatics LaboratoryUMR 7140 CNRS/University of Strasbourg4, rue Blaise Pascal67000Strasbourg
| | - Alexandre Varnek
- Chemoinformatics LaboratoryUMR 7140 CNRS/University of Strasbourg4, rue Blaise Pascal67000Strasbourg
| |
Collapse
|
5
|
Baskin II, Lozano S, Durot M, Marcou G, Horvath D, Varnek A. Autoignition temperature: comprehensive data analysis and predictive models. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2020; 31:597-613. [PMID: 32646236 DOI: 10.1080/1062936x.2020.1785933] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/18/2020] [Accepted: 06/18/2020] [Indexed: 06/11/2023]
Abstract
Here we report a new predictive model for autoignition temperature (AIT), an important physical parameter widely used to assess potential safety hazards of combustible materials. Available structure-AIT data extracted from different sources were critically analysed. Support vector regression (SVR) models on different data subsets were built in order to identify a reliable compound set on which a realistic model could be built. This led to a selection of the dataset containing 875 compounds annotated with AIT values. The thereupon-based SVR model performs reasonably well in cross-validation with the determination coefficient r 2 = 0.77 and mean absolute error MAE = 37.8°C. External validation on 20 industrial compounds missing in the training set confirmed its good predictive power (MAE = 28.7°C).
Collapse
Affiliation(s)
- I I Baskin
- Laboratory of Chemoinformatics, University of Strasbourg, UMR 7140 CNRS/UniStra , Strasbourg, France
| | - S Lozano
- BioLab, Centre de Recherche de Solaize, Total , Solaize, France
| | - M Durot
- BioLab, Centre de Recherche de Solaize, Total , Solaize, France
| | - G Marcou
- Laboratory of Chemoinformatics, University of Strasbourg, UMR 7140 CNRS/UniStra , Strasbourg, France
| | - D Horvath
- Laboratory of Chemoinformatics, University of Strasbourg, UMR 7140 CNRS/UniStra , Strasbourg, France
| | - A Varnek
- Laboratory of Chemoinformatics, University of Strasbourg, UMR 7140 CNRS/UniStra , Strasbourg, France
| |
Collapse
|
6
|
Sattarov B, Baskin II, Horvath D, Marcou G, Bjerrum EJ, Varnek A. De Novo Molecular Design by Combining Deep Autoencoder Recurrent Neural Networks with Generative Topographic Mapping. J Chem Inf Model 2019; 59:1182-1196. [PMID: 30785751 DOI: 10.1021/acs.jcim.8b00751] [Citation(s) in RCA: 85] [Impact Index Per Article: 14.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Here we show that Generative Topographic Mapping (GTM) can be used to explore the latent space of the SMILES-based autoencoders and generate focused molecular libraries of interest. We have built a sequence-to-sequence neural network with Bidirectional Long Short-Term Memory layers and trained it on the SMILES strings from ChEMBL23. Very high reconstruction rates of the test set molecules were achieved (>98%), which are comparable to the ones reported in related publications. Using GTM, we have visualized the autoencoder latent space on the two-dimensional topographic map. Targeted map zones can be used for generating novel molecular structures by sampling associated latent space points and decoding them to SMILES. The sampling method based on a genetic algorithm was introduced to optimize compound properties "on the fly". The generated focused molecular libraries were shown to contain original and a priori feasible compounds which, pending actual synthesis and testing, showed encouraging behavior in independent structure-based affinity estimation procedures (pharmacophore matching, docking).
Collapse
Affiliation(s)
- Boris Sattarov
- Laboratory of Chemoinformatics , UMR 7177 University of Strasbourg/CNRS , 4 rue B. Pascal , 67000 Strasbourg , France
| | - Igor I Baskin
- Faculty of Physics , M.V. Lomonosov Moscow State University , Leninskie Gory , Moscow 19991 , Russia
| | - Dragos Horvath
- Laboratory of Chemoinformatics , UMR 7177 University of Strasbourg/CNRS , 4 rue B. Pascal , 67000 Strasbourg , France
| | - Gilles Marcou
- Laboratory of Chemoinformatics , UMR 7177 University of Strasbourg/CNRS , 4 rue B. Pascal , 67000 Strasbourg , France
| | - Esben Jannik Bjerrum
- Wildcard Pharmaceutical Consulting, Zeaborg Science Center, Frødings Allé 41 , 2860 Søborg , Denmark
| | - Alexandre Varnek
- Laboratory of Chemoinformatics , UMR 7177 University of Strasbourg/CNRS , 4 rue B. Pascal , 67000 Strasbourg , France
| |
Collapse
|
7
|
Lin A, Horvath D, Afonina V, Marcou G, Reymond JL, Varnek A. Mapping of the Available Chemical Space versus the Chemical Universe of Lead-Like Compounds. ChemMedChem 2018; 13:540-554. [PMID: 29154440 DOI: 10.1002/cmdc.201700561] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2017] [Revised: 11/07/2017] [Indexed: 12/15/2022]
Abstract
This is, to our knowledge, the most comprehensive analysis to date based on generative topographic mapping (GTM) of fragment-like chemical space (40 million molecules with no more than 17 heavy atoms, both from the theoretically enumerated GDB-17 and real-world PubChem/ChEMBL databases). The challenge was to prove that a robust map of fragment-like chemical space can actually be built, in spite of a limited (≪105 ) maximal number of compounds ("frame set") usable for fitting the GTM manifold. An evolutionary map building strategy has been updated with a "coverage check" step, which discards manifolds failing to accommodate compounds outside the frame set. The evolved map has a good propensity to separate actives from inactives for more than 20 external structure-activity sets. It was proven to properly accommodate the entire collection of 40 m compounds. Next, it served as a library comparison tool to highlight biases of real-world molecules (PubChem and ChEMBL) versus the universe of all possible species represented by FDB-17, a fragment-like subset of GDB-17 containing 10 million molecules. Specific patterns, proper to some libraries and absent from others (diversity holes), were highlighted.
Collapse
Affiliation(s)
- Arkadii Lin
- Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4 Blaise Pascal str., 67081, Strasbourg, France
| | - Dragos Horvath
- Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4 Blaise Pascal str., 67081, Strasbourg, France
| | - Valentina Afonina
- Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4 Blaise Pascal str., 67081, Strasbourg, France.,Laboratory of Chemoinformatics and Molecular Modeling, Department of Organic Chemistry, A.M. Butlerov Institute of Chemistry, Kazan Federal University, 18 Kremlyovskaya str., 420008, Kazan, Russia
| | - Gilles Marcou
- Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4 Blaise Pascal str., 67081, Strasbourg, France
| | - Jean-Louis Reymond
- Department of Chemistry and Biochemistry, University of Berne, 3 Freiestrasse, 3012, Berne, Switzerland
| | - Alexandre Varnek
- Laboratory of Chemoinformatics, Faculty of Chemistry, University of Strasbourg, 4 Blaise Pascal str., 67081, Strasbourg, France
| |
Collapse
|
8
|
Abstract
Various methods of machine learning, supervised and unsupervised, linear and nonlinear, classification and regression, in combination with various types of molecular descriptors, both "handcrafted" and "data-driven," are considered in the context of their use in computational toxicology. The use of multiple linear regression, variants of naïve Bayes classifier, k-nearest neighbors, support vector machine, decision trees, ensemble learning, random forest, several types of neural networks, and deep learning is the focus of attention of this review. The role of fragment descriptors, graph mining, and graph kernels is highlighted. The application of unsupervised methods, such as Kohonen's self-organizing maps and related approaches, which allow for combining predictions with data analysis and visualization, is also considered. The necessity of applying a wide range of machine learning methods in computational toxicology is underlined.
Collapse
Affiliation(s)
- Igor I Baskin
- Faculty of Physics, M.V. Lomonosov Moscow State University, Moscow, Russian Federation.
- Butlerov Institute of Chemistry, Kazan Federal University, Kazan, Russian Federation.
| |
Collapse
|
9
|
Predictive cartography of metal binders using generative topographic mapping. J Comput Aided Mol Des 2017; 31:701-714. [DOI: 10.1007/s10822-017-0033-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2017] [Accepted: 06/11/2017] [Indexed: 12/27/2022]
|
10
|
QSAR modeling and chemical space analysis of antimalarial compounds. J Comput Aided Mol Des 2017; 31:441-451. [PMID: 28374255 DOI: 10.1007/s10822-017-0019-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2016] [Accepted: 03/18/2017] [Indexed: 10/19/2022]
Abstract
Generative topographic mapping (GTM) has been used to visualize and analyze the chemical space of antimalarial compounds as well as to build predictive models linking structure of molecules with their antimalarial activity. For this, a database, including ~3000 molecules tested in one or several of 17 anti-Plasmodium activity assessment protocols, has been compiled by assembling experimental data from in-house and ChEMBL databases. GTM classification models built on subsets corresponding to individual bioassays perform similarly to the earlier reported SVM models. Zones preferentially populated by active and inactive molecules, respectively, clearly emerge in the class landscapes supported by the GTM model. Their analysis resulted in identification of privileged structural motifs of potential antimalarial compounds. Projection of marketed antimalarial drugs on this map allowed us to delineate several areas in the chemical space corresponding to different mechanisms of antimalarial activity. This helped us to make a suggestion about the mode of action of the molecules populating these zones.
Collapse
|