1
|
Song RX, Nicklaus MC, Tarasova NI. Correlation of protein binding pocket properties with hits' chemistries used in generation of ultra-large virtual libraries. J Comput Aided Mol Des 2024; 38:22. [PMID: 38753096 PMCID: PMC11098933 DOI: 10.1007/s10822-024-00562-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Accepted: 04/22/2024] [Indexed: 05/19/2024]
Abstract
Although the size of virtual libraries of synthesizable compounds is growing rapidly, we are still enumerating only tiny fractions of the drug-like chemical universe. Our capability to mine these newly generated libraries also lags their growth. That is why fragment-based approaches that utilize on-demand virtual combinatorial libraries are gaining popularity in drug discovery. These à la carte libraries utilize synthetic blocks found to be effective binders in parts of target protein pockets and a variety of reliable chemistries to connect them. There is, however, no data on the potential impact of the chemistries used for making on-demand libraries on the hit rates during virtual screening. There are also no rules to guide in the selection of these synthetic methods for production of custom libraries. We have used the SAVI (Synthetically Accessible Virtual Inventory) library, constructed using 53 reliable reaction types (transforms), to evaluate the impact of these chemistries on docking hit rates for 40 well-characterized protein pockets. The data shows that the virtual hit rates differ significantly for different chemistries with cross coupling reactions such as Sonogashira, Suzuki-Miyaura, Hiyama and Liebeskind-Srogl coupling producing the highest hit rates. Virtual hit rates appear to depend not only on the property of the formed chemical bond but also on the diversity of available building blocks and the scope of the reaction. The data identifies reactions that deserve wider use through increasing the number of corresponding building blocks and suggests the reactions that are more effective for pockets with certain physical and hydrogen bond-forming properties.
Collapse
Affiliation(s)
- Robert X Song
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, 21702, USA
| | - Marc C Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, NIH, Frederick, MD, 21702, USA
| | - Nadya I Tarasova
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, 21702, USA.
| |
Collapse
|
2
|
Mahjour BA, Coley CW. RDCanon: A Python Package for Canonicalizing the Order of Tokens in SMARTS Queries. J Chem Inf Model 2024; 64:2948-2954. [PMID: 38488634 DOI: 10.1021/acs.jcim.4c00138] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/23/2024]
Abstract
SMARTS is a widely used language in cheminformatics for defining substructural queries for database lookups, reaction templates for chemical transformations, and other applications. As an extension to SMILES, many SMARTS patterns can represent the same query. Despite this, no canonicalization algorithm invariant of the line notation sequence or atomic numbering is publicly available. Here, we introduce RDCanon, an open-source Python package that can be used to standardize SMARTS queries. RDCanon is designed to ensure that the sequence of atomic queries remains consistent for all graphs representing the same substructure query and to ensure a canonical sequence of primitives within each individual atom query; furthermore, the algorithm can be applied to canonicalize the order of reactants, agents, and products and their atom map numbers in reaction SMARTS templates. As part of its canonicalization algorithm, RDCanon provides a mechanism in which the canonicalized SMARTS is optimized for speed against specific molecular databases. Several case studies are provided to showcase improved efficiency in substructure matching and retrosynthetic analysis.
Collapse
Affiliation(s)
- Babak A Mahjour
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, United States
| |
Collapse
|
3
|
Kim H, Lee K, Kim C, Lim J, Kim WY. DFRscore: Deep Learning-Based Scoring of Synthetic Complexity with Drug-Focused Retrosynthetic Analysis for High-Throughput Virtual Screening. J Chem Inf Model 2024; 64:2432-2444. [PMID: 37651152 DOI: 10.1021/acs.jcim.3c01134] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
Abstract
Recently emerging generative AI models enable us to produce a vast number of compounds for potential applications. While they can provide novel molecular structures, the synthetic feasibility of the generated molecules is often questioned. To address this issue, a few recent studies have attempted to use deep learning models to estimate the synthetic accessibility of many molecules rapidly. However, retrosynthetic analysis tools used to train the models rely on reaction templates automatically extracted from a large reaction database that are not domain-specific and may exhibit low chemical correctness. To overcome this limitation, we introduce DFRscore (Drug-Focused Retrosynthetic score), a deep learning-based approach for a more practical assessment of synthetic accessibility in drug discovery. The DFRscore model is trained exclusively on drug-focused reactions, providing a predicted number of minimally required synthetic steps for each compound. This approach enables practitioners to filter out compounds that do not meet their desired level of synthetic accessibility at an early stage of high-throughput virtual screening for accelerated drug discovery. The proposed strategy can be easily adapted to other domains by adjusting the synthesis planning setup of the reaction templates and starting materials.
Collapse
Affiliation(s)
- Hyeongwoo Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Kyunghoon Lee
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Chansu Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| | - Jaechang Lim
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
| | - Woo Youn Kim
- Department of Chemistry, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
- HITS Incorporation, 124 Teheran-ro, Gangnam-gu, Seoul 06234, Republic of Korea
- AI Institute, KAIST, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea
| |
Collapse
|
4
|
Dodds M, Guo J, Löhr T, Tibo A, Engkvist O, Janet JP. Sample efficient reinforcement learning with active learning for molecular design. Chem Sci 2024; 15:4146-4160. [PMID: 38487235 PMCID: PMC10935729 DOI: 10.1039/d3sc04653b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2023] [Accepted: 02/07/2024] [Indexed: 03/17/2024] Open
Abstract
Reinforcement learning (RL) is a powerful and flexible paradigm for searching for solutions in high-dimensional action spaces. However, bridging the gap between playing computer games with thousands of simulated episodes and solving real scientific problems with complex and involved environments (up to actual laboratory experiments) requires improvements in terms of sample efficiency to make the most of expensive information. The discovery of new drugs is a major commercial application of RL, motivated by the very large nature of the chemical space and the need to perform multiparameter optimization (MPO) across different properties. In silico methods, such as virtual library screening (VS) and de novo molecular generation with RL, show great promise in accelerating this search. However, incorporation of increasingly complex computational models in these workflows requires increasing sample efficiency. Here, we introduce an active learning system linked with an RL model (RL-AL) for molecular design, which aims to improve the sample-efficiency of the optimization process. We identity and characterize unique challenges combining RL and AL, investigate the interplay between the systems, and develop a novel AL approach to solve the MPO problem. Our approach greatly expedites the search for novel solutions relative to baseline-RL for simple ligand- and structure-based oracle functions, with a 5-66-fold increase in hits generated for a fixed oracle budget and a 4-64-fold reduction in computational time to find a specific number of hits. Furthermore, compounds discovered through RL-AL display substantial enrichment of a multi-parameter scoring objective, indicating superior efficacy in curating high-scoring compounds, without a reduction in output diversity. This significant acceleration improves the feasibility of oracle functions that have largely been overlooked in RL due to high computational costs, for example free energy perturbation methods, and in principle is applicable to any RL domain.
Collapse
Affiliation(s)
- Michael Dodds
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| | - Jeff Guo
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| | - Thomas Löhr
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| | - Alessandro Tibo
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| | - Ola Engkvist
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| | - Jon Paul Janet
- Molecular AI, Discovery Sciences, R&D, AstraZeneca 431 50 Gothenburg Sweden
| |
Collapse
|
5
|
Ertl P. Database of 4 Million Medicinal Chemistry-Relevant Ring Systems. J Chem Inf Model 2024; 64:1245-1250. [PMID: 38311838 DOI: 10.1021/acs.jcim.3c01812] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2024]
Abstract
Central ring systems are the most important part of bioactive molecules. They determine molecule shape, keep substituents in their proper positions, and also influence global molecular properties. In the present study, a database of 4 million medicinal chemistry-relevant ring systems has been created, not by crude random enumeration but by applying a set of rules derived by analyzing rings present in bioactive molecules. The aromatic properties and tautomer stability of generated rings have also been considered to ensure that the rings in the database are stable and chemically reasonable. 99.2% of these rings are novel and not included in molecules in the ChEMBL or PubChem databases. This large database of ring systems has been created with the goal to provide support for bioisosteric design and scaffold hopping as well as to be used in generative chemistry applications. The complete set of created rings is available for download in the SMILES format from https://peter-ertl.com/molecular/data/.
Collapse
Affiliation(s)
- Peter Ertl
- Global Discovery Chemistry, Biomedical Research, Novartis CH-4056 Basel, Switzerland
| |
Collapse
|
6
|
Tropsha A, Isayev O, Varnek A, Schneider G, Cherkasov A. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nat Rev Drug Discov 2024; 23:141-155. [PMID: 38066301 DOI: 10.1038/s41573-023-00832-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 10/21/2023] [Indexed: 02/08/2024]
Abstract
Quantitative structure-activity relationship (QSAR) modelling, an approach that was introduced 60 years ago, is widely used in computer-aided drug design. In recent years, progress in artificial intelligence techniques, such as deep learning, the rapid growth of databases of molecules for virtual screening and dramatic improvements in computational power have supported the emergence of a new field of QSAR applications that we term 'deep QSAR'. Marking a decade from the pioneering applications of deep QSAR to tasks involved in small-molecule drug discovery, we herein describe key advances in the field, including deep generative and reinforcement learning approaches in molecular design, deep learning models for synthetic planning and the application of deep QSAR models in structure-based virtual screening. We also reflect on the emergence of quantum computing, which promises to further accelerate deep QSAR applications and the need for open-source and democratized resources to support computer-aided drug design.
Collapse
Affiliation(s)
| | | | | | | | - Artem Cherkasov
- University of British Columbia, Vancouver, BC, Canada.
- Photonic Inc., Coquitlam, BC, Canada.
| |
Collapse
|
7
|
John L, Nagamani S, Mahanta HJ, Vaikundamani S, Kumar N, Kumar A, Jamir E, Priyadarsinee L, Sastry GN. Molecular Property Diagnostic Suite Compound Library (MPDS-CL): a structure-based classification of the chemical space. Mol Divers 2023:10.1007/s11030-023-10752-1. [PMID: 37902900 DOI: 10.1007/s11030-023-10752-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2023] [Accepted: 10/17/2023] [Indexed: 11/01/2023]
Abstract
Molecular Property Diagnostic Suite Compound Library (MPDS-CL) is an open-source Galaxy-based cheminformatics web portal which presents a structure-based classification of the molecules. A structure-based classification of nearly 150 million unique compounds, obtained from 42 publicly available databases and curated for redundancy removal through 97 hierarchically well-defined atom composition-based portions, has been done. These are further subjected to 56-bit fingerprint-based classification algorithm which led to the formation of 56 structurally well-defined classes. The classes thus obtained were further divided into clusters based on their molecular weight. Thus, the entire set of molecules was put into 56 different classes and 625 clusters. This led to the assignment of a unique ID, named as MPDS-AadharID, for each of these 149,169,443 molecules. MPDS-AadharID is akin to the unique number given to citizens in India (similar to SSN in the US and NINO in the UK). The unique features of MPDS-CL are (a) several search options, such as exact structure search, substructure search, property-based search, fingerprint-based search, using SMILES, InChIKey and key-in; (b) automatic generation of information for the processing for MPDS and other galaxy tools; (c) providing the class and cluster of a molecule which makes it easier and fast to search for similar molecules and (d) information related to the presence of the molecules in multiple databases. The MPDS-CL can be accessed at https://mpds.neist.res.in:8086/ .
Collapse
Affiliation(s)
- Lijo John
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Selvaraman Nagamani
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Hridoy Jyoti Mahanta
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - S Vaikundamani
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
| | - Nandan Kumar
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Asheesh Kumar
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
| | - Esther Jamir
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - Lipsa Priyadarsinee
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India
| | - G Narahari Sastry
- Advanced Computation and Data Sciences Division, CSIR - North East Institute of Science and Technology, Jorhat, 785006, India.
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, 201002, India.
| |
Collapse
|
8
|
Liphardt T, Sander T. Fast Substructure Search in Combinatorial Library Spaces. J Chem Inf Model 2023; 63:5133-5141. [PMID: 37221856 DOI: 10.1021/acs.jcim.3c00290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
We present an efficient algorithm for substructure search in combinatorial libraries defined by synthons, i.e., substructures with connection points. Our method improves on existing approaches by introducing powerful heuristics and fast fingerprint screening to quickly eliminate branches of nonmatching combinations of synthons. With this, we achieve typical response times of a few seconds on a standard desktop computer for searches in large combinatorial libraries like the Enamine REAL Space. We published the Java source as part of the OpenChemLib under the BSD license, and we implemented tools to enable substructure search in custom combinatorial libraries.
Collapse
Affiliation(s)
- Thomas Liphardt
- Scientific Computing Drug Discovery, Idorsia Pharmaceuticals, Ltd., CH-4123 Allschwil, Switzerland
| | - Thomas Sander
- Scientific Computing Drug Discovery, Idorsia Pharmaceuticals, Ltd., CH-4123 Allschwil, Switzerland
| |
Collapse
|
9
|
Venkatraman V. FP-MAP: an extensive library of fingerprint-based molecular activity prediction tools. Front Chem 2023; 11:1239467. [PMID: 37649967 PMCID: PMC10462816 DOI: 10.3389/fchem.2023.1239467] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2023] [Accepted: 07/31/2023] [Indexed: 09/01/2023] Open
Abstract
Discovering new drugs for disease treatment is challenging, requiring a multidisciplinary effort as well as time, and resources. With a view to improving hit discovery and lead compound identification, machine learning (ML) approaches are being increasingly used in the decision-making process. Although a number of ML-based studies have been published, most studies only report fragments of the wider range of bioactivities wherein each model typically focuses on a particular disease. This study introduces FP-MAP, an extensive atlas of fingerprint-based prediction models that covers a diverse range of activities including neglected tropical diseases (caused by viral, bacterial and parasitic pathogens) as well as other targets implicated in diseases such as Alzheimer's. To arrive at the best predictive models, performance of ≈4,000 classification/regression models were evaluated on different bioactivity data sets using 12 different molecular fingerprints. The best performing models that achieved test set AUC values of 0.62-0.99 have been integrated into an easy-to-use graphical user interface that can be downloaded from https://gitlab.com/vishsoft/fpmap.
Collapse
Affiliation(s)
- Vishwesh Venkatraman
- Department of Chemistry, Norwegian University of Science and Technology, Trondheim, Norway
| |
Collapse
|
10
|
Bonilla PA, Hoop CL, Stefanisko K, Tarasov SG, Sinha S, Nicklaus MC, Tarasova NI. Virtual screening of ultra-large chemical libraries identifies cell-permeable small-molecule inhibitors of a "non-druggable" target, STAT3 N-terminal domain. Front Oncol 2023; 13:1144153. [PMID: 37182134 PMCID: PMC10167007 DOI: 10.3389/fonc.2023.1144153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2023] [Accepted: 03/23/2023] [Indexed: 05/16/2023] Open
Abstract
STAT3 N-terminal domain is a promising molecular target for cancer treatment and modulation of immune responses. However, STAT3 is localized in the cytoplasm, mitochondria, and nuclei, and thus, is inaccessible to therapeutic antibodies. Its N-terminal domain lacks deep pockets on the surface and represents a typical "non-druggable" protein. In order to successfully identify potent and selective inhibitors of the domain, we have used virtual screening of billion structure-sized virtual libraries of make-on-demand screening samples. The results suggest that the expansion of accessible chemical space by cutting-edge ultra-large virtual compound databases can lead to successful development of small molecule drugs for hard-to-target intracellular proteins.
Collapse
Affiliation(s)
- Pedro Andrade Bonilla
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, United States
| | - Cody L. Hoop
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, United States
| | - Karen Stefanisko
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, United States
| | - Sergey G. Tarasov
- Center for Structural Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, United States
| | | | - Marc C. Nicklaus
- Computer-Aided Drug Design Group, Chemical Biology Laboratory, Center for Cancer Research, National Cancer Institute, National Institute of Health (NIH), Frederick, MD, United States
| | - Nadya I. Tarasova
- Cancer Innovation Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Frederick, MD, United States
| |
Collapse
|
11
|
Korn M, Ehrt C, Ruggiu F, Gastreich M, Rarey M. Navigating large chemical spaces in early-phase drug discovery. Curr Opin Struct Biol 2023; 80:102578. [PMID: 37019067 DOI: 10.1016/j.sbi.2023.102578] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2022] [Revised: 01/28/2023] [Accepted: 02/26/2023] [Indexed: 04/07/2023]
Abstract
The size of actionable chemical spaces is surging, owing to a variety of novel techniques, both computational and experimental. As a consequence, novel molecular matter is now at our fingertips that cannot and should not be neglected in early-phase drug discovery. Huge, combinatorial, make-on-demand chemical spaces with high probability of synthetic success rise exponentially in content, generative machine learning models go hand in hand with synthesis prediction, and DNA-encoded libraries offer new ways of hit structure discovery. These technologies enable to search for new chemical matter in a much broader and deeper manner with less effort and fewer financial resources. These transformational developments require new cheminformatics approaches to make huge chemical spaces searchable and analyzable with low resources, and with as little energy consumption as possible. Substantial progress has been made in the past years with respect to computation as well as organic synthesis. First examples of bioactive compounds resulting from the successful use of these novel technologies demonstrate their power to contribute to tomorrow's drug discovery programs. This article gives a compact overview of the state-of-the-art.
Collapse
Affiliation(s)
- Malte Korn
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstr. 43, 20146 Hamburg, Germany
| | - Christiane Ehrt
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstr. 43, 20146 Hamburg, Germany
| | - Fiorella Ruggiu
- insitro, 279 E Grand Ave., CA 94608, South San Francisco, USA
| | - Marcus Gastreich
- BioSolveIT GmbH, An der Ziegelei 79, 53757 Sankt Augustin, Germany
| | - Matthias Rarey
- Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstr. 43, 20146 Hamburg, Germany.
| |
Collapse
|
12
|
Abstract
Computer-aided drug discovery has been around for decades, although the past few years have seen a tectonic shift towards embracing computational technologies in both academia and pharma. This shift is largely defined by the flood of data on ligand properties and binding to therapeutic targets and their 3D structures, abundant computing capacities and the advent of on-demand virtual libraries of drug-like small molecules in their billions. Taking full advantage of these resources requires fast computational methods for effective ligand screening. This includes structure-based virtual screening of gigascale chemical spaces, further facilitated by fast iterative screening approaches. Highly synergistic are developments in deep learning predictions of ligand properties and target activities in lieu of receptor structure. Here we review recent advances in ligand discovery technologies, their potential for reshaping the whole process of drug discovery and development, as well as the challenges they encounter. We also discuss how the rapid identification of highly diverse, potent, target-selective and drug-like ligands to protein targets can democratize the drug discovery process, presenting new opportunities for the cost-effective development of safer and more effective small-molecule treatments.
Collapse
Affiliation(s)
- Anastasiia V Sadybekov
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Center for New Technologies in Drug Discovery and Development, Bridge Institute, Michelson Center for Convergent Biosciences, University of Southern California, Los Angeles, CA, USA
| | - Vsevolod Katritch
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
- Center for New Technologies in Drug Discovery and Development, Bridge Institute, Michelson Center for Convergent Biosciences, University of Southern California, Los Angeles, CA, USA.
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
13
|
Neves P, McClure K, Verhoeven J, Dyubankova N, Nugmanov R, Gedich A, Menon S, Shi Z, Wegner JK. Global reactivity models are impactful in industrial synthesis applications. J Cheminform 2023; 15:20. [PMID: 36774523 PMCID: PMC9921076 DOI: 10.1186/s13321-023-00685-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2022] [Accepted: 01/22/2023] [Indexed: 02/13/2023] Open
Abstract
Artificial Intelligence is revolutionizing many aspects of the pharmaceutical industry. Deep learning models are now routinely applied to guide drug discovery projects leading to faster and improved findings, but there are still many tasks with enormous unrealized potential. One such task is the reaction yield prediction. Every year more than one fifth of all synthesis attempts result in product yields which are either zero or too low. This equates to chemical and human resources being spent on activities which ultimately do not progress the programs, leading to a triple loss when accounting for the cost of opportunity in time wasted. In this work we pre-train a BERT model on more than 16 million reactions from 4 different data sources, and fine tune it to achieve an uncertainty calibrated global yield prediction model. This model is an improvement upon state of the art not just from the increase in pre-train data but also by introducing a new embedding layer which solves a few limitations of SMILES and enables integration of additional information such as equivalents and molecule role into the reaction encoding, the model is called BERT Enriched Embedding (BEE). The model is benchmarked on an open-source dataset against a state-of-the-art synthesis focused BERT showing a near 20-point improvement in r2 score. The model is fine-tuned and tested on an internal company data benchmark, and a prospective study shows that the application of the model can reduce the total number of negative reactions (yield under 5%) ran in Janssen by at least 34%. Lastly, we corroborate the previous results through experimental validation, by directly deploying the model in an on-going drug discovery project and showing that it can also be used successfully as a reagent recommender due to its fast inference speed and reliable confidence estimation, a critical feature for industry application.
Collapse
Affiliation(s)
- Paulo Neves
- In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium.
| | - Kelly McClure
- Discovery Chemistry LJ, Janssen Research & Development, Janssen Pharmaceutica N.V, Philadelphia, United States of America
| | - Jonas Verhoeven
- grid.419619.20000 0004 0623 0341In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
| | - Natalia Dyubankova
- grid.419619.20000 0004 0623 0341In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
| | - Ramil Nugmanov
- grid.419619.20000 0004 0623 0341In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
| | | | - Sairam Menon
- grid.419619.20000 0004 0623 0341Pharma R&D Information Tech, Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
| | - Zhicai Shi
- Discovery Chemistry LJ, Janssen Research & Development, Janssen Pharmaceutica N.V, Philadelphia, United States of America
| | - Jörg K. Wegner
- grid.419619.20000 0004 0623 0341In-Silico Discovery and External Innovation (ISDEI), Janssen Research & Development, Janssen Pharmaceutica N.V, Beerse, Belgium
| |
Collapse
|
14
|
Clyde A, Liu X, Brettin T, Yoo H, Partin A, Babuji Y, Blaiszik B, Mohd-Yusof J, Merzky A, Turilli M, Jha S, Ramanathan A, Stevens R. AI-accelerated protein-ligand docking for SARS-CoV-2 is 100-fold faster with no significant change in detection. Sci Rep 2023; 13:2105. [PMID: 36747041 PMCID: PMC9901402 DOI: 10.1038/s41598-023-28785-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2022] [Accepted: 01/24/2023] [Indexed: 02/08/2023] Open
Abstract
Protein-ligand docking is a computational method for identifying drug leads. The method is capable of narrowing a vast library of compounds down to a tractable size for downstream simulation or experimental testing and is widely used in drug discovery. While there has been progress in accelerating scoring of compounds with artificial intelligence, few works have bridged these successes back to the virtual screening community in terms of utility and forward-looking development. We demonstrate the power of high-speed ML models by scoring 1 billion molecules in under a day (50 k predictions per GPU seconds). We showcase a workflow for docking utilizing surrogate AI-based models as a pre-filter to a standard docking workflow. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01% of detecting the underlying best scoring 0.1% of compounds. Our analysis of the speedup explains that another order of magnitude speedup must come from model accuracy rather than computing speed. In order to drive another order of magnitude of acceleration, we share a benchmark dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. We believe this is strong evidence for the community to begin focusing on improving the accuracy of surrogate models to improve the ability to screen massive compound libraries 100 × or even 1000 × faster than current techniques and reduce missing top hits. The technique outlined aims to be a fast drop-in replacement for docking for screening billion-scale molecular libraries.
Collapse
Affiliation(s)
- Austin Clyde
- Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA.
- Department of Computer Science, University of Chicago, Chicago, 60637, USA.
| | - Xuefeng Liu
- Department of Computer Science, University of Chicago, Chicago, 60637, USA
| | - Thomas Brettin
- Department of Computer Science, University of Chicago, Chicago, 60637, USA
- Argonne National Laboratory, Computing, Environment, and Life Sciences Directorate, Lemont, 60439, USA
| | - Hyunseung Yoo
- Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA
| | - Alexander Partin
- Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA
| | - Yadu Babuji
- Department of Computer Science, University of Chicago, Chicago, 60637, USA
| | - Ben Blaiszik
- Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA
- University of Chicago, Globus, Chicago, 60637, USA
| | - Jamaludin Mohd-Yusof
- Los Alamos National Laboratory, Computer, Computational, and Statistical Sciences, Los Alamos, 87545, USA
| | - Andre Merzky
- Department of Electrical and Computer Engineering, Rutgers University, Piscataway, 08854, USA
- Brookhaven National Laboratory, Computational Sciences Initiative, Upton, 11973, USA
| | - Matteo Turilli
- Department of Electrical and Computer Engineering, Rutgers University, Piscataway, 08854, USA
- Brookhaven National Laboratory, Computational Sciences Initiative, Upton, 11973, USA
| | - Shantenu Jha
- Department of Electrical and Computer Engineering, Rutgers University, Piscataway, 08854, USA
- Brookhaven National Laboratory, Computational Sciences Initiative, Upton, 11973, USA
| | - Arvind Ramanathan
- Argonne National Laboratory, Data Science and Learning Division, Chicago, Lemont, 60439, USA
| | - Rick Stevens
- Department of Computer Science, University of Chicago, Chicago, 60637, USA
- Argonne National Laboratory, Computing, Environment, and Life Sciences Directorate, Lemont, 60439, USA
| |
Collapse
|
15
|
Zeng X, Wang F, Luo Y, Kang SG, Tang J, Lightstone FC, Fang EF, Cornell W, Nussinov R, Cheng F. Deep generative molecular design reshapes drug discovery. Cell Rep Med 2022; 3:100794. [PMID: 36306797 DOI: 10.1016/j.xcrm.2022.100794] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/05/2022] [Revised: 08/05/2022] [Accepted: 09/30/2022] [Indexed: 11/05/2022]
Abstract
Recent advances and accomplishments of artificial intelligence (AI) and deep generative models have established their usefulness in medicinal applications, especially in drug discovery and development. To correctly apply AI, the developer and user face questions such as which protocols to consider, which factors to scrutinize, and how the deep generative models can integrate the relevant disciplines. This review summarizes classical and newly developed AI approaches, providing an updated and accessible guide to the broad computational drug discovery and development community. We introduce deep generative models from different standpoints and describe the theoretical frameworks for representing chemical and biological structures and their applications. We discuss the data and technical challenges and highlight future directions of multimodal deep generative models for accelerating drug discovery.
Collapse
|
16
|
Zabolotna Y, Bonachera F, Horvath D, Lin A, Marcou G, Klimchuk O, Varnek A. Chemspace Atlas: Multiscale Chemography of Ultralarge Libraries for Drug Discovery. J Chem Inf Model 2022; 62:4537-4548. [DOI: 10.1021/acs.jcim.2c00509] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Affiliation(s)
- Yuliana Zabolotna
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Fanny Bonachera
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Dragos Horvath
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Arkadii Lin
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Gilles Marcou
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Olga Klimchuk
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| | - Alexandre Varnek
- University of Strasbourg, Laboratoire de Chemoinformatique, 4, rue B. Pascal, Strasbourg 67081, France
| |
Collapse
|
17
|
Gaur AS, John L, Kumar N, Vivek MR, Nagamani S, Mahanta HJ, Sastry GN. Towards systematic exploration of chemical space: building the fragment library module in molecular property diagnostic suite. Mol Divers 2022. [PMID: 35925528 DOI: 10.1007/s11030-022-10506-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2022] [Accepted: 07/23/2022] [Indexed: 11/04/2022]
Abstract
A fragment-based drug discovery (FBDD) approach has traditionally been of utmost significance in drug design studies. It allows the exploration of large chemical space to find novel scaffolds and chemotypes which can be improved into selective inhibitors with good affinity. In the current work, several public domain chemical libraries (ChEMBL, DrugCentral, PDB ligands, COCONUT, and SAVI) comprising bioactive and virtual molecules were retrieved to develop a fragment library. A systematic fragmentation method that breaks a given molecule into rings, linkers, and substituents was used to cleave the molecules and the fragments were analyzed. Further, only the ring framework was taken into the consideration to develop a fragment library that consists of a total number of 107,614 unique fragments. This set represents a rich diverse structure framework that covers a wide variety of yet-to-be-explored fragments for a wide range of small molecule-based applications. This fragment library is an integral part of the molecular property diagnostic suite (MPDS) suite that can be used with other modeling and informatics methods for FBDD approaches. The fragment library module of MPDS can be accessed at http://mpds.neist.res.in:8085.
Collapse
|
18
|
Abstract
We present a comprehensive analysis of all ring systems (both heterocyclic and nonheterocyclic) in clinical trial compounds and FDA-approved drugs. We show 67% of small molecules in clinical trials comprise only ring systems found in marketed drugs, which mirrors previously published findings for newly approved drugs. We also show there are approximately 450 000 unique ring systems derived from 2.24 billion molecules currently available in synthesized chemical space, and molecules in clinical trials utilize only 0.1% of this available pool. Moreover, there are fewer ring systems in drugs compared with those in clinical trials, but this is balanced by the drug ring systems being reused more often. Furthermore, systematic changes of up to two atoms on existing drug and clinical trial ring systems give a set of 3902 future clinical trial ring systems, which are predicted to cover approximately 50% of the novel ring systems entering clinical trials.
Collapse
Affiliation(s)
| | | | | | - Malcolm MacCoss
- Bohicket Pharma Consulting Limited Liability Company, 2556 Seabrook Island Road, Seabrook Island, South Carolina29455, United States
| | | |
Collapse
|
19
|
Abstract
![]()
One application area
of computational methods in drug discovery
is the automated design of small molecules. Despite the large number
of publications describing methods and their application in both retrospective
and prospective studies, there is a lack of agreement on terminology
and key attributes to distinguish these various systems. We introduce
Automated Chemical Design (ACD) Levels to clearly define the level
of autonomy along the axes of ideation and decision making. To fully
illustrate this framework, we provide literature exemplars and place
some notable methods and applications into the levels. The ACD framework
provides a common language for describing automated small molecule
design systems and enables medicinal chemists to better understand
and evaluate such systems.
Collapse
Affiliation(s)
- Brian Goldman
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02139, United States
| | - Steven Kearnes
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02139, United States
| | - Trevor Kramer
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02139, United States
| | - Patrick Riley
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02139, United States
| | - W Patrick Walters
- Relay Therapeutics, 399 Binney Street, Cambridge, Massachusetts 02139, United States
| |
Collapse
|
20
|
Goodman JM, Blanke G, Kraut H. Analysing a billion reactions with the RInChI. PURE APPL CHEM 2022. [DOI: 10.1515/pac-2021-2008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Abstract
The RInChI is a canonical identifier for reactions which is widely used in reaction databases. It can be used to handle large collections of reactions and to link information from diverse data sources. How much information can it handle? Studies of the SAVI database, which contains more than a billion reactions, demonstrate that the RInChI is useful in analysing such a large collection of molecular data, and the reduced form of the Web-RInChIKey contains enough information to be an effective differentiator of reactions. Issues of NH tautomerism and stereochemistry are handled effectively. The RInChI illustrates that some of the properties of the algorithmically-generated SAVI database differ from SPRESI, which is a collection of experimental data. The RInChI has different properties to Reaction SMILES and both approaches provide useful and distinct information. We recommend that the RInChI be included in data models for reactions.
Collapse
Affiliation(s)
- Jonathan M. Goodman
- Yusuf Hamied Department of Chemistry , Lensfield Road , Cambridge , CB2 1EW , UK
| | - Gerd Blanke
- StructurePendium Technologies GmbH , Reulsbergweg 5, D-45257 Essen , Germany
| | - Hans Kraut
- InfoChem GmbH , Aschauerstr. 30, D-81549 Munich , Germany
| |
Collapse
|
21
|
Venkatraman V, Colligan TH, Lesica GT, Olson DR, Gaiser J, Copeland CJ, Wheeler TJ, Roy A. Drugsniffer: An Open Source Workflow for Virtually Screening Billions of Molecules for Binding Affinity to Protein Targets. Front Pharmacol 2022; 13:874746. [PMID: 35559261 PMCID: PMC9086895 DOI: 10.3389/fphar.2022.874746] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/12/2022] [Accepted: 04/04/2022] [Indexed: 11/13/2022] Open
Abstract
The SARS-CoV2 pandemic has highlighted the importance of efficient and effective methods for identification of therapeutic drugs, and in particular has laid bare the need for methods that allow exploration of the full diversity of synthesizable small molecules. While classical high-throughput screening methods may consider up to millions of molecules, virtual screening methods hold the promise of enabling appraisal of billions of candidate molecules, thus expanding the search space while concurrently reducing costs and speeding discovery. Here, we describe a new screening pipeline, called drugsniffer, that is capable of rapidly exploring drug candidates from a library of billions of molecules, and is designed to support distributed computation on cluster and cloud resources. As an example of performance, our pipeline required ∼40,000 total compute hours to screen for potential drugs targeting three SARS-CoV2 proteins among a library of ∼3.7 billion candidate molecules.
Collapse
Affiliation(s)
- Vishwesh Venkatraman
- Department of Chemistry, Norwegian University of Science and Technology, Trondheim, Norway
| | - Thomas H. Colligan
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - George T. Lesica
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - Daniel R. Olson
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - Jeremiah Gaiser
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - Conner J. Copeland
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - Travis J. Wheeler
- Department of Computer Science, University of Montana, Missoula, MT, United States
| | - Amitava Roy
- Department of Computer Science, University of Montana, Missoula, MT, United States
- Rocky Mountain Laboratories, Bioinformatics and Computational Biosciences Branch, Office of Cyber Infrastructure and Computational Biology, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, United States
| |
Collapse
|
22
|
Schadow G, Borodina YV, Delannée V, Ihlenfeldt WD, Godfrey AG, Nicklaus MC. Reaction SPL – extension of a public document markup standard to chemical reactions. PURE APPL CHEM 2022. [PMCID: PMC9189732 DOI: 10.1515/pac-2021-2011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
There are numerous formats and data models for describing reaction-related data. However, each offers only a limited coverage of the multitude of information that can be of interest to a broad user base in the context of chemical reactions. Structured Product Labeling (SPL) is a robust yet fairly light public XML document standard. It uses a highly generic but usefully refinable data schema, which is, like a language, highly expressive. We are therefore presenting an extension of SPL to chemical reactions (“Reaction SPL”). This extension is designed to support chemical manufacturing processes, which include as a minimum the chemical reaction and the procedures and conditions to run it. We provide an overview of the SPL reaction specification structures followed by some examples of documents with reaction data: predicted single-step reactions, a two-step synthesis, an enzymatic reaction, an example how to represent a reaction center, a patent, and a fully annotated reaction with by-products. Special attention is given to a mechanism for atom-atom mapping of reactions as well as to the possibility to integrate Reaction SPL with laboratory automation equipment, in particular automated synthesis devices.
Collapse
Affiliation(s)
| | | | | | | | - Alexander G. Godfrey
- National Center for Advancing Translational Sciences, NIH , Rockville , MD , USA
| | | |
Collapse
|
23
|
Zahoránszky-Kőhalmi G, Lysov N, Vorontcov I, Wang J, Soundararajan J, Metaxotos D, Mathew B, Sarosh R, Michael SG, Godfrey AG. Algorithm for the Pruning of Synthesis Graphs. J Chem Inf Model 2022; 62:2226-2238. [PMID: 35438992 PMCID: PMC9093600 DOI: 10.1021/acs.jcim.1c01202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Synthesis route planning is in the core of chemical intelligence that will power the autonomous chemistry platforms. In this task, we rely on algorithms to generate possible synthesis routes with the help of retro- and forward-synthetic approaches. Generated synthesis routes can be merged into a synthesis graph which represents theoretical pathways to the target molecule. However, it is often required to modify a synthesis graph due to typical constraints. These constraints might include "undesirable substances", e.g., an intermediate that the chemist does not favor or substances that might be toxic. Consequently, we need to prune the synthesis graph by the elimination of such undesirable substances. Synthesis graphs can be represented as directed (not necessarily acyclic) bipartite graphs, and the pruning of such graphs in the light of a set of undesirable substances has been an open question. In this study, we present the Synthesis Graph Pruning (SGP) algorithm that addresses this question. The input to the SGP algorithm is a synthesis graph and a set of undesirable substances. Furthermore, information for substances is provided as metadata regarding their availability from the inventory. The SGP algorithm operates with a simple local rule set, in order to determine which nodes and edges need to be eliminated from the synthesis graph. In this study, we present the SGP algorithm in detail and provide several case studies that demonstrate the operation of the SGP algorithm. We believe that the SGP algorithm will be an essential component of computer aided synthesis planning.
Collapse
Affiliation(s)
| | - Nikita Lysov
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Ilia Vorontcov
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Jeffrey Wang
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Jeyaraman Soundararajan
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Dimitrios Metaxotos
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Biju Mathew
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Rafat Sarosh
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Samuel G Michael
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| | - Alexander G Godfrey
- National Center for Advancing Translational Sciences, Rockville, Maryland 20850, United States
| |
Collapse
|
24
|
Abstract
Designing new medicines more cheaply and quickly is tightly linked to the quest of exploring chemical space more widely and efficiently. Chemical space is monumentally large, but recent advances in computer software and hardware have enabled researchers to navigate virtual chemical spaces containing billions of chemical structures. This review specifically concerns collections of many millions or even billions of enumerated chemical structures as well as even larger chemical spaces that are not fully enumerated. We present examples of chemical libraries and spaces and the means used to construct them, and we discuss new technologies for searching huge libraries and for searching combinatorially in chemical space. We also cover space navigation techniques and consider new approaches to de novo drug design and the impact of the "autonomous laboratory" on synthesis of designed compounds. Finally, we summarize some other challenges and opportunities for the future.
Collapse
Affiliation(s)
- Wendy A Warr
- Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Crewe, Cheshire CW4 7HZ, United Kingdom
| | - Marc C Nicklaus
- NCI, NIH, CADD Group, NCI-Frederick, Frederick, Maryland 21702, United States
| | - Christos A Nicolaou
- Discovery Chemistry, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States
| | - Matthias Rarey
- Universität Hamburg, ZBH Center for Bioinformatics, 20146 Hamburg, Germany
| |
Collapse
|
25
|
Affiliation(s)
- Joel Wahl
- Scientific Computing Drug Discovery, Idorsia Pharmaceuticals Ltd., Hegenheimermattweg 91, CH-4123 Allschwil, Switzerland
| | - Thomas Sander
- Scientific Computing Drug Discovery, Idorsia Pharmaceuticals Ltd., Hegenheimermattweg 91, CH-4123 Allschwil, Switzerland
| |
Collapse
|
26
|
Bender BJ, Gahbauer S, Luttens A, Lyu J, Webb CM, Stein RM, Fink EA, Balius TE, Carlsson J, Irwin JJ, Shoichet BK. A practical guide to large-scale docking. Nat Protoc 2021; 16:4799-832. [PMID: 34561691 DOI: 10.1038/s41596-021-00597-z] [Citation(s) in RCA: 154] [Impact Index Per Article: 51.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2021] [Accepted: 06/22/2021] [Indexed: 02/08/2023]
Abstract
Structure-based docking screens of large compound libraries have become common in early drug and probe discovery. As computer efficiency has improved and compound libraries have grown, the ability to screen hundreds of millions, and even billions, of compounds has become feasible for modest-sized computer clusters. This allows the rapid and cost-effective exploration and categorization of vast chemical space into a subset enriched with potential hits for a given target. To accomplish this goal at speed, approximations are used that result in undersampling of possible configurations and inaccurate predictions of absolute binding energies. Accordingly, it is important to establish controls, as are common in other fields, to enhance the likelihood of success in spite of these challenges. Here we outline best practices and control docking calculations that help evaluate docking parameters for a given target prior to undertaking a large-scale prospective screen, with exemplification in one particular target, the melatonin receptor, where following this procedure led to direct docking hits with activities in the subnanomolar range. Additional controls are suggested to ensure specific activity for experimentally validated hit compounds. These guidelines should be useful regardless of the docking software used. Docking software described in the outlined protocol (DOCK3.7) is made freely available for academic research to explore new hits for a range of targets.
Collapse
|
27
|
Graff DE, Shakhnovich EI, Coley CW. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem Sci 2021; 12:7866-7881. [PMID: 34168840 PMCID: PMC8188596 DOI: 10.1039/d0sc06805e] [Citation(s) in RCA: 80] [Impact Index Per Article: 26.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2020] [Accepted: 04/26/2021] [Indexed: 12/13/2022] Open
Abstract
Structure-based virtual screening is an important tool in early stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (in excess of 108 molecules), so too do the resources necessary to conduct exhaustive virtual screening campaigns on these libraries. However, Bayesian optimization techniques, previously employed in other scientific discovery problems, can aid in their exploration: a surrogate structure-property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we explore the application of these techniques to computational docking datasets and assess the impact of surrogate model architecture, acquisition function, and acquisition batch size on optimization performance. We observe significant reductions in computational costs; for example, using a directed-message passing neural network we can identify 94.8% or 89.3% of the top-50 000 ligands in a 100M member library after testing only 2.4% of candidate ligands using an upper confidence bound or greedy acquisition strategy, respectively. Such model-guided searches mitigate the increasing computational costs of screening increasingly large virtual libraries and can accelerate high-throughput virtual screening campaigns with applications beyond docking.
Collapse
Affiliation(s)
- David E Graff
- Department of Chemistry and Chemical Biology, Harvard University Cambridge MA USA
| | - Eugene I Shakhnovich
- Department of Chemistry and Chemical Biology, Harvard University Cambridge MA USA
| | - Connor W Coley
- Department of Chemical Engineering, MIT Cambridge MA USA
| |
Collapse
|
28
|
Shrivastava AD, Kell DB. FragNet, a Contrastive Learning-Based Transformer Model for Clustering, Interpreting, Visualizing, and Navigating Chemical Space. Molecules 2021; 26:2065. [PMID: 33916824 PMCID: PMC8038408 DOI: 10.3390/molecules26072065] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2021] [Revised: 03/29/2021] [Accepted: 04/01/2021] [Indexed: 12/12/2022] Open
Abstract
The question of molecular similarity is core in cheminformatics and is usually assessed via a pairwise comparison based on vectors of properties or molecular fingerprints. We recently exploited variational autoencoders to embed 6M molecules in a chemical space, such that their (Euclidean) distance within the latent space so formed could be assessed within the framework of the entire molecular set. However, the standard objective function used did not seek to manipulate the latent space so as to cluster the molecules based on any perceived similarity. Using a set of some 160,000 molecules of biological relevance, we here bring together three modern elements of deep learning to create a novel and disentangled latent space, viz transformers, contrastive learning, and an embedded autoencoder. The effective dimensionality of the latent space was varied such that clear separation of individual types of molecules could be observed within individual dimensions of the latent space. The capacity of the network was such that many dimensions were not populated at all. As before, we assessed the utility of the representation by comparing clozapine with its near neighbors, and we also did the same for various antibiotics related to flucloxacillin. Transformers, especially when as here coupled with contrastive learning, effectively provide one-shot learning and lead to a successful and disentangled representation of molecular latent spaces that at once uses the entire training set in their construction while allowing "similar" molecules to cluster together in an effective and interpretable way.
Collapse
Affiliation(s)
- Aditya Divyakant Shrivastava
- Department of Computer Science and Engineering, Nirma University, Ahmedabad 382481, India;
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St., Liverpool L69 7ZB, UK
| | - Douglas B. Kell
- Department of Biochemistry and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown St., Liverpool L69 7ZB, UK
- Novo Nordisk Foundation Centre for Biosustainability, Technical University of Denmark, Building 220, Kemitorvet, 2800 Kgs Lyngby, Denmark
- Mellizyme Ltd., Liverpool Science Park, IC1, 131 Mount Pleasant, Liverpool L3 5TF, UK
| |
Collapse
|