1
|
Popov KI, Wellnitz J, Maxfield T, Tropsha A. HIt Discovery using docking ENriched by GEnerative Modeling (HIDDEN GEM): A novel computational workflow for accelerated virtual screening of ultra-large chemical libraries. Mol Inform 2024; 43:e202300207. [PMID: 37802967 PMCID: PMC11156482 DOI: 10.1002/minf.202300207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Revised: 10/03/2023] [Accepted: 10/06/2023] [Indexed: 10/08/2023]
Abstract
Recent rapid expansion of make-on-demand, purchasable, chemical libraries comprising dozens of billions or even trillions of molecules has challenged the efficient application of traditional structure-based virtual screening methods that rely on molecular docking. We present a novel computational methodology termed HIDDEN GEM (HIt Discovery using Docking ENriched by GEnerative Modeling) that greatly accelerates virtual screening. This workflow uniquely integrates machine learning, generative chemistry, massive chemical similarity searching and molecular docking of small, selected libraries in the beginning and the end of the workflow. For each target, HIDDEN GEM nominates a small number of top-scoring virtual hits prioritized from ultra-large chemical libraries. We have benchmarked HIDDEN GEM by conducting virtual screening campaigns for 16 diverse protein targets using Enamine REAL Space library comprising 37 billion molecules. We show that HIDDEN GEM yields the highest enrichment factors as compared to state of the art accelerated virtual screening methods, while requiring the least computational resources. HIDDEN GEM can be executed with any docking software and employed by users with limited computational resources.
Collapse
Affiliation(s)
- Konstantin I. Popov
- UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, USA
- These authors contributed equally: Konstantin I. Popov, James Wellnitz, Travis Maxfield
| | - James Wellnitz
- UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, USA
- These authors contributed equally: Konstantin I. Popov, James Wellnitz, Travis Maxfield
| | - Travis Maxfield
- UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, USA
- These authors contributed equally: Konstantin I. Popov, James Wellnitz, Travis Maxfield
| | - Alexander Tropsha
- UNC Eshelman School of Pharmacy, University of North Carolina, Chapel Hill, NC, USA
| |
Collapse
|
2
|
Liu K, Han Y, Gong Z, Xu H. Low-Data Drug Design with Few-Shot Generative Domain Adaptation. Bioengineering (Basel) 2023; 10:1104. [PMID: 37760206 PMCID: PMC10526055 DOI: 10.3390/bioengineering10091104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Revised: 09/04/2023] [Accepted: 09/18/2023] [Indexed: 09/29/2023] Open
Abstract
Developing new drugs for emerging diseases, such as COVID-19, is crucial for promoting public health. In recent years, the application of artificial intelligence (AI) has significantly advanced drug discovery pipelines. Generative models, such as generative adversarial networks (GANs), exhibit the potential for discovering novel drug molecules by relying on a vast number of training samples. However, for new diseases, only a few samples are typically available, posing a significant challenge to learning a generative model that produces both high-quality and diverse molecules under limited supervision. To address this low-data drug generation issue, we propose a novel molecule generative domain adaptation paradigm (Mol-GenDA), which transfers a pre-trained GAN on a large-scale drug molecule dataset to a new disease domain using only a few references. Specifically, we introduce a molecule adaptor into the GAN generator during the fine tuning, allowing the generator to reuse prior knowledge learned in pre-training to the greatest extent and maintain the quality and diversity of the generated molecules. Comprehensive downstream experiments demonstrate that Mol-GenDA can produce high-quality and diverse drug candidates. In summary, the proposed approach offers a promising solution to expedite drug discovery for new diseases, which could lead to the timely development of effective drugs to combat emerging outbreaks.
Collapse
Affiliation(s)
- Ke Liu
- College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China;
| | - Yuqiang Han
- College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China;
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China;
| | - Zhichen Gong
- ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou 311200, China;
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Hongxia Xu
- Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou 310027, China
| |
Collapse
|
3
|
Thomas M, Bender A, de Graaf C. Integrating structure-based approaches in generative molecular design. Curr Opin Struct Biol 2023; 79:102559. [PMID: 36870277 DOI: 10.1016/j.sbi.2023.102559] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2022] [Revised: 01/23/2023] [Accepted: 01/31/2023] [Indexed: 03/06/2023]
Abstract
Generative molecular design for drug discovery and development has seen a recent resurgence promising to improve the efficiency of the design-make-test-analyse cycle; by computationally exploring much larger chemical spaces than traditional virtual screening techniques. However, most generative models thus far have only utilized small-molecule information to train and condition de novo molecule generators. Here, we instead focus on recent approaches that incorporate protein structure into de novo molecule optimization in an attempt to maximize the predicted on-target binding affinity of generated molecules. We summarize these structure integration principles into either distribution learning or goal-directed optimization and for each case whether the approach is protein structure-explicit or implicit with respect to the generative model. We discuss recent approaches in the context of this categorization and provide our perspective on the future direction of the field.
Collapse
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK.
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, UK. https://twitter.com/@AndreasBenderUK
| | - Chris de Graaf
- Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG, UK. https://twitter.com/@Chris_de_Graaf
| |
Collapse
|
4
|
Danel T, Łęski J, Podlewska S, Podolak IT. Docking-based generative approaches in the search for new drug candidates. Drug Discov Today 2023; 28:103439. [PMID: 36372330 DOI: 10.1016/j.drudis.2022.103439] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 10/08/2022] [Accepted: 11/08/2022] [Indexed: 11/13/2022]
Abstract
Despite the popularity of virtual screening (VS) of existing compound libraries, the search for new potential drug candidates also takes advantage of generative protocols, where new compound suggestions are enumerated using various algorithms. To increase the activity potency of generative approaches, they have recently been coupled with molecular docking, a leading methodology of structure-based drug design (SBDD). In this review, we summarize progress since docking-based generative models emerged. We propose a new taxonomy for these methods and discuss their importance for the field of computer-aided drug design (CADD). In addition, we discuss the most promising directions for the further development of generative protocols coupled with docking.
Collapse
Affiliation(s)
- Tomasz Danel
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland.
| | - Jan Łęski
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
| | - Sabina Podlewska
- Maj Institute of Pharmacology, Polish Academy of Sciences, Department of Medicinal Chemistry, 31-343 Kraków, Smętna Street 12, Poland
| | - Igor T Podolak
- Faculty of Mathematics and Computer Science, Jagiellonian University, 6 Łojasiewicza Street, 30-348 Kraków, Poland
| |
Collapse
|
5
|
PETrans: De Novo Drug Design with Protein-Specific Encoding Based on Transfer Learning. Int J Mol Sci 2023; 24:ijms24021146. [PMID: 36674658 PMCID: PMC9865828 DOI: 10.3390/ijms24021146] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/02/2022] [Revised: 12/29/2022] [Accepted: 01/04/2023] [Indexed: 01/11/2023] Open
Abstract
Recent years have seen tremendous success in the design of novel drug molecules through deep generative models. Nevertheless, existing methods only generate drug-like molecules, which require additional structural optimization to be developed into actual drugs. In this study, a deep learning method for generating target-specific ligands was proposed. This method is useful when the dataset for target-specific ligands is limited. Deep learning methods can extract and learn features (representations) in a data-driven way with little or no human participation. Generative pretraining (GPT) was used to extract the contextual features of the molecule. Three different protein-encoding methods were used to extract the physicochemical properties and amino acid information of the target protein. Protein-encoding and molecular sequence information are combined to guide molecule generation. Transfer learning was used to fine-tune the pretrained model to generate molecules with better binding ability to the target protein. The model was validated using three different targets. The docking results show that our model is capable of generating new molecules with higher docking scores for the target proteins.
Collapse
|
6
|
Durairaj J, de Ridder D, van Dijk AD. Beyond sequence: Structure-based machine learning. Comput Struct Biotechnol J 2022; 21:630-643. [PMID: 36659927 PMCID: PMC9826903 DOI: 10.1016/j.csbj.2022.12.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 12/31/2022] Open
Abstract
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
Collapse
Affiliation(s)
- Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| | - Aalt D.J. van Dijk
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
| |
Collapse
|
7
|
Sauer S, Matter H, Hessler G, Grebner C. Optimizing interactions to protein binding sites by integrating docking-scoring strategies into generative AI methods. Front Chem 2022; 10:1012507. [PMID: 36339033 PMCID: PMC9629386 DOI: 10.3389/fchem.2022.1012507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2022] [Accepted: 09/20/2022] [Indexed: 11/14/2022] Open
Abstract
The identification and optimization of promising lead molecules is essential for drug discovery. Recently, artificial intelligence (AI) based generative methods provided complementary approaches for generating molecules under specific design constraints of relevance in drug design. The goal of our study is to incorporate protein 3D information directly into generative design by flexible docking plus an adapted protein-ligand scoring function, thereby moving towards automated structure-based design. First, the protein-ligand scoring function RFXscore integrating individual scoring terms, ligand descriptors, and combined terms was derived using the PDBbind database and internal data. Next, design results for different workflows are compared to solely ligand-based reward schemes. Our newly proposed, optimal workflow for structure-based generative design is shown to produce promising results, especially for those exploration scenarios, where diverse structures fitting to a protein binding site are requested. Best results are obtained using docking followed by RFXscore, while, depending on the exact application scenario, it was also found useful to combine this approach with other metrics that bias structure generation into "drug-like" chemical space, such as target-activity machine learning models, respectively.
Collapse
Affiliation(s)
| | | | | | - Christoph Grebner
- Synthetic Molecular Design, Integrated Drug Discovery, Sanofi, Frankfurt, Germany
| |
Collapse
|
8
|
Thomas M, O’Boyle NM, Bender A, de Graaf C. Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation. J Cheminform 2022; 14:68. [PMID: 36192789 PMCID: PMC9531503 DOI: 10.1186/s13321-022-00646-z] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2022] [Accepted: 09/23/2022] [Indexed: 11/10/2022] Open
Abstract
A plethora of AI-based techniques now exists to conduct de novo molecule generation that can devise molecules conditioned towards a particular endpoint in the context of drug design. One popular approach is using reinforcement learning to update a recurrent neural network or language-based de novo molecule generator. However, reinforcement learning can be inefficient, sometimes requiring up to 105 molecules to be sampled to optimize more complex objectives, which poses a limitation when using computationally expensive scoring functions like docking or computer-aided synthesis planning models. In this work, we propose a reinforcement learning strategy called Augmented Hill-Climb based on a simple, hypothesis-driven hybrid between REINVENT and Hill-Climb that improves sample-efficiency by addressing the limitations of both currently used strategies. We compare its ability to optimize several docking tasks with REINVENT and benchmark this strategy against other commonly used reinforcement learning strategies including REINFORCE, REINVENT (version 1 and 2), Hill-Climb and best agent reminder. We find that optimization ability is improved ~ 1.5-fold and sample-efficiency is improved ~ 45-fold compared to REINVENT while still delivering appealing chemistry as output. Diversity filters were used, and their parameters were tuned to overcome observed failure modes that take advantage of certain diversity filter configurations. We find that Augmented Hill-Climb outperforms the other reinforcement learning strategies used on six tasks, especially in the early stages of training or for more difficult objectives. Lastly, we show improved performance not only on recurrent neural networks but also on a reinforcement learning stabilized transformer architecture. Overall, we show that Augmented Hill-Climb improves sample-efficiency for language-based de novo molecule generation conditioning via reinforcement learning, compared to the current state-of-the-art. This makes more computationally expensive scoring functions, such as docking, more accessible on a relevant timescale.
Collapse
Affiliation(s)
- Morgan Thomas
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
| | - Noel M. O’Boyle
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
| | - Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW UK
| | - Chris de Graaf
- Computational Chemistry, Sosei Heptares, Steinmetz Building, Granta Park, Great Abington, Cambridge, CB21 6DG UK
| |
Collapse
|
9
|
Kuntz D, Wilson AK. Machine learning, artificial intelligence, and chemistry: how smart algorithms are reshaping simulation and the laboratory. PURE APPL CHEM 2022. [DOI: 10.1515/pac-2022-0202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
Abstract
Machine learning and artificial intelligence are increasingly gaining in prominence through image analysis, language processing, and automation, to name a few applications. Machine learning is also making profound changes in chemistry. From revisiting decades-old analytical techniques for the purpose of creating better calibration curves, to assisting and accelerating traditional in silico simulations, to automating entire scientific workflows, to being used as an approach to deduce underlying physics of unexplained chemical phenomena, machine learning and artificial intelligence are reshaping chemistry, accelerating scientific discovery, and yielding new insights. This review provides an overview of machine learning and artificial intelligence from a chemist’s perspective and focuses on a number of examples of the use of these approaches in computational chemistry and in the laboratory.
Collapse
Affiliation(s)
- David Kuntz
- Department of Chemistry , University of North Texas , Denton , TX 76201 , USA
| | - Angela K. Wilson
- Department of Chemistry , Michigan State University , East Lansing , MI 48824 , USA
| |
Collapse
|
10
|
Lai TT, Kuntz D, Wilson AK. Molecular Screening and Toxicity Estimation of 260,000 Perfluoroalkyl and Polyfluoroalkyl Substances (PFASs) through Machine Learning. J Chem Inf Model 2022; 62:4569-4578. [PMID: 36154169 DOI: 10.1021/acs.jcim.2c00374] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
Abstract
Perfluoroalkyl and polyfluoroalkyl substances (PFASs) are a class of chemicals widely used in industrial applications due to their exceptional properties and stability. However, they do not readily degrade in the environment and are linked to contamination and adverse health effects in humans and wildlife. To find alternatives for the most commonly used PFAS molecules that maintain their desirable chemical properties but are not adverse to biological lifeforms, a novel approach based upon machine learning is utilized. The machine learning model is trained on an existing set of PFAS molecules to generate over 260,000 novel PFAS molecules, which we dub PFAS-AI-Gen. Using molecular descriptors with known relationships to toxicity and industrial suitability followed by molecular docking and molecular dynamics simulations, this set of molecules is screened. In this manner, increasingly complex calculations are performed only for candidate molecules that are most likely to yield the desired properties of low binding affinity toward two selected protein receptors, the human pregnane x receptor (hPXR) and peroxisome proliferator-activated receptor γ (PPAR-γ), and high industrial suitability, defined by critical micelle concentration (CMC). The selection criteria of low binding affinity and high industrial suitability are relative to the popular PFAS alternative GenX. hPXR and PPAR-γ are selected as they are PFAS targets and facilitate a variety of functions, such as drug metabolism and glucose regulation, respectively. Through this approach, 22 promising new PFAS substitutes that may warrant experimental investigation are identified. This integrated approach of molecular screening and toxicity estimation may be applicable to other chemical classes.
Collapse
Affiliation(s)
- Thanh T Lai
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48823, United States
| | - David Kuntz
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48823, United States
| | - Angela K Wilson
- Department of Chemistry, Michigan State University, East Lansing, Michigan 48823, United States
| |
Collapse
|
11
|
Shimazaki T, Tachikawa M. Collaborative Approach between Explainable Artificial Intelligence and Simplified Chemical Interactions to Explore Active Ligands for Cyclin-Dependent Kinase 2. ACS OMEGA 2022; 7:10372-10381. [PMID: 35382271 PMCID: PMC8973106 DOI: 10.1021/acsomega.1c06976] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/14/2021] [Accepted: 03/09/2022] [Indexed: 05/13/2023]
Abstract
To improve virtual screening for drug discovery, we present a collaborative approach between explainable artificial intelligence (AI) and simplified chemical interaction scores to efficiently search for active ligands bound to the target receptor. In particular, we focus on cyclin-dependent kinase 2 (CDK2), which is well known as a cancer target protein. Docking simulation alone is insufficient to distinguish active ligands from decoy molecules. To identify active ligands, in this paper, machine learning is employed together with scoring functions that simplify the screened Coulomb and Lennard-Jones interactions between the ligands and residues of the target receptor. We demonstrate that these simplified interaction scores can significantly improve the classification ability of machine learning models. We also demonstrate that explainable AI together with the simplified scoring method can highlight the important residues of CDK2 for recognizing active ligands.
Collapse
Affiliation(s)
- Tomomi Shimazaki
- Graduate
School of Nanobioscience, Yokohama City
University, 22-2 Seto, Yokohama, Kanagawa 236-0027, Japan
| | - Masanori Tachikawa
- Graduate
School of Data Science, Yokohama City University, 22-2, Seto, Yokohama, Kanagawa 236-0027, Japan
| |
Collapse
|