1
|
The present state and challenges of active learning in drug discovery. Drug Discov Today 2024; 29:103985. [PMID: 38642700 DOI: 10.1016/j.drudis.2024.103985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 04/08/2024] [Accepted: 04/15/2024] [Indexed: 04/22/2024]
Abstract
Active learning (AL) is an iterative feedback process that efficiently identifies valuable data within vast chemical space, even with limited labeled data. This characteristic makes it a valuable approach to the ongoing challenges of drug discovery, such as the ever-expanding exploration space and the scarcity of labeled data. Consequently, AL is gaining prominence in the field of drug development. In this paper, we comprehensively review the application of AL at all stages of drug discovery, including compound-target interaction prediction, virtual screening, molecular generation and optimization, and molecular property prediction. Additionally, we discuss the challenges and prospects associated with the current applications of AL in drug discovery.
|
2
|
The Histone Deacetylase Family: Structural Features and Application of Combined Computational Methods. Pharmaceuticals (Basel) 2024; 17:620. [PMID: 38794190 PMCID: PMC11124352 DOI: 10.3390/ph17050620] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 05/03/2024] [Accepted: 05/08/2024] [Indexed: 05/26/2024] Open
Abstract
Histone deacetylases (HDACs) are crucial in gene transcription, removing acetyl groups from histones. They also influence the deacetylation of non-histone proteins, contributing to the regulation of various biological processes. Thus, HDACs play pivotal roles in various diseases, including cancer, neurodegenerative disorders, and inflammatory conditions, highlighting their potential as therapeutic targets. This paper reviews the structure and function of the four classes of human HDACs. While four HDAC inhibitors are currently available for treating hematological malignancies, numerous others are undergoing clinical trials. However, their non-selective toxicity necessitates ongoing research into safer and more efficient class-selective or isoform-selective inhibitors. Computational methods have aided the discovery of HDAC inhibitors with the desired potency and/or selectivity. These methods include ligand-based approaches, such as scaffold hopping, pharmacophore modeling, three-dimensional quantitative structure-activity relationships, and structure-based virtual screening (molecular docking). Moreover, recent developments in the field of molecular dynamics simulations, combined with Poisson-Boltzmann/molecular mechanics generalized Born surface area techniques, have improved the prediction of ligand binding affinity. In this review, we delve into the ways in which these methods have contributed to designing and identifying HDAC inhibitors.
|
3
|
Machine-Guided Discovery of Acrylate Photopolymer Compositions. ACS APPLIED MATERIALS & INTERFACES 2024; 16:17992-18000. [PMID: 38534124 PMCID: PMC11009904 DOI: 10.1021/acsami.4c00759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Revised: 03/14/2024] [Accepted: 03/15/2024] [Indexed: 03/28/2024]
Abstract
Additive manufacturing (AM) can be advanced by the diverse characteristics offered by thermoplastic and thermoset polymers and the further benefits of copolymerization. However, the availability of suitable polymeric materials for AM is limited and may not always be ideal for specific applications. Additionally, the extensive number of potential monomers and their combinations make experimental determination of resin compositions extremely time-consuming and costly. To overcome these challenges, we develop an active learning (AL) approach to effectively choose compositions in a ternary monomer space ranging from rigid to elastomeric. Our AL algorithm dynamically suggests monomer composition ratios for the subsequent round of testing, allowing us to efficiently build a robust machine learning (ML) model capable of predicting polymer properties, including Young's modulus, peak stress, ultimate strain, and Shore A hardness based on composition while minimizing the number of experiments. As a demonstration of the effectiveness of our approach, we use the ML model to drive material selection for a specific property, namely, Young's modulus. The results indicate that the ML model can be used to select material compositions within at least 10% of a targeted value of Young's modulus. We then use the materials designed by the ML model to 3D print a multimaterial "hand" with soft "skin" and rigid "bones". This work presents a promising tool for enabling informed AM material selection tailored to user specifications and accelerating material discovery using a limited monomer space.
|
4
|
Streamlining pipeline efficiency: a novel model-agnostic technique for accelerating conditional generative and virtual screening pipelines. Sci Rep 2023; 13:21069. [PMID: 38030689 PMCID: PMC10686981 DOI: 10.1038/s41598-023-42952-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Accepted: 09/16/2023] [Indexed: 12/01/2023] Open
Abstract
The discovery of potential therapeutic agents for life-threatening diseases has become a significant problem. There is a requirement for fast and accurate methods to identify drug-like molecules that can be used as potential candidates for novel targets. Existing techniques like high-throughput screening and virtual screening are time-consuming and inefficient. Traditional molecule generation pipelines are more efficient than virtual screening but use time-consuming docking software. Such docking functions can be emulated using Machine Learning models with comparable accuracy and faster execution times. However, we find that when pre-trained machine learning models are employed in generative pipelines as oracles, they suffer from model degradation in areas where data is scarce. In this study, we propose an active learning-based model that can be added as a supplement to enhanced molecule generation architectures. The proposed method uses uncertainty sampling on the molecules created by the generator model and dynamically learns as the generator samples molecules from different regions of the chemical space. The proposed framework can generate molecules with high binding affinity with [Formula: see text]a 70% improvement in runtime compared to the baseline model by labeling only [Formula: see text]30% of molecules compared to the baseline oracle.
|
5
|
Meta-learning for transformer-based prediction of potent compounds. Sci Rep 2023; 13:16145. [PMID: 37752164 PMCID: PMC10522638 DOI: 10.1038/s41598-023-43046-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/22/2023] [Accepted: 09/18/2023] [Indexed: 09/28/2023] Open
Abstract
For many machine learning applications in drug discovery, only limited amounts of training data are available. This typically applies to compound design and activity prediction and often restricts machine learning, especially deep learning. For low-data applications, specialized learning strategies can be considered to limit required training data. Among these is meta-learning that attempts to enable learning in low-data regimes by combining outputs of different models and utilizing meta-data from these predictions. However, in drug discovery settings, meta-learning is still in its infancy. In this study, we have explored meta-learning for the prediction of potent compounds via generative design using transformer models. For different activity classes, meta-learning models were derived to predict highly potent compounds from weakly potent templates in the presence of varying amounts of fine-tuning data and compared to other transformers developed for this task. Meta-learning consistently led to statistically significant improvements in model performance, in particular, when fine-tuning data were limited. Moreover, meta-learning models generated target compounds with higher potency and larger potency differences between templates and targets than other transformers, indicating their potential for low-data compound design.
|
6
|
Active learning of enhancer and silencer regulatory grammar in photoreceptors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.21.554146. [PMID: 37662358 PMCID: PMC10473580 DOI: 10.1101/2023.08.21.554146] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/05/2023]
Abstract
Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning represents one emerging strategy to model the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model's internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
|
7
|
The Impact of Supervised Learning Methods in Ultralarge High-Throughput Docking. J Chem Inf Model 2023; 63:2267-2280. [PMID: 37036491 DOI: 10.1021/acs.jcim.2c01471] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 04/11/2023]
Abstract
Structure-based virtual screening methods are now one of the key pillars of computational drug discovery. In recent years, a series of studies have reported docking-based virtual screening campaigns of large databases ranging from hundreds to thousands of millions of compounds, further identifying novel hits after experimental validation. As these large-scale efforts are not generally accessible, machine learning-based protocols have emerged to accelerate the identification of virtual hits within an ultralarge chemical space, reaching impressive reductions in computational time. Herein, we illustrate the motivation and the problem behind the screening of large databases, providing an overview of key concepts and essential applications of machine learning-accelerated protocols, specifically concerning supervised learning methods. We also discuss where the field stands with these novel developments, highlighting possible insights for future studies.
|
8
|
Producing chemically accurate atomic Gaussian process regression models by active learning for molecular simulation. J Comput Chem 2022; 43:2084-2098. [PMID: 36165338 PMCID: PMC9828508 DOI: 10.1002/jcc.27006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2022] [Revised: 08/20/2022] [Accepted: 08/24/2022] [Indexed: 01/12/2023]
Abstract
Machine learning is becoming increasingly important in the field of force field development. Never has it been more vital to have chemically accurate machine learning potentials, as force fields become more sophisticated and their applications expand. In this study, a method for developing chemically accurate Gaussian process regression models is demonstrated for an increasingly complex set of molecules. This work extends previous work, showing the progression of the active learning technique in producing more accurate models in much less CPU time than ever before. The per-atom active learning approach has unlocked the potential to generate chemically accurate models for molecules such as peptide-capped glycine.
|
9
|
Trends and patterns in cancer nanotechnology research: A survey of NCI's caNanoLab and nanotechnology characterization laboratory. Adv Drug Deliv Rev 2022; 191:114591. [PMID: 36332724 PMCID: PMC9712232 DOI: 10.1016/j.addr.2022.114591] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/17/2022] [Revised: 10/22/2022] [Accepted: 10/27/2022] [Indexed: 11/11/2022]
Abstract
Cancer nanotechnologies possess immense potential as therapeutic and diagnostic treatment modalities and have undergone significant and rapid advancement in recent years. With this emergence, the complexities of data standards in the field are on the rise. Data sharing and reanalysis is essential to more fully utilize this complex, interdisciplinary information to answer research questions, promote the technologies, optimize use of funding, and maximize the return on scientific investments. In order to support this, various data-sharing portals and repositories have been developed which not only provide searchable nanomaterial characterization data, but also provide access to standardized protocols for synthesis and characterization of nanomaterials as well as cutting-edge publications. The National Cancer Institute's (NCI) caNanoLab is a dedicated repository for all aspects pertaining to cancer-related nanotechnology data. The searchable database provides a unique opportunity for data mining and the use of artificial intelligence and machine learning, which aims to be an essential arm of future research studies, potentially speeding the design and optimization of next-generation therapies. It also provides an opportunity to track the latest trends and patterns in nanomedicine research. This manuscript provides the first look at such trends extracted from caNanoLab and compares these to similar metrics from the NCI's Nanotechnology Characterization Laboratory, a laboratory providing preclinical characterization of cancer nanotechnologies to researchers around the globe. Together, these analyses provide insight into the emerging interests of the research community and rise of promising nanoparticle technologies.
|
10
|
KUALA: a machine learning-driven framework for kinase inhibitors repositioning. Sci Rep 2022; 12:17877. [PMID: 36284125 PMCID: PMC9595087 DOI: 10.1038/s41598-022-22324-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/01/2022] [Accepted: 10/12/2022] [Indexed: 01/20/2023] Open
Abstract
The family of protein kinases comprises more than 500 genes involved in numerous functions. Hence, their physiological dysfunction has paved the way toward drug discovery for cancer, cardiovascular, and inflammatory diseases. The high similarity of kinase binding sites plays a double role: on the one hand, it is a critical issue for selectivity; on the other hand, according to poly-pharmacology, a synergistic, controlled effect on more than one target could be of great pharmacological interest. Another important aspect of binding-site similarity is the possibility of exploiting it to reposition drugs on targets of the same family. In this study, we propose an approach called Kinase drUgs mAchine Learning frAmework (KUALA) to automatically identify kinase active ligands by using specific sets of molecular descriptors and to provide a multi-target priority score and a repurposing threshold to suggest the best repurposable and non-repurposable molecules. The comprehensive list of all kinase-ligand pairs and their scores can be found at https://github.com/molinfrimed/multi-kinases .
|
11
|
Abstract
The early stages of the drug design process involve identifying compounds with suitable bioactivities via noisy assays. As databases of possible drugs are often very large, assays can only be performed on a subset of the candidates. Selecting which assays to perform is best done within an active learning process, such as batched Bayesian optimization, and aims to reduce the number of assays that must be performed. We compare how noise affects different batched Bayesian optimization techniques and introduce a retest policy to mitigate the effect of noise. Our experiments show that batched Bayesian optimization remains effective, even when large amounts of noise are present, and that the retest policy enables more active compounds to be identified in the same number of experiments.
|
12
|
Predicting reaction conditions from limited data through active transfer learning. Chem Sci 2022; 13:6655-6668. [PMID: 35756521 PMCID: PMC9172577 DOI: 10.1039/d1sc06932b] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 12/10/2021] [Accepted: 05/10/2022] [Indexed: 12/30/2022] Open
Abstract
Transfer and active learning have the potential to accelerate the development of new chemical reactions, using prior data and new experiments to inform models that adapt to the target area of interest. This article shows how specifically tuned machine learning models, based on random forest classifiers, can expand the applicability of Pd-catalyzed cross-coupling reactions to types of nucleophiles unknown to the model. First, model transfer is shown to be effective when reaction mechanisms and substrates are closely related, even when models are trained on relatively small numbers of data points. Then, a model simplification scheme is tested and found to provide comparative predictivity on reactions of new nucleophiles that include unseen reagent combinations. Lastly, for a challenging target where model transfer only provides a modest benefit over random selection, an active transfer learning strategy is introduced to improve model predictions. Simple models, composed of a small number of decision trees with limited depths, are crucial for securing generalizability, interpretability, and performance of active transfer learning.
|
13
|
Abstract
One application area of computational methods in drug discovery is the automated design of small molecules. Despite the large number of publications describing methods and their application in both retrospective and prospective studies, there is a lack of agreement on terminology and key attributes to distinguish these various systems. We introduce Automated Chemical Design (ACD) Levels to clearly define the level of autonomy along the axes of ideation and decision making. To fully illustrate this framework, we provide literature exemplars and place some notable methods and applications into the levels. The ACD framework provides a common language for describing automated small molecule design systems and enables medicinal chemists to better understand and evaluate such systems.
|
14
|
Synergy between machine learning and natural products cheminformatics: Application to the lead discovery of anthraquinone derivatives. Chem Biol Drug Des 2022; 100:185-217. [PMID: 35490393 DOI: 10.1111/cbdd.14062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2022] [Revised: 04/15/2022] [Accepted: 04/23/2022] [Indexed: 11/28/2022]
Abstract
Cheminformatics utilizing machine learning (ML) techniques have opened up a new horizon in drug discovery. This is owing to vast chemical space expansion with rocketing numbers of expected hits and lead compounds that match druggable macromolecular targets, in particular from natural compounds. Due to the natural products' (NP) structural complexity, uniqueness, and diversity, they could occupy a bigger space in pharmaceuticals, allowing the industry to pursue more selective leads in the nanomolar range of binding affinity. ML is an essential part of each step of the drug design pipeline, such as target prediction, compound library preparation, and lead optimization. Notably, molecular mechanic and dynamic simulations, induced docking, and free energy perturbations are essential in predicting best binding poses, binding free energy values, and molecular mechanics force fields. Those applications have leveraged from artificial intelligence (AI), which decreases the computational costs required for such costly simulations. This review aimed to describe chemical space and compound libraries related to NPs. High-throughput screening utilized for fractionating NPs and high-throughput virtual screening and their strategies, and significance, are reviewed. Particular emphasis was given to AI approaches, ML tools, algorithms, and techniques, especially in drug discovery of macrocyclic compounds and approaches in computer-aided and ML-based drug discovery. Anthraquinone derivatives were discussed as a source of new lead compounds that can be developed using ML tools for diverse medicinal uses such as cancer, infectious diseases, and metabolic disorders. Furthermore, the power of principal component analysis in understanding relevant protein conformations, and molecular modeling of protein-ligand interaction were also presented. 
Apart from being a concise reference for cheminformatics, this review is a useful text to understand the application of ML-based algorithms to molecular dynamics simulation and in silico absorption, distribution, metabolism, excretion, and toxicity prediction.
|
15
|
Data-driven discovery of cardiolipin-selective small molecules by computational active learning. Chem Sci 2022; 13:4498-4511. [PMID: 35656132 PMCID: PMC9019913 DOI: 10.1039/d2sc00116k] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 01/06/2022] [Accepted: 02/24/2022] [Indexed: 12/23/2022] Open
Abstract
Subtle variations in the lipid composition of mitochondrial membranes can have a profound impact on mitochondrial function. The inner mitochondrial membrane contains the phospholipid cardiolipin, which has been demonstrated to act as a biomarker for a number of diverse pathologies. Small molecule dyes capable of selectively partitioning into cardiolipin membranes enable visualization and quantification of the cardiolipin content. Here we present a data-driven approach that combines a deep learning-enabled active learning workflow with coarse-grained molecular dynamics simulations and alchemical free energy calculations to discover small organic compounds able to selectively permeate cardiolipin-containing membranes. By employing transferable coarse-grained models we efficiently navigate the all-atom design space corresponding to small organic molecules with molecular weight less than ≈500 Da. After direct simulation of only 0.42% of our coarse-grained search space we identify molecules with considerably increased levels of cardiolipin selectivity compared to a widely used cardiolipin probe 10-N-nonyl acridine orange. Our accumulated simulation data enables us to derive interpretable design rules linking coarse-grained structure to cardiolipin selectivity. The findings are corroborated by fluorescence anisotropy measurements of two compounds conforming to our defined design rules. Our findings highlight the potential of coarse-grained representations and multiscale modelling for materials discovery and design.
|
16
|
Abstract
In chemoinformatics and medicinal chemistry, machine learning has evolved into an important approach. In recent years, increasing computational resources and new deep learning algorithms have put machine learning onto a new level, addressing previously unmet challenges in pharmaceutical research. In silico approaches for compound activity predictions, de novo design, and reaction modeling have been further advanced by new algorithmic developments and the emergence of big data in the field. Herein, novel applications of machine learning and deep learning in chemoinformatics and medicinal chemistry are reviewed. Opportunities and challenges for new methods and applications are discussed, placing emphasis on proper baseline comparisons, robust validation methodologies, and new applicability domains. Expected final online publication date for the Annual Review of Biomedical Data Science, Volume 5 is August 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.
|
17
|
Machine learning-based predictive models for identifying high active compounds against HIV-1 integrase. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2022; 33:387-402. [PMID: 35410555 DOI: 10.1080/1062936x.2022.2057588] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
Abstract
HIV-integrase is an important drug target because it catalyzes chromosomal integration of proviral DNA towards establishing latent infection. Computer-aided drug design has contributed immensely to identifying and developing novel antiviral drugs. We have developed various machine learning-based predictive models for identifying highly active compounds against HIV-integrase. Multiclass models were built using a support vector machine with reasonable accuracy on the test and evaluation sets. The developed models were evaluated by rigorous validation approaches, and the best features were selected by the Boruta method. Compared with the model developed from the full descriptor set, a slight improvement was observed with the selected descriptors. Validated models were further used for virtual screening of potential compounds from the ChemBridge library. Of the six highly active compounds predicted by the selected models, compounds 9103124, 6642917 and 9082952 showed the most reasonable binding affinity and stable interaction with HIV-integrase active-site residues Asp64, Glu152 and Asn155. This was in agreement with previous reports on the essentiality of these residues against a wide range of inhibitors. We therefore highlight the rigor of validated classification models for accurate prediction and ranking of highly active lead drugs against HIV-integrase.
|
18
|
Coarse-Grained Density Functional Theory Predictions via Deep Kernel Learning. J Chem Theory Comput 2022; 18:1129-1141. [PMID: 35020388 DOI: 10.1021/acs.jctc.1c01001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/19/2022]
Abstract
Scalable electronic predictions are critical for soft materials design. Recently, the Electronic Coarse-Graining (ECG) method was introduced to renormalize all-atom quantum chemical (QC) predictions to coarse-grained (CG) resolutions using deep neural networks (DNNs). While DNNs can learn complex representations that prove challenging for kernel-based methods, they are susceptible to overfitting and the overconfidence of uncertainty estimations. Here, we develop ECG within a GPU-accelerated Deep Kernel Learning (DKL) framework to enable CG QC predictions using range-separated hybrid density functional theory (DFT), obtaining a 10⁷ speedup relative to naive all-atom QC. By treating the predicted electronic properties as random Gaussian Processes, DKL incorporates CG mapping degeneracy by learning the distribution of electronic energies as a function of CG configuration. DKL-ECG accurately reproduces molecular orbital energies from range-separated DFT while facilitating efficient training via active learning using the uncertainties provided by DKL. We show that while active learning algorithms enable efficient sampling of a more diverse configurational space relative to random sampling, all explored query methods exhibit comparable performance for the examined system. We attribute this result to the significant overlap of the feature space and output property distributions across multiple temperatures.
|
19
|
Active Learning for Drug Design: A Case Study on the Plasma Exposure of Orally Administered Drugs. J Med Chem 2021; 64:16838-16853. [PMID: 34779199 DOI: 10.1021/acs.jmedchem.1c01683] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 01/16/2023]
Abstract
The success of artificial intelligence (AI) models has been limited by the requirement of large amounts of high-quality training data, which is just the opposite of the situation in most drug discovery pipelines. Active learning (AL) is a subfield of AI that focuses on algorithms that select the data they need to improve their models. Here, we propose a two-phase AL pipeline and apply it to the prediction of drug oral plasma exposure. In phase I, the AL-based model demonstrated a remarkable capability to sample informative data from a noisy data set, which used only 30% of the training data to yield a prediction capability with an accuracy of 0.856 on an independent test set. In phase II, the AL-based model explored a large diverse chemical space (855K samples) for experimental testing and feedback. Improved accuracy and new highly confident predictions (50K samples) were observed, which suggest that the model's applicability domain has been significantly expanded.
|
20
|
DeepReac+: deep active learning for quantitative modeling of organic chemical reactions. Chem Sci 2021; 12:14459-14472. [PMID: 34880997 PMCID: PMC8580052 DOI: 10.1039/d1sc02087k] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/14/2021] [Accepted: 10/08/2021] [Indexed: 11/21/2022] Open
Abstract
Various computational methods have been developed for quantitative modeling of organic chemical reactions; however, the lack of universality as well as the requirement of large amounts of experimental data limit their broad applications. Here, we present DeepReac+, an efficient and universal computational framework for prediction of chemical reaction outcomes and identification of optimal reaction conditions based on deep active learning. Under this framework, DeepReac is designed as a graph-neural-network-based model, which directly takes 2D molecular structures as inputs and automatically adapts to different prediction tasks. In addition, carefully-designed active learning strategies are incorporated to substantially reduce the number of necessary experiments for model training. We demonstrate the universality and high efficiency of DeepReac+ by achieving the state-of-the-art results with a minimum of labeled data on three diverse chemical reaction datasets in several scenarios. Collectively, DeepReac+ has great potential and utility in the development of AI-aided chemical synthesis. DeepReac+ is freely accessible at https://github.com/bm2-lab/DeepReac.
|
21
|
Machine Learning-Reinforced Noninvasive Biosensors for Healthcare. Adv Healthc Mater 2021; 10:e2100734. [PMID: 34165240 DOI: 10.1002/adhm.202100734] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 04/15/2021] [Revised: 06/06/2021] [Indexed: 12/12/2022]
Abstract
The emergence and development of noninvasive biosensors greatly facilitate the collection of physiological signals and the processing of health-related data. The utilization of appropriate machine learning algorithms improves the accuracy and efficiency of biosensors. Machine learning-reinforced biosensors are starting to be used in clinical practice, health monitoring, and food safety, bringing a digital revolution in healthcare. Herein, the recent advances in machine learning-reinforced noninvasive biosensors applied in healthcare are summarized. First, different types of noninvasive biosensors and the physiological signals they collect are categorized and summarized. Then the machine learning algorithms adopted in subsequent data processing are introduced and their practical applications in biosensors are reviewed. Finally, the challenges faced by machine learning-reinforced biosensors are raised, including data privacy and adaptive learning capability, and their prospects in real-time monitoring, out-of-clinic diagnosis, and onsite food safety detection are proposed.
|
22
|
Abstract
This review surveys the literature on drug discovery through ML tools and techniques that are applied in every phase of drug development to accelerate the research process and reduce the risk and expenditure in clinical trials. Machine learning techniques improve decision-making on pharmaceutical data across various applications, such as QSAR analysis, hit discovery, and de novo drug architectures, to retrieve accurate outcomes. Target validation, prognostic biomarkers, and digital pathology are considered as problem statements in this review. A main ML challenge is the inadequate interpretability of outcomes, which may restrict applications in drug discovery. In clinical trials, absolute and methodological data must be generated to tackle many puzzles in validating ML techniques, improving decision-making, promoting awareness of ML approaches, and reducing the risk of failure in drug discovery.
|
23
|
Development of Machine Learning Models for Accurately Predicting and Ranking the Activity of Lead Molecules to Inhibit PRC2 Dependent Cancer. Pharmaceuticals (Basel) 2021; 14:ph14070699. [PMID: 34358125 PMCID: PMC8308948 DOI: 10.3390/ph14070699] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/09/2021] [Revised: 07/14/2021] [Accepted: 07/14/2021] [Indexed: 12/22/2022] Open
Abstract
Disruption of epigenetic processes to eradicate tumor cells is among the most promising interventions for cancer control. EZH2 (Enhancer of zeste homolog 2), a catalytic component of polycomb repressive complex 2 (PRC2), methylates lysine 27 of histone H3 to promote transcriptional silencing and is an important drug target for controlling cancer via epigenetic processes. In the present study, we have developed various predictive models for modeling the inhibitory activity of EZH2. Binary and multiclass models were built using SVM, random forest and XGBoost methods. Rigorous validation approaches including predictiveness curve, Y-randomization and applicability domain (AD) were employed for evaluation of the developed models. Eighteen descriptors selected by the Boruta method were used for modeling. For binary classification, random forest and XGBoost achieved an accuracy of 0.80 and 0.82, respectively, on an external test set. In contrast, for multiclass models, random forest and XGBoost achieved an accuracy of 0.73 and 0.75, respectively. 500 Y-randomization runs demonstrate that the models were robust and the correlations were not by chance. Evaluation metrics from the predictiveness curve show that the selected eighteen descriptors predict active compounds with total gain (TG) of 0.79 and 0.59 for XGBoost and random forest, respectively. Validated models were further used for virtual screening and molecular docking in search of potential hits. A total of 221 compounds were commonly predicted as active with probabilities above the set threshold and also within the AD of the training set. Molecular docking revealed that three compounds have reasonable binding energy and favorable interactions with critical residues in the active site of EZH2. In conclusion, we highlighted the potential of rigorously validated models for accurately predicting and ranking the activities of lead molecules against cancer epigenetic targets. The models presented in this study represent a platform for development of EZH2 inhibitors.
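The Y-randomization check mentioned in this abstract can be sketched as follows: the activity labels of the training set are shuffled, the model is refit, and its score is compared with the score obtained on the true labels. This is a minimal sketch under assumptions, not the authors' protocol: the single synthetic descriptor, the 1-nearest-neighbour model, and the 100 runs (the abstract reports 500) are illustrative choices.

```python
import random

def predict_1nn(train_x, train_y, x):
    """Predict the label of x from its nearest training example."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(predict_1nn(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_y)

def y_randomization(train_x, train_y, test_x, test_y, n_runs, seed=0):
    """Refit on shuffled training labels and score against the true test labels."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = train_y[:]
        rng.shuffle(shuffled)  # break the descriptor-activity link
        scores.append(accuracy(train_x, shuffled, test_x, test_y))
    return scores

# Synthetic data: one descriptor, active (1) when the descriptor exceeds 0.5.
rng = random.Random(42)
xs = [rng.random() for _ in range(400)]
ys = [int(x > 0.5) for x in xs]
train_x, train_y, test_x, test_y = xs[:300], ys[:300], xs[300:], ys[300:]

true_score = accuracy(train_x, train_y, test_x, test_y)
random_scores = y_randomization(train_x, train_y, test_x, test_y, n_runs=100)
mean_random = sum(random_scores) / len(random_scores)
# Empirical p-value: share of randomized runs that match or beat the true model.
p_value = sum(s >= true_score for s in random_scores) / len(random_scores)
print(f"true={true_score:.2f} randomized mean={mean_random:.2f} p={p_value:.2f}")
```

If the randomized models scored as well as the true model, the original correlation would be suspect; a large gap suggests the model is not fitting chance structure.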
|
24
|
Combining generative artificial intelligence and on-chip synthesis for de novo drug design. SCIENCE ADVANCES 2021; 7:eabg3338. [PMID: 34117066 PMCID: PMC8195470 DOI: 10.1126/sciadv.abg3338] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 12/27/2020] [Accepted: 04/23/2021] [Indexed: 05/24/2023]
Abstract
Automating the molecular design-make-test-analyze cycle accelerates hit and lead finding for drug discovery. Using deep learning for molecular design and a microfluidics platform for on-chip chemical synthesis, liver X receptor (LXR) agonists were generated from scratch. The computational pipeline was tuned to explore the chemical space of known LXRα agonists and generate novel molecular candidates. To ensure compatibility with automated on-chip synthesis, the chemical space was confined to the virtual products obtainable from 17 one-step reactions. Twenty-five de novo designs were successfully synthesized in flow. In vitro screening of the crude reaction products revealed 17 (68%) hits, with up to 60-fold LXR activation. The batch resynthesis, purification, and retesting of 14 of these compounds confirmed that 12 of them were potent LXR agonists. These results support the suitability of the proposed design-make-test-analyze framework as a blueprint for automated drug design with artificial intelligence and miniaturized bench-top synthesis.
|
25
|
Evaluation of Categorical Matrix Completion Algorithms: Towards Improved Active Learning for Drug Discovery. Bioinformatics 2021; 37:3538-3545. [PMID: 33983377 PMCID: PMC8545350 DOI: 10.1093/bioinformatics/btab322] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2020] [Revised: 04/05/2021] [Accepted: 04/29/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION: High throughput and high content screening are extensively used to determine the effect of small molecule compounds and other potential therapeutics upon particular targets as part of the early drug development process. However, screening is typically used to find compounds that have a desired effect but not to identify potential undesirable side effects. This is because the size of the search space precludes measuring the potential effect of all compounds on all targets. Active machine learning has been proposed as a solution to this problem. RESULTS: In this article, we describe an improved imputation method, Impute By Committee, for completion of matrices containing categorical values. We compare this method to existing approaches in the context of modeling the effects of many compounds on many targets using latent similarities between compounds and conditions. We also compare these methods for the task of driving active learning in well-characterized settings for synthetic and real datasets. Our new approach performed the best overall both in the accuracy of matrix completion itself and in the number of experiments needed to train an accurate predictive model compared to random selection of experiments. We further improved upon the performance of our new method by developing an adaptive switching strategy for active learning that iteratively chooses between different matrix completion methods. AVAILABILITY: A Reproducible Research Archive containing all data and code will be made available upon acceptance at http://murphylab.cbd.cmu.edu/software. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
|
26
|
Abstract
Recent advances in computer hardware and software have led to a revolution in deep neural networks that has impacted fields ranging from language translation to computer vision. Deep learning has also impacted a number of areas in drug discovery, including the analysis of cellular images and the design of novel routes for the synthesis of organic molecules. While work in these areas has been impactful, a complete review of the applications of deep learning in drug discovery would be beyond the scope of a single Account. In this Account, we will focus on two key areas where deep learning has impacted molecular design: the prediction of molecular properties and the de novo generation of suggestions for new molecules. One of the most significant advances in the development of quantitative structure-activity relationships (QSARs) has come from the application of deep learning methods to the prediction of the biological activity and physical properties of molecules in drug discovery programs. Rather than employing the expert-derived chemical features typically used to build predictive models, researchers are now using deep learning to develop novel molecular representations. These representations, coupled with the ability of deep neural networks to uncover complex, nonlinear relationships, have led to state-of-the-art performance. While deep learning has changed the way that many researchers approach QSARs, it is not a panacea. As with any other machine learning task, the design of predictive models is dependent on the quality, quantity, and relevance of available data. Seemingly fundamental issues, such as optimal methods for creating a training set, are still open questions for the field. Another critical area that is still the subject of multiple research efforts is the development of methods for assessing the confidence in a model. Deep learning has also contributed to a renaissance in the application of de novo molecule generation. Rather than relying on manually defined heuristics, deep learning methods learn to generate new molecules based on sets of existing molecules. Techniques that were originally developed for areas such as image generation and language translation have been adapted to the generation of molecules. These deep learning methods have been coupled with the predictive models described above and are being used to generate new molecules with specific predicted biological activity profiles. While these generative algorithms appear promising, there have been only a few reports on the synthesis and testing of molecules based on designs proposed by generative models. The evaluation of the diversity, quality, and ultimate value of molecules produced by generative models is still an open question. While the field has produced a number of benchmarks, it has yet to agree on how one should ultimately assess molecules "invented" by an algorithm.
|
27
|
Practical Chemogenomic Modeling and Molecule Discovery Strategies Unveiled by Active Learning. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11533-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022] Open
|
28
|
Advanced machine-learning techniques in drug discovery. Drug Discov Today 2020; 26:769-777. [PMID: 33290820 DOI: 10.1016/j.drudis.2020.12.003] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 08/24/2020] [Revised: 11/16/2020] [Accepted: 12/02/2020] [Indexed: 01/20/2023]
Abstract
The popularity of machine learning (ML) across drug discovery continues to grow, yielding impressive results. As the use of ML techniques increases, their limitations become apparent. Such limitations include the need for big data, sparsity in data, and a lack of interpretability. It has also become apparent that the techniques are not truly autonomous, requiring retraining even post deployment. In this review, we detail the use of advanced techniques to circumvent these challenges, with examples drawn from drug discovery and allied disciplines. In addition, we present emerging techniques and their potential role in drug discovery. The techniques presented herein are anticipated to expand the applicability of ML in drug discovery.
|
29
|
Deep learning of pharmacogenomics resources: moving towards precision oncology. Brief Bioinform 2020; 21:2066-2083. [PMID: 31813953 PMCID: PMC7711267 DOI: 10.1093/bib/bbz144] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 07/05/2019] [Revised: 08/22/2019] [Accepted: 10/18/2019] [Indexed: 12/13/2022] Open
Abstract
The recent accumulation of cancer genomic data provides an opportunity to understand how a tumor's genomic characteristics can affect its responses to drugs. This field, called pharmacogenomics, is a key area in the development of precision oncology. Deep learning (DL) methodology has emerged as a powerful technique to characterize and learn from rapidly accumulating pharmacogenomics data. We introduce the fundamentals and typical model architectures of DL. We review the use of DL in classification of cancers and cancer subtypes (diagnosis and treatment stratification of patients), prediction of drug response and drug synergy for individual tumors (treatment prioritization for a patient), drug repositioning and discovery and the study of mechanism/mode of action of treatments. For each topic, we summarize current genomics and pharmacogenomics data resources such as pan-cancer genomics data for cancer cell lines (CCLs) and tumors, and systematic pharmacologic screens of CCLs. By revisiting the published literature, including our in-house analyses, we demonstrate the unprecedented capability of DL enabled by rapid accumulation of data resources to decipher complex drug response patterns, thus potentially improving cancer medicine. Overall, this review provides an in-depth summary of state-of-the-art DL methods and up-to-date pharmacogenomics resources and future opportunities and challenges to realize the goal of precision oncology.
|
30
|
Abstract
By combining metal nodes with organic linkers we can potentially synthesize millions of possible metal-organic frameworks (MOFs). The fact that we have so many materials opens many exciting avenues but also creates new challenges. We simply have too many materials to be processed using conventional, brute-force methods. In this review, we show that having so many materials allows us to use big-data methods as a powerful technique to study these materials and to discover complex correlations. The first part of the review gives an introduction to the principles of big-data science. We show how to select appropriate training sets, survey approaches that are used to represent these materials in feature space, and review different learning architectures, as well as evaluation and interpretation strategies. In the second part, we review how the different approaches of machine learning have been applied to porous materials. In particular, we discuss applications in the field of gas storage and separation, the stability of these materials, their electronic properties, and their synthesis. Given the increasing interest of the scientific community in machine learning, we expect this list to rapidly expand in the coming years.
|
31
|
Machine learning-driven new material discovery. NANOSCALE ADVANCES 2020; 2:3115-3130. [PMID: 36134280 PMCID: PMC9419423 DOI: 10.1039/d0na00388c] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 05/13/2020] [Accepted: 06/22/2020] [Indexed: 05/12/2023]
Abstract
New materials can bring about tremendous progress in technology and applications. However, the commonly used trial-and-error method cannot meet the current need for new materials. Now, a newly proposed idea of using machine learning to explore new materials is becoming popular. In this paper, we review this research paradigm of applying machine learning in material discovery, including data preprocessing, feature engineering, machine learning algorithms and cross-validation procedures. Furthermore, we propose to assist traditional DFT calculations with machine learning for material discovery. Many experiments and literature reports have shown the great effects and prospects of this idea. It is currently showing its potential and advantages in property prediction, material discovery, inverse design, corrosion detection and many other aspects of life.
|
32
|
A weighted ensemble-based active learning model to label microarray data. Med Biol Eng Comput 2020; 58:2427-2441. [PMID: 32770460 DOI: 10.1007/s11517-020-02238-1] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/12/2019] [Accepted: 07/26/2020] [Indexed: 10/23/2022]
Abstract
Classification of cancerous genes from microarray data is an important research area in bioinformatics. Large amounts of microarray data are available, but it is very costly to label them. This paper proposes an active learning model, a semi-supervised classification approach, to label the microarray data so that predictions can be made even with a smaller amount of labeled data. Initially, a pool of unlabeled instances is given from which some instances are randomly chosen for labeling. Successive selection of instances to be labeled from the unlabeled pool is determined by selection algorithms. The proposed method is devised following an ensemble approach to combine the decisions of three classifiers in order to arrive at a consensus which provides a more accurate prediction of the class label to ensure that each individual classifier learns in an uncorrelated manner. Our method combines the heuristic techniques used by an active learning algorithm to choose training samples with the multiple learning paradigm attained by an ensemble to optimize the search space by choosing efficiently from an already sparse learning pool. On evaluating the proposed method on 10 microarray datasets, we achieve performance which is comparable with state-of-the-art methods. The code and datasets are given at https://github.com/anuran-Chakraborty/Active-learning. Graphical abstract: Flowchart of the proposed ensemble-based active learning framework.
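One common way to turn the decisions of a committee of classifiers into a selection rule is vote entropy: the instances on which the members disagree most are sent for labeling first. The sketch below assumes this unweighted rule purely for illustration; the abstract does not specify the selection algorithm, the paper's ensemble is weighted, and the votes shown are hypothetical.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Disagreement among committee members for one instance (0 = unanimous)."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def select_queries(committee_votes, batch_size):
    """Return indices of the pool instances the committee disagrees on most."""
    ranked = sorted(range(len(committee_votes)),
                    key=lambda i: vote_entropy(committee_votes[i]),
                    reverse=True)
    return ranked[:batch_size]

# Hypothetical votes of three classifiers on five unlabeled samples.
committee_votes = [
    ("tumor", "tumor", "tumor"),
    ("tumor", "normal", "tumor"),
    ("normal", "normal", "normal"),
    ("normal", "tumor", "normal"),
    ("tumor", "tumor", "tumor"),
]
queries = select_queries(committee_votes, batch_size=2)
print(queries)
```

Only the two samples with split votes are selected; unanimous samples carry no disagreement and are left in the pool.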
|
33
|
Active learning efficiently converges on rational limits of toxicity prediction and identifies patterns for molecule design. Computational Toxicology 2020. [DOI: 10.1016/j.comtox.2020.100129] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/29/2023]
|
34
|
Practical considerations for active machine learning in drug discovery. DRUG DISCOVERY TODAY. TECHNOLOGIES 2020; 32-33:73-79. [PMID: 33386097 DOI: 10.1016/j.ddtec.2020.06.001] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 04/15/2020] [Revised: 06/01/2020] [Accepted: 06/10/2020] [Indexed: 02/01/2023]
Abstract
Active machine learning enables the automated selection of the most valuable next experiments to improve predictive modelling and hasten active retrieval in drug discovery. Although a long established theoretical concept and introduced to drug discovery approximately 15 years ago, the deployment of active learning technology in the discovery pipelines across academia and industry remains slow. With the recent re-discovered enthusiasm for artificial intelligence as well as improved flexibility of laboratory automation, active learning is expected to surge and become a key technology for molecular optimizations. This review recapitulates key findings from previous active learning studies to highlight the challenges and opportunities of applying adaptive machine learning to drug discovery. Specifically, considerations regarding implementation, infrastructural integration, and expected benefits are discussed. By focusing on these practical aspects of active learning, this review aims at providing insights for scientists planning to implement active learning workflows in their discovery pipelines.
|
35
|
Autonomous Discovery in the Chemical Sciences Part I: Progress. Angew Chem Int Ed Engl 2020; 59:22858-22893. [DOI: 10.1002/anie.201909987] [Citation(s) in RCA: 100] [Impact Index Per Article: 25.0] [Received: 08/06/2019] [Indexed: 01/05/2023]
|
36
|
|
37
|
An Analysis of QSAR Research Based on Machine Learning Concepts. Curr Drug Discov Technol 2020; 18:17-30. [PMID: 32178612 DOI: 10.2174/1570163817666200316104404] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 04/08/2019] [Revised: 08/22/2019] [Accepted: 10/28/2019] [Indexed: 11/22/2022]
Abstract
Quantitative Structure-Activity Relationship (QSAR) is a popular approach developed to correlate chemical molecules with their biological activities based on their chemical structures. Machine learning techniques have proved to be promising solutions to QSAR modeling. Due to the significant role of machine learning strategies in QSAR modeling, this area of research has attracted much attention from researchers. A considerable amount of literature has been published on machine learning based QSAR modeling methodologies, yet this domain still lacks a recent and comprehensive analysis of these algorithms. This study systematically reviews the application of machine learning algorithms in QSAR, aiming to provide an analytical framework. For this purpose, we present a framework called 'ML-QSAR'. This framework has been designed for future research to: a) facilitate the selection of proper strategies among existing algorithms according to the application area requirements, b) help to develop and ameliorate current methods and c) provide a platform to study existing methodologies comparatively. In ML-QSAR, a structured categorization of QSAR modeling research based on machine learning models is presented first. Then several criteria are introduced in order to assess the models. Finally, guided by the aforementioned criteria, a qualitative analysis is carried out.
|
38
|
Iterative experimental design based on active machine learning reduces the experimental burden associated with reaction screening. REACT CHEM ENG 2020. [DOI: 10.1039/d0re00232a] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Indexed: 12/19/2022]
Abstract
Through iterative selection of maximally informative experiments, active learning renders exhaustive screening obsolete. Chosen experiments are used to train models that are accurate over the entire domain, thus reducing the experiment burden.
|
39
|
Development and rigorous validation of antimalarial predictive models using machine learning approaches. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2019; 30:543-560. [PMID: 31328578 DOI: 10.1080/1062936x.2019.1635526] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 04/18/2019] [Accepted: 06/20/2019] [Indexed: 06/10/2023]
Abstract
The large collection of known and experimentally verified compounds from the ChEMBL database was used to build different classification models for predicting the antimalarial activity against Plasmodium falciparum. Four different machine learning methods, namely the support vector machine (SVM), random forest (RF), k-nearest neighbour (kNN) and XGBoost have been used for the development of models using the diverse antimalarial dataset from ChEMBL. A well-established feature selection framework was used to select the best subset from a larger pool of descriptors. Performance of the models was rigorously evaluated by evaluation of the applicability domain, Y-scrambling and AUC-ROC curve. Additionally, the predictive power of the models was also assessed using probability calibration and predictiveness curves. SVM and XGBoost showed the best performances, yielding an accuracy of ~85% on the independent test set. In terms of probability prediction, SVM and XGBoost were well calibrated. Total gain (TG) from the predictiveness curve was more related to SVM (TG = 0.67) and XGBoost (TG = 0.75). These models also predict the high-affinity compounds from PubChem antimalarial bioassay (as external validation) with a high probability score. Our findings suggest that the selected models are robust and can be potentially useful for facilitating the discovery of antimalarial agents.
|
40
|
Abstract
Artificial intelligence (AI), and, in particular, deep learning as a subcategory of AI, provides opportunities for the discovery and development of innovative drugs. Various machine learning approaches have recently (re)emerged, some of which may be considered instances of domain-specific AI which have been successfully employed for drug discovery and design. This review provides a comprehensive portrayal of these machine learning techniques and of their applications in medicinal chemistry. After introducing the basic principles, alongside some application notes, of the various machine learning algorithms, the current state of the art of AI-assisted pharmaceutical discovery is discussed, including applications in structure- and ligand-based virtual screening, de novo drug design, physicochemical and pharmacokinetic property prediction, drug repurposing, and related aspects. Finally, several challenges and limitations of the current methods are summarized, with a view to potential future directions for AI-assisted drug discovery and design.
|
41
|
Iterative Screening Methods for Identification of Chemical Compounds with Specific Values of Various Properties. J Chem Inf Model 2019; 59:2626-2641. [PMID: 31058504 DOI: 10.1021/acs.jcim.9b00093] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 11/28/2022]
Abstract
Identification of chemical compounds having desirable properties is a central goal of screening campaigns. Iterative screening is a means of surveying a set of compounds, during which their property values are determined and used as feedback for regression models. Quantitative models that assess the relationships between chemical structures and property/activity are repeatedly updated through this type of cycle, and the efficient sampling of compounds for the subsequent test is a key factor in the early identification of target compounds. Nevertheless, methodological approaches to comparisons and to establishing the degree of extrapolation of sampled compounds, including the effects of applicability domains, are still required. In the present study, we conducted a series of virtual experiments to assess the characteristics of different iterative screening methods. Genetic algorithm-based partial least-squares regression, support vector regression, Bayesian optimization with Gaussian Process (GP), and batch-based Bayesian optimization with GP (GP_batch) were all compared, based on the analysis of one million compounds extracted from the ZINC database. Our results show that, irrespective of the diversity of the initial set of compounds, it was possible to identify a compound having the desired property value using the appropriate screening method. However, overall, the GP_batch method was found to be preferable when evaluating properties that are either difficult to predict or for which a key factor is present in the set of molecular descriptors.
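In Bayesian optimization, the next compounds to test are usually chosen by an acquisition function computed from the model's posterior mean and uncertainty. The sketch below uses expected improvement for a maximization task and a naive top-k batch; both choices are assumptions made for illustration, the abstract does not state the acquisition function or batch strategy used, and the posterior values for the four compounds are hypothetical rather than outputs of a fitted Gaussian process.

```python
import math

def expected_improvement(mean, std, best):
    """Expected improvement over `best` for a Gaussian posterior N(mean, std^2)."""
    if std <= 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf

def pick_batch(candidates, best, batch_size):
    """candidates: list of (name, posterior_mean, posterior_std) tuples."""
    ranked = sorted(candidates,
                    key=lambda c: expected_improvement(c[1], c[2], best),
                    reverse=True)
    return [name for name, _, _ in ranked[:batch_size]]

# Hypothetical posterior predictions for four untested compounds.
candidates = [
    ("cpd_A", 0.70, 0.05),
    ("cpd_B", 0.60, 0.30),
    ("cpd_C", 0.85, 0.01),
    ("cpd_D", 0.40, 0.02),
]
batch = pick_batch(candidates, best=0.80, batch_size=2)
print(batch)
```

The batch mixes exploitation (cpd_C, predicted just above the current best with little uncertainty) with exploration (cpd_B, a lower prediction with high uncertainty), which is the trade-off an acquisition function is meant to encode.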
|
42
|
A Structure-Based Drug Discovery Paradigm. Int J Mol Sci 2019; 20:ijms20112783. [PMID: 31174387 PMCID: PMC6601033 DOI: 10.3390/ijms20112783] [Citation(s) in RCA: 245] [Impact Index Per Article: 49.0] [Received: 05/10/2019] [Revised: 05/31/2019] [Accepted: 06/04/2019] [Indexed: 12/14/2022] Open
Abstract
Structure-based drug design is becoming an essential tool for faster and more cost-efficient lead discovery relative to the traditional method. Genomic, proteomic, and structural studies have provided hundreds of new targets and opportunities for future drug discovery. This situation poses a major problem: the necessity to handle the “big data” generated by combinatorial chemistry. Artificial intelligence (AI) and deep learning play a pivotal role in the analysis and systemization of larger data sets by statistical machine learning methods. Advanced AI-based sophisticated machine learning tools have a significant impact on the drug discovery process including medicinal chemistry. In this review, we focus on the currently available methods and algorithms for structure-based drug design including virtual screening and de novo drug design, with a special emphasis on AI- and deep-learning-based methods used for drug discovery.
|
43
|
Data analytics on raw material properties to accelerate pharmaceutical drug development. Int J Pharm 2019; 563:122-134. [PMID: 30951857 DOI: 10.1016/j.ijpharm.2019.04.002] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/06/2019] [Revised: 03/29/2019] [Accepted: 04/01/2019] [Indexed: 12/19/2022]
Abstract
Manufacturability of active pharmaceutical ingredients (APIs) is often evaluated by an empirical approach during development due to limited material availability. This brings challenges in designing flexible yet robust manufacturing processes under highly accelerated timelines. Hence, good utilisation of a limited material dataset is key to accelerate the delivery of high quality final drug product into the market at minimum cost and maximum process capacity. In this study, we present a data-driven method to investigate a raw materials database where the integration of multivariate analysis and machine learning modelling aids the selection of new incoming materials based on their manufacturability. The procedure was applied to an industrial representative database of thirty-four APIs and seven excipients where eight measurements relevant to flow properties for each of those forty-one materials were collected. The models identified four clusters of materials with different flow properties. These models can serve as a risk assessment tool for new API in early product development phases based on the nearest surrogate material which behave similarly, as well as to identify targeted and material sparring experiments to address key risks during secondary process selection.
|
44
|
Survey of Machine Learning Techniques in Drug Discovery. Curr Drug Metab 2019; 20:185-193. [DOI: 10.2174/1389200219666180820112457] [Citation(s) in RCA: 111] [Impact Index Per Article: 22.2] [Received: 09/06/2017] [Revised: 01/01/2018] [Accepted: 03/19/2018] [Indexed: 12/19/2022]
Abstract
Background: Drug discovery, which is the process of discovering new candidate medications, is very important for pharmaceutical industries. At its current stage, discovering new drugs is still a very expensive and time-consuming process, requiring Phases I, II and III for clinical trials. Recently, machine learning techniques in Artificial Intelligence (AI), especially the deep learning techniques which allow a computational model to generate multiple layers, have been widely applied and achieved state-of-the-art performance in different fields, such as speech recognition, image classification, bioinformatics, etc. One very important application of these AI techniques is in the field of drug discovery. Methods: We did a large-scale literature search on existing scientific websites (e.g., ScienceDirect, arXiv) and startup companies to understand the current status of machine learning techniques in drug discovery. Results: Our experiments demonstrated that there are different patterns in machine learning fields and drug discovery fields. For example, keywords like prediction, brain, discovery, and treatment are usually in drug discovery fields. Also, the total number of papers published in drug discovery fields with machine learning techniques is increasing every year. Conclusion: The main focus of this survey is to understand the current status of machine learning techniques in the drug discovery field within both academic and industrial settings, and discuss its potential future applications. Several interesting patterns for machine learning techniques in drug discovery fields are discussed in this survey.
|
45
|
Abstract
Iterative screening has emerged as a promising approach to increase the efficiency of high-throughput screening (HTS) campaigns in drug discovery. By learning from a subset of the compound library, inferences on what compounds to screen next can be made by predictive models. One of the challenges of iterative screening is to decide how many iterations to perform. This is mainly related to difficulties in estimating the prospective hit rate in any given iteration. In this article, a novel method based on Venn-ABERS predictors is proposed. The method provides accurate estimates of the number of hits retrieved in any given iteration during an HTS campaign. The estimates provide the necessary information to support the decision on the number of iterations needed to maximize the screening outcome. Thus, this method offers a prospective screening strategy for early-stage drug discovery.
|
46
|
Active learning strategies with COMBINE analysis: new tricks for an old dog. J Comput Aided Mol Des 2019; 33:287-294. [PMID: 30564994 PMCID: PMC7087723 DOI: 10.1007/s10822-018-0181-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/25/2018] [Accepted: 12/14/2018] [Indexed: 11/09/2022]
Abstract
The COMBINE method was designed to study congeneric series of compounds including structural information of ligand-protein complexes. Although very successful, the method has not received the same level of attention as other alternatives to study Quantitative Structure-Activity Relationships (QSAR), mainly because of the lack of ways to measure the uncertainty of the predictions and the need for large datasets. Active learning, a semi-supervised learning approach that makes use of uncertainty to enhance models' performance while reducing the size of the training sets, has been used in this work to address both problems. We propose two estimators of uncertainty: the pool of regressors and the distance to the training set. The performance of the methods has been evaluated by testing the resulting active learning workflows in 3 diverse datasets: HIV-1 protease inhibitors, Taxol derivatives and BRD4 inhibitors. The proposed strategies were successful in 80% of the cases for the Taxol derivatives and BRD4 inhibitors, while they outperformed random selection in the case of the HIV-1 protease inhibitors time-split. Our results suggest that AL-COMBINE might be an effective way of producing consistently superior QSAR models with a limited number of samples.
|
47
|
Efficient multi-task chemogenomics for drug specificity prediction. PLoS One 2018; 13:e0204999. [PMID: 30286165 PMCID: PMC6171913 DOI: 10.1371/journal.pone.0204999] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/23/2018] [Accepted: 09/18/2018] [Indexed: 01/10/2023] Open
Abstract
Adverse drug reactions, also called side effects, range from mild to fatal clinical events and significantly affect the quality of care. Among other causes, side effects occur when drugs bind to proteins other than their intended target. As experimentally testing drug specificity against the entire proteome is out of reach, we investigate the application of chemogenomics approaches. We formulate the study of drug specificity as a problem of predicting interactions between drugs and proteins at the proteome scale. We build several benchmark datasets, and propose NN-MT, a multi-task Support Vector Machine (SVM) algorithm that is trained on a limited number of data points, in order to solve the computational issues of proteome-wide SVM for chemogenomics. We compare NN-MT to different state-of-the-art methods, and show that its prediction performance is similar or better, at an efficient calculation cost. Compared to its competitors, the proposed method is particularly efficient at predicting (protein, ligand) interactions in the difficult double-orphan case, i.e. when no interactions are previously known for either the protein or the ligand. The NN-MT algorithm appears to be a good default method providing state-of-the-art or better performance in the wide range of prediction scenarios considered in the present study: proteome-wide prediction, protein family prediction, test (protein, ligand) pairs dissimilar to pairs in the train set, and orphan cases.
|
48
|
Active learning across intermetallics to guide discovery of electrocatalysts for CO2 reduction and H2 evolution. Nat Catal 2018. [DOI: 10.1038/s41929-018-0142-1] [Citation(s) in RCA: 323] [Impact Index Per Article: 53.8] [Indexed: 11/09/2022]
|
49
|
Integration of Lead Discovery Tactics and the Evolution of the Lead Discovery Toolbox. SLAS DISCOVERY 2018; 23:881-897. [PMID: 29874524 DOI: 10.1177/2472555218778503] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 12/26/2022]
Abstract
There has been much debate around the success rates of various screening strategies to identify starting points for drug discovery. Although high-throughput target-based and phenotypic screening has been the focus of this debate, techniques such as fragment screening, virtual screening, and DNA-encoded library screening are also increasingly reported as a source of new chemical equity. Here, we provide examples in which integration of more than one screening approach has improved the campaign outcome and discuss how strengths and weaknesses of various methods can be used to build a complementary toolbox of approaches, giving researchers the greatest probability of successfully identifying leads. Among others, we highlight case studies for receptor-interacting serine/threonine-protein kinase 1 and the bromo- and extra-terminal domain family of bromodomains. In each example, the unique insight or chemistries individual approaches provided are described, emphasizing the synergy of information obtained from the various tactics employed and the particular question each tactic was employed to answer. We conclude with a short prospective discussing how screening strategies are evolving, what this screening toolbox might look like in the future, how to maximize success through integration of multiple tactics, and scenarios that drive selection of one combination of tactics over another.
|
50
|
Recognition of protein allosteric states and residues: Machine learning approaches. J Comput Chem 2018; 39:1481-1490. [PMID: 29604117 DOI: 10.1002/jcc.25218] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 12/27/2017] [Revised: 03/02/2018] [Accepted: 03/11/2018] [Indexed: 01/28/2023]
Abstract
Allostery is a process by which proteins transmit the effect of a perturbation at one site to a distal functional site. As an intrinsically global effect of protein dynamics, it is difficult to associate protein allostery with individual residues, hindering effective selection of key residues for mutagenesis studies. Machine learning models, including decision tree (DT) and artificial neural network (ANN) models, were applied to develop a classification model for a cell signaling allosteric protein with two states showing extremely similar tertiary structures in both crystallographic structures and molecular dynamics simulations. The DT and ANN models achieved prediction accuracies of 75% and 80%, respectively. Good agreement between the machine learning models and previous experimental as well as computational studies of the same protein validates this approach as an alternative way to analyze protein dynamics simulations and allostery. In addition, the difference in the distributions of key features between the two allosteric states also underlies the population shift hypothesis of the dynamics-driven allostery model. © 2018 Wiley Periodicals, Inc.
|