1
|
From NMR to AI: Designing a Novel Chemical Representation to Enhance Machine Learning Predictions of Physicochemical Properties. J Chem Inf Model 2024; 64:3302-3321. [PMID: 38529877 DOI: 10.1021/acs.jcim.3c02039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2024]
Abstract
A novel approach to the utilization of nuclear magnetic resonance (NMR) spectroscopy data in the prediction of logD through machine learning algorithms is shown. In the analysis, a data set of 754 chemical compounds, organized into 30 clusters, was evaluated using advanced machine learning models, such as Support Vector Regression (SVR), Gradient Boosting, and AdaBoost, and comprehensive validation and testing methods were employed, including 10-fold cross-validation, bootstrapping, and leave-one-out. The study revealed the superior performance of the Bucket Integration method for dimensionality reduction, consistently yielding the lowest root mean square error (RMSE) across all data sets and normalization schemes. The SVR prediction models demonstrated remarkable computational efficiency and low cost, with the best RMSE value reaching 0.66. Our best model outperformed existing tools like JChem Suite's logD Predictor (0.91) and CplogD (1.27), and a comparison with traditional molecular representations yielded a comparable RMSE (0.50), emphasizing the robustness of our NMR data integration. The widespread availability of NMR data in pharmaceutical and industrial research presents an untapped resource for predictive modeling, highlighting the need for accessible methodologies like ours that complement the analytical toolbox beyond conventional 2D approaches. Our approach, designed to leverage the rich spatial data from NMR spectroscopy, provides additional insights and enriches drug discovery and computational chemistry with a freely accessible tool.
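The bucket integration step named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the bucket width, the spectral window, or the normalization, so the function name `bucket_integrate` and the defaults (0 to 10 ppm, 0.5 ppm buckets) are assumptions made only for illustration. For an evenly sampled spectrum, summing the intensities in each bucket is proportional to the integral over that bucket.

```python
def bucket_integrate(shifts_ppm, intensities, start=0.0, stop=10.0, width=0.5):
    """Reduce a spectrum to a fixed-length vector of per-bucket intensity sums.

    shifts_ppm and intensities are parallel sequences (one intensity per
    chemical shift). The window and bucket width are illustrative assumptions.
    """
    n_buckets = int(round((stop - start) / width))
    buckets = [0.0] * n_buckets
    for shift, intensity in zip(shifts_ppm, intensities):
        if start <= shift < stop:
            # Clamp the index so rounding near the upper edge stays in range.
            index = min(int((shift - start) / width), n_buckets - 1)
            buckets[index] += intensity
    return buckets
```

Each compound then yields a vector of the same length regardless of how many points its spectrum contains, which is what a model such as SVR needs as input.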
|
2
|
FOTF-CPI: A compound-protein interaction prediction transformer based on the fusion of optimal transport fragments. iScience 2024; 27:108756. [PMID: 38230261 PMCID: PMC10790010 DOI: 10.1016/j.isci.2023.108756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Revised: 11/05/2023] [Accepted: 12/13/2023] [Indexed: 01/18/2024] Open
Abstract
Compound-protein interaction (CPI) affinity prediction plays an important role in reducing the cost and time of drug discovery. However, the interpretability of how fragments function in CPI is impacted by the fact that current methods ignore the affinity relationships between fragments of compounds and fragments of proteins in CPI modeling. This article introduces an improved Transformer called FOTF-CPI (a Fusion of Optimal Transport Fragments compound-protein interaction prediction model). We use an optimal transport-based fragmentation approach to improve the model's understanding of compound and protein sequences. Additionally, a fused attention mechanism is employed, which combines the features of fragments to capture full affinity information. This fused attention redistributes higher attention scores to fragments with higher affinity. Experimental results show FOTF-CPI achieves an average 2% higher performance than other models on all three datasets. Furthermore, the visualization confirms the potential of FOTF-CPI for drug discovery applications.
|
3
|
Extreme Gradient Boosting Combined with Conformal Predictors for Informative Solubility Estimation. Molecules 2023; 29:19. [PMID: 38202602 PMCID: PMC10779886 DOI: 10.3390/molecules29010019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 12/15/2023] [Accepted: 12/17/2023] [Indexed: 01/12/2024] Open
Abstract
We used the extreme gradient boosting (XGB) algorithm to predict the experimental solubility of chemical compounds in water and organic solvents and to select significant molecular descriptors. In the first step, the prediction accuracy of our forward stepwise top-importance XGB (FSTI-XGB) on curated solubility data sets, in terms of RMSE, was found to be 0.59-0.76 Log(S) for two water data sets, while for organic solvent data sets it was 0.69-0.79 Log(S) for the Methanol data set, 0.65-0.79 Log(S) for the Ethanol data set, and 0.62-0.70 Log(S) for the Acetone data set. In the second step, we used uncurated and curated AquaSolDB data sets for applicability domain (AD) tests of the Drugbank, PubChem, and COCONUT databases and determined that more than 95% of the ca. 500,000 compounds studied were within the AD. In the third step, we applied conformal prediction to obtain narrow prediction intervals, and we successfully validated them using the true solubility values of the test sets. In the fourth and last step, with the prediction intervals obtained, we were able to estimate individual error margins and the accuracy class of the solubility prediction for molecules within the AD of three public databases. All of this was possible without knowledge of experimental solubilities from those databases. We consider these four steps novel because solubility-related works usually study only the first step or the first two steps.
|
4
|
LogD7.4 prediction enhanced by transferring knowledge from chromatographic retention time, microscopic pKa and logP. J Cheminform 2023; 15:76. [PMID: 37670374 PMCID: PMC10478446 DOI: 10.1186/s13321-023-00754-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Accepted: 08/25/2023] [Indexed: 09/07/2023] Open
Abstract
Lipophilicity is a fundamental physical property that significantly affects various aspects of drug behavior, including solubility, permeability, metabolism, distribution, protein binding, and toxicity. Accurate prediction of lipophilicity, measured by the logD7.4 value (the distribution coefficient between n-octanol and buffer at physiological pH 7.4), is crucial for successful drug discovery and design. However, the limited availability of data for logD modeling poses a significant challenge to achieving satisfactory generalization capability. To address this challenge, we have developed a novel logD7.4 prediction model called RTlogD, which leverages knowledge from multiple sources. RTlogD is pre-trained on a chromatographic retention time (RT) dataset, since RT is influenced by lipophilicity. Additionally, microscopic pKa values are incorporated as atomic features, providing valuable insights into ionizable sites and ionization capacity. Furthermore, logP is integrated as an auxiliary task within a multitask learning framework. We conducted ablation studies and present a detailed analysis showcasing the effectiveness and interpretability of RT, pKa, and logP in the RTlogD model. Notably, our RTlogD model demonstrated superior performance compared to commonly used algorithms and prediction tools. These results underscore the potential of the RTlogD model to improve the accuracy and generalization of logD prediction in drug discovery and design. In summary, the RTlogD model addresses the challenge of limited data availability in logD modeling by leveraging knowledge from RT, microscopic pKa, and logP. Incorporating these factors enhances the predictive capabilities of our model, and it holds promise for real-world applications in drug discovery and design scenarios.
|
5
|
Synthesis, Optical Properties, and In Vivo Biodistribution Performance of Polymethine Cyanine Fluorophores. ACS Pharmacol Transl Sci 2023; 6:1192-1206. [PMID: 37588753 PMCID: PMC10425993 DOI: 10.1021/acsptsci.3c00101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Indexed: 08/18/2023]
Abstract
Near-infrared (NIR) cyanine dyes showed enhanced properties for biomedical imaging. A systematic modification within the cyanine skeleton has been made through a facile design and synthetic route for optimal bioimaging. Herein, we report the synthesis of 11 NIR cyanine fluorophores and an investigation of their physicochemical properties, optical characteristics, photostability, and in vivo performance. All synthesized fluorophores absorb and emit within 610-817 nm in various solvents. These dyes also showed high molar extinction coefficients ranging from 27,000 to 270,000 cm⁻¹ M⁻¹, quantum yields of 0.01 to 0.33, and molecular brightness of 208-79,664 cm⁻¹ M⁻¹ in the tested solvents. Photostability data demonstrate that all tested fluorophores (28, 18, 20, 19, 25, and 24) are more photostable than the FDA-approved indocyanine green. In the biodistribution study, most compounds showed tissue-specific targeting, selectively accumulating in the adrenal glands, lymph nodes, or gallbladder while being excreted via the hepatobiliary clearance route. Among the compounds tested, compound 23 showed the best targetability to the bone marrow and lymph nodes. Since the safety of cyanine fluorophores is well established, the rationally designed cyanine fluorophores established in the current study will expand the inventory of contrast agents for NIR imaging of not only normal tissues but also cancerous regions originating from these organs/tissues.
|
6
|
ALipSol: An Attention-Driven Mixture-of-Experts Model for Lipophilicity and Solubility Prediction. J Chem Inf Model 2022; 62:5975-5987. [PMID: 36417544 DOI: 10.1021/acs.jcim.2c01290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
Abstract
Lipophilicity (logD) and aqueous solubility (logSw) play a central role in drug development. The accurate prediction of these properties remains to be solved due to data scarcity. Current methodologies neglect the intrinsic relationships between physicochemical properties and usually ignore the ionization effects. Here, we propose an attention-driven mixture-of-experts (MoE) model named ALipSol, which explicitly reproduces the hierarchy of task relationships. We adopt the principle of divide-and-conquer by breaking down the complex end point (logD or logSw) into simpler ones (acidic pKa, basic pKa, and logP) and allocating a specific expert network for each subproblem. Subsequently, we implement transfer learning to extract knowledge from related tasks, thus alleviating the dilemma of limited data. Additionally, we substitute the gating network with an attention mechanism to better capture the dynamic task relationships on a per-example basis. We adopt local fine-tuning and consensus prediction to further boost model performance. Extensive evaluation experiments verify the success of the ALipSol model, which achieves RMSE improvement of 8.04%, 2.49%, 8.57%, 12.8%, and 8.60% on the Lipop, ESOL, AqSolDB, external logD, and external logS data sets, respectively, compared with Attentive FP and the state-of-the-art in silico tools. In particular, our model yields more significant advantages (Welch's t-test) for small training data, implying its high robustness and generalizability. The interpretability analysis proves that the atom contributions learned by ALipSol are more reasonable compared with the vanilla Attentive FP, and the substitution effects in benzene derivatives agreed well with empirical constants, revealing the potential of our model to extract useful patterns from data and provide guidance for lead optimization.
|
7
|
Discovery of Phenylcarbamoylazinane-1,2,4-Triazole Amides Derivatives as the Potential Inhibitors of Aldo-Keto Reductases (AKR1B1 & AKRB10): Potential Lead Molecules for Treatment of Colon Cancer. Molecules 2022; 27:3981. [PMID: 35807227 PMCID: PMC9268700 DOI: 10.3390/molecules27133981] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 04/16/2022] [Revised: 05/19/2022] [Accepted: 05/23/2022] [Indexed: 12/12/2022] Open
Abstract
Both members of the aldo-keto reductase (AKR) family, AKR1B1 and AKR1B10, are over-expressed in various types of cancer, making them potential targets for inflammation-mediated cancers such as colon, lung, breast, and prostate cancers. This is the first comprehensive study focused on the identification of phenylcarbamoylazinane-1,2,4-triazole amides (7a−o) as inhibitors of aldo-keto reductases (AKR1B1, AKR1B10) via detailed computational analysis. Firstly, the stability and reactivity of the compounds were determined using the Gaussian09 programme, in which density functional theory (DFT) calculations were performed at the B3LYP/SVP level. Among all the derivatives, 7d, 7e, 7f, 7h, 7j, 7k, and 7m were found to be chemically reactive. Then the binding interactions of the optimized compounds within the active pocket of the selected targets were examined using molecular docking software: AutoDock tools and Molecular Operating Environment (MOE) software. During analysis, the AutoDock (academic software) results were found to be reproducible, suggesting that this software is preferable to MOE (commercial software). The results were found to correlate with the DFT results, suggesting 7d as the best inhibitor of AKR1B1 with an energy value of −49.40 kJ/mol and 7f as the best inhibitor of AKR1B10 with an energy value of −52.84 kJ/mol. The other potent compounds also showed comparable binding energies. The best inhibitors of both targets were validated by molecular dynamics simulation studies, in which a root mean square value of <2 was observed along with the other physicochemical properties, hydrogen bond interactions, and binding energies. Furthermore, the anticancer potential of the potent compounds was confirmed by cell viability (MTT) assay. The studied compounds fall into the category of drug-like properties and are also supported by physicochemical and pharmacological ADMET properties.
It can be suggested that the further synthesis of derivatives of 7d and 7f may lead to the potential drug-like molecules for the treatment of colon cancer associated with the aberrant expression of either AKR1B1 or AKR1B10 and other associated malignancies.
|
8
|
Dose-dependent alkaloid sequestration and N-methylation of decahydroquinoline in poison frogs. Journal of Experimental Zoology. Part A, Ecological and Integrative Physiology 2022; 337:537-546. [PMID: 35201668 DOI: 10.1002/jez.2587] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/18/2021] [Revised: 12/22/2021] [Accepted: 01/27/2022] [Indexed: 06/14/2023]
Abstract
Sequestration of chemical defenses from dietary sources is dependent on the availability of compounds in the environment and the mechanism of sequestration. Previous experiments have shown that sequestration efficiency varies among alkaloids in poison frogs, but little is known about the underlying mechanism. The aim of this study was to quantify the extent to which alkaloid sequestration and modification are dependent on alkaloid availability and/or sequestration mechanism. To do this, we administered different doses of histrionicotoxin (HTX) 235A and decahydroquinoline (DHQ) to captive-bred Adelphobates galactonotus and measured alkaloid quantity in muscle, kidney, liver, and feces. HTX 235A and DHQ were detected in all organs, whereas only DHQ was present in trace amounts in feces. For both liver and skin, the quantity of alkaloid accumulated increased at higher doses for both alkaloids. Accumulation efficiency in the skin increased at higher doses for HTX 235A but remained constant for DHQ. In contrast, the efficiency of HTX 235A accumulation in the liver was inversely related to dose and a similar, albeit statistically nonsignificant, pattern was observed for DHQ. We identified and quantified the N-methylation of DHQ in A. galactonotus, which represents a previously unknown example of alkaloid modification in poison frogs. Our study suggests that variation in alkaloid composition among individuals and species can result from differences in sequestration efficiency related to the type and amount of alkaloids available in the environment.
|
9
|
In silico predictions of the gastrointestinal uptake of macrocycles in man using conformal prediction methodology. J Pharm Sci 2022; 111:2614-2619. [DOI: 10.1016/j.xphs.2022.05.010] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/26/2022] [Revised: 05/16/2022] [Accepted: 05/16/2022] [Indexed: 11/17/2022]
|
10
|
Comparison of logP and logD correction models trained with public and proprietary data sets. J Comput Aided Mol Des 2022; 36:253-262. [PMID: 35359246 DOI: 10.1007/s10822-022-00450-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2021] [Accepted: 03/15/2022] [Indexed: 10/18/2022]
Abstract
In drug discovery, partition and distribution coefficients, logP and logD for octanol/water, are widely used as metrics of the lipophilicity of molecules, which in turn has a strong influence on the bioactivity and bioavailability of potential drugs. There are a variety of established methods, mostly fragment- or atom-based, to calculate logP, while logD prediction generally relies on calculated logP and pKa for the estimation of neutral and ionized populations at a given pH. Algorithms such as ClogP have limitations that generally lead to systematic errors for chemically related molecules, while pKa estimation is generally more difficult due to the interplay of electronic, inductive and conjugation effects for ionizable moieties. We propose an integrated machine learning QSAR modeling approach to predict logD by training the model with experimental data while using ClogP and pKa predicted by commercial software as model descriptors. By optimizing the loss function for the ClogD calculated by the software, we build a correction model that incorporates both descriptors from the software and available experimental logD data. Additionally, we calculate logP from the logD model using the software-predicted pKa values. Here, we have trained models using publicly or commercially available logD data to show that this approach can improve on commercial software predictions of lipophilicity. When applied to other logD data sets, this approach extends the domain of applicability of logD and logP predictions over commercial software. The performance of these models compares favorably with that of models built with a larger set of proprietary logD data.
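The dependence of logD on logP and pKa mentioned in the abstract above can be made concrete with the standard textbook relation for a monoprotic compound. This is a minimal sketch, not the authors' correction model: it assumes a single ionizable group and that only the neutral species partitions into octanol, and the function name `log_d` is hypothetical.

```python
import math

def log_d(log_p, pka, ph=7.4, kind="acid"):
    """Estimate logD at a given pH from logP and one pKa.

    Assumes a monoprotic compound whose ionized form does not partition
    into octanol (a common simplification, stated here as an assumption).
    """
    if kind == "acid":
        exponent = ph - pka   # acids ionize as pH rises above pKa
    elif kind == "base":
        exponent = pka - ph   # bases ionize as pH falls below pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1.0 + 10.0 ** exponent)
```

When the pH equals the pKa, half of the compound is ionized and the estimate is logP minus log10(2), about 0.30 log units below logP. Errors in either the calculated logP or the predicted pKa propagate directly into this estimate, which is the motivation the abstract gives for a data-driven correction.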
|
11
|
In silico predictions of the human pharmacokinetics/toxicokinetics of 65 chemicals from various classes using conformal prediction methodology. Xenobiotica 2022; 52:113-118. [PMID: 35238270 DOI: 10.1080/00498254.2022.2049397] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 01/20/2023]
Abstract
Pharmacokinetic/toxicokinetic (PK/TK) information for chemicals in humans is generally lacking. Here we applied machine learning, conformal prediction and a new physiologically-based PK/TK model for prediction of the human PK/TK of 65 chemicals from different classes, including carcinogens, food constituents and preservatives, vitamins, sweeteners, dyes and colours, pesticides, alternative medicines, flame retardants, psychoactive drugs, dioxins, poisons, UV-absorbents, surfactants, solvents and cosmetics. About 80% of the main human PK/TK (fraction absorbed, oral bioavailability, half-life, unbound fraction in plasma, clearance, volume of distribution, fraction excreted) for the selected chemicals was missing in the literature. This information was now added (from in silico predictions). Median and mean prediction errors for these parameters were 1.3- to 2.7-fold and 1.4- to 4.8-fold, respectively. In total, 59 and 86% of predictions had errors <2- and <5-fold, respectively. Predicted and observed PK/TK for the chemicals were generally within the range for pharmaceutical drugs. The results validated the new integrated system for prediction of the human PK/TK for different chemicals and added important missing information. No general difference in PK/TK characteristics was found between the selected chemicals and pharmaceutical drugs.
|
12
|
Physicochemical and biopharmaceutical aspects influencing skin permeation and role of SLN and NLC for skin drug delivery. Heliyon 2022; 8:e08938. [PMID: 35198788 PMCID: PMC8851252 DOI: 10.1016/j.heliyon.2022.e08938] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 06/05/2021] [Revised: 01/30/2022] [Accepted: 02/08/2022] [Indexed: 12/28/2022] Open
Abstract
The skin is a complex and multifunctional organ, in which the static versus dynamic balance is responsible for its constant adaptation to variations in the external environment to which it is continuously exposed. One of the most important functions of the skin is its ability to act as a protective barrier against the entry of foreign substances and against the excessive loss of endogenous material. Human skin imposes physical, chemical and biological limitations on all types of permeating agents that can cross the epithelial barrier. For a molecule to permeate passively through the skin, it must have properties, such as dimensions, molecular weight, pKa and hydrophilic-lipophilic gradient, appropriate to the anatomy and physiology of the skin. These requirements have limited the number of commercially available products for dermal and transdermal administration of drugs. To understand the mechanisms involved in the drug permeation process through the skin, the approach should be multidisciplinary in order to overcome biological and pharmacotechnical barriers. The study of the mechanisms involved in the permeation process, and the ways to control it, can make this route of drug administration cease to be a constant promise and become a reality. In this work, we address the physicochemical and biopharmaceutical aspects encountered in the pathway of drugs through the skin, and the potential added value of using solid lipid nanoparticles (SLN) and nanostructured lipid vectors (NLC) for drug permeation/penetration through this route. The technology and architecture for obtaining lipid nanoparticles are described in detail, namely the composition, production methods and the ability to release pharmacologically active substances, as well as the application of these systems in the vectorization of various pharmacologically active substances for dermal and transdermal applications.
The characteristics of these systems in terms of dermal application are addressed, such as biocompatibility, occlusion, hydration, emollience and the penetration of pharmacologically active substances. The advantages of using these systems over conventional formulations are described and explored from a pharmaceutical point of view.
|
13
|
In silico prediction of volume of distribution of drugs in man using conformal prediction performs on par with animal data-based models. Xenobiotica 2021; 51:1366-1371. [PMID: 34845977 DOI: 10.1080/00498254.2021.2011471] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/19/2022]
Abstract
Volume of distribution at steady state (Vss) is an important pharmacokinetic endpoint. In this study we apply machine learning and conformal prediction for human Vss prediction, and make a head-to-head comparison with rat-to-man scaling, allometric scaling and the Rodgers-Lukova method on combined in silico and in vitro data, using a test set of 105 compounds with experimentally observed Vss. The mean prediction error and % with <2-fold prediction error for our method were 2.4-fold and 64%, respectively. 69% of test compounds had an observed Vss within the prediction interval at a 70% confidence level. In comparison, 2.2-, 2.9- and 3.1-fold mean errors and 69, 64 and 61% of predictions with <2-fold error were reached with rat-to-man scaling, allometric scaling and the Rodgers-Lukova method, respectively. We conclude that our method has theoretically proven validity that was empirically confirmed, and shows predictive accuracy on par with animal models and superior to an alternative widely used in silico-based method. The option for the user to select the level of confidence in predictions offers better guidance on how to optimise Vss in drug discovery applications.
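The fold-error figures quoted in the abstract above can be illustrated with a short sketch. The abstract does not say how the mean fold error is averaged, so the arithmetic mean used here is an assumption, and the function names `fold_errors` and `summarize_fold_errors` are hypothetical. The fold error of one prediction is taken as the larger of predicted/observed and observed/predicted, so a perfect prediction scores 1.

```python
def fold_errors(predicted, observed):
    """Per-compound fold error: max(pred/obs, obs/pred), always >= 1.

    Both inputs must contain strictly positive values (e.g. Vss in L/kg).
    """
    return [max(p / o, o / p) for p, o in zip(predicted, observed)]

def summarize_fold_errors(predicted, observed, threshold=2.0):
    """Return (mean fold error, fraction of predictions below threshold).

    The arithmetic mean is an assumption; a geometric mean is another
    common convention and would give a smaller number.
    """
    errors = fold_errors(predicted, observed)
    mean_error = sum(errors) / len(errors)
    fraction_within = sum(e < threshold for e in errors) / len(errors)
    return mean_error, fraction_within
```

Because the threshold comparison is strict, a prediction that is off by exactly 2-fold does not count toward the "<2-fold" fraction.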
|
14
|
Abstract
Animal experimentation has been fundamental in biological and biomedical research. To guarantee the maximum quality, efficacy and/or safety of products intended for use in humans, in vivo testing is necessary; however, for over 60 years, alternative methods have been developed in response to the need to reduce the number of animals used in experimentation and to guarantee their welfare, resorting to animal models only when strictly necessary. The three Rs (Replacement, Reduction, and Refinement) seek to ensure the rational and respectful use of laboratory animals and maintain an adequate projection in terms of bioethical considerations. This article describes different approaches to applying the 3Rs in preclinical experimentation for either research or regulatory purposes.
|
15
|
Abstract
Machine learning is widely used in drug development to predict activity in biological assays based on chemical structure. However, the process of transitioning from one experimental setup to another for the same biological endpoint has not been extensively studied. In a retrospective study, we explore different modeling strategies for combining data from the old and new assays when training conformal prediction models, using data from hERG and NaV assays. We suggest continuously monitoring the validity and efficiency of models as more data is accumulated from the new assay and selecting a modeling strategy based on these metrics. In order to maximize the utility of data from the old assay, we propose a strategy that augments the proper training set of an inductive conformal predictor by adding data from the old assay while keeping only data from the new assay in the calibration set, which results in valid (well-calibrated) models with improved efficiency compared to other strategies. We study the results for varying sizes of new and old assays, allowing for discussion of different practical scenarios. We also conclude that our proposed assay transition strategy is more beneficial, and the value of data from the new assay is higher, for the harder case of regression compared to classification problems.
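The strategy described in the abstract above (old-assay data added to the proper training set, new-assay data alone in the calibration set) can be sketched for the regression case. This is an illustration under stated assumptions, not the authors' implementation: the underlying model is a one-feature least-squares line chosen only to keep the example self-contained, the nonconformity score is the absolute residual, and the names `fit_line` and `icp_interval` are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def icp_interval(proper_train, calibration, x_new, confidence=0.8):
    """Inductive conformal prediction interval for one new example.

    proper_train: (x, y) pairs; may mix old-assay and new-assay data.
    calibration:  (x, y) pairs from the new assay only, never used for fitting.
    """
    slope, intercept = fit_line(*zip(*proper_train))
    predict = lambda x: slope * x + intercept
    # Nonconformity score: absolute residual on each calibration example.
    scores = sorted(abs(y - predict(x)) for x, y in calibration)
    n = len(scores)
    k = math.ceil((n + 1) * confidence)
    if k > n:
        raise ValueError("calibration set too small for this confidence level")
    half_width = scores[k - 1]
    y_hat = predict(x_new)
    return y_hat - half_width, y_hat + half_width
```

A call following the abstract's strategy would pass `old_assay_pairs + new_assay_train_pairs` as `proper_train` and held-out new-assay pairs as `calibration`. Calibrating only on new-assay data is what ties the interval width to errors observed under the new experimental setup.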
|
16
|
Abstract
Introduction: Artificial intelligence (AI) and machine learning (ML) are increasingly used in many aspects of drug discovery. Larger data sizes and methods such as Deep Neural Networks contribute to challenges in data management, the required software stack, and computational infrastructure. There is an increasing need in drug discovery to continuously re-train models and make them available in production environments. Areas covered: This article describes how cloud computing can aid the ML life cycle in drug discovery. The authors discuss opportunities with containerization and scientific workflows, introduce the concept of MLOps, and describe how it can facilitate reproducible and robust ML modeling in drug discovery organizations. They also discuss ML on private, sensitive and regulated data. Expert opinion: Cloud computing offers a compelling suite of building blocks to sustain the ML life cycle integrated in iterative drug discovery. Containerization and platforms such as Kubernetes together with scientific workflows can enable reproducible and resilient analysis pipelines, and the elasticity and flexibility of cloud infrastructures enable scalable and efficient access to compute resources. Drug discovery commonly involves working with sensitive or private data, and cloud computing and federated learning can contribute toward enabling collaborative drug discovery within and between organizations. Abbreviations: AI = Artificial Intelligence; DL = Deep Learning; GPU = Graphics Processing Unit; IaaS = Infrastructure as a Service; K8S = Kubernetes; ML = Machine Learning; MLOps = Machine Learning and Operations; PaaS = Platform as a Service; QC = Quality Control; SaaS = Software as a Service.
|
17
|
A review on compound-protein interaction prediction methods: Data, format, representation and model. Comput Struct Biotechnol J 2021; 19:1541-1556. [PMID: 33841755 PMCID: PMC8008185 DOI: 10.1016/j.csbj.2021.03.004] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 08/31/2020] [Revised: 02/28/2021] [Accepted: 03/01/2021] [Indexed: 01/27/2023] Open
Abstract
There has recently been rapid progress in computational methods for determining protein targets of small molecule drugs, which will be termed compound protein interaction (CPI). In this review, we comprehensively review topics related to computational prediction of CPI. Data for CPI has been accumulated and curated significantly both in quantity and quality. Computational methods have become more powerful than ever at analyzing such complex data. Thus, recent successes in the improved quality of CPI prediction are due to the use of both sophisticated computational techniques and higher quality information in the databases. The goal of this article is to provide reviews of topics related to CPI, from data, format, and representation to computational models, so that researchers can take full advantage of these resources to develop novel prediction methods. Chemical compounds and protein data from various resources are discussed in terms of data formats and encoding schemes. For the CPI methods, we grouped prediction methods into five categories, from traditional machine learning techniques to state-of-the-art deep learning techniques. In closing, we discuss emerging machine learning topics to help both experimental and computational scientists leverage the current knowledge and strategies to develop more powerful and accurate CPI prediction methods.
|
18
|
A comprehensive comparison of molecular feature representations for use in predictive modeling. Comput Biol Med 2021; 130:104197. [PMID: 33429140 DOI: 10.1016/j.compbiomed.2020.104197] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/18/2020] [Revised: 12/21/2020] [Accepted: 12/21/2020] [Indexed: 11/23/2022]
Abstract
Machine learning methods are commonly used for predicting molecular properties to accelerate material and drug design. An important part of this process is deciding how to represent the molecules. Typically, machine learning methods expect examples represented by vectors of values, and many methods for calculating molecular feature representations have been proposed. In this paper, we perform a comprehensive comparison of different molecular features, including traditional methods such as fingerprints and molecular descriptors, and recently proposed learnable representations based on neural networks. Feature representations are evaluated on 11 benchmark datasets, used for predicting properties and measures such as mutagenicity, melting points, activity, solubility, and IC50. Our experiments show that several molecular features work similarly well over all benchmark datasets. The ones that stand out most are Spectrophores, which give significantly worse performance than other features on most datasets. Molecular descriptors from the PaDEL library seem very well suited for predicting physical properties of molecules. Despite their simplicity, MACCS fingerprints performed very well overall. The results show that learnable representations achieve competitive performance compared to expert based representations. However, task-specific representations (graph convolutions and Weave methods) rarely offer any benefits, even though they are computationally more demanding. Lastly, combining different molecular feature representations typically does not give a noticeable improvement in performance compared to individual feature representations.
|
19
|
Abstract
The cannabinoid type 2 receptor (CB2R) is an important therapeutic target for pain and inflammatory disorders. G protein-coupled receptors (GPCRs) are conventionally thought to signal exclusively at the plasma membrane; however, this view has recently been challenged by the notion of intracellular signalling receptors. Better understanding of GPCR location requires tools that can differentiate cell surface from subcellular receptors and that can access different parts of the body. Herein, we report the synthesis and pharmacological evaluation of polar chromenopyrazole-based CB2R-selective agonists that contain short peptides and that could be useful tools for interrogating CB2R.
|
20
|
Artificial intelligence in the early stages of drug discovery. Arch Biochem Biophys 2020; 698:108730. [PMID: 33347838 DOI: 10.1016/j.abb.2020.108730] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 11/05/2020] [Revised: 12/11/2020] [Accepted: 12/14/2020] [Indexed: 02/07/2023]
Abstract
Although the use of computational methods within the pharmaceutical industry is well established, there is an urgent need for new approaches that can improve and optimize the pipeline of drug discovery and development. Although there is no unique solution for this need for innovation, there has recently been strong interest in the use of Artificial Intelligence for this purpose. Not only have there been major contributions from the scientific community in this respect, but there has also been a growing partnership between the pharmaceutical industry and Artificial Intelligence companies. Beyond these contributions and efforts there is an underlying question, which we intend to discuss in this review: can the intrinsic difficulties within the drug discovery process be overcome with the implementation of Artificial Intelligence? While this is an open question, in this work we will focus on the advantages that these algorithms provide over the traditional methods in the context of early drug discovery.
|
21
|
Computational Approaches in Preclinical Studies on Drug Discovery and Development. Front Chem 2020; 8:726. [PMID: 33062633 PMCID: PMC7517894 DOI: 10.3389/fchem.2020.00726] [Citation(s) in RCA: 84] [Impact Index Per Article: 21.0] [Received: 03/29/2020] [Accepted: 07/14/2020] [Indexed: 12/11/2022] Open
Abstract
Because undesirable pharmacokinetics and toxicity are significant reasons for the failure of drug development in the costly late stage, it has been widely recognized that drug ADMET properties should be considered as early as possible to reduce failure rates in the clinical phase of drug discovery. Concurrently, drug recalls have become increasingly common in recent years, prompting pharmaceutical companies to increase attention toward the safety evaluation of preclinical drugs. In vitro and in vivo drug evaluation techniques are currently more mature in preclinical applications, but these technologies are costly. In recent years, with the rapid development of computer science, in silico technology has been widely used to evaluate the relevant properties of drugs in the preclinical stage and has produced many software programs and in silico models, further promoting the study of ADMET in vitro. In this review, we first introduce the two ADMET prediction categories (molecular modeling and data modeling). Then, we perform a systematic classification and description of the databases and software commonly used for ADMET prediction. We focus on some widely studied ADMT properties as well as PBPK simulation, and we list some applications that are related to the prediction categories and web tools. Finally, we discuss challenges and limitations in the preclinical area and propose some suggestions and prospects for the future.
|
22
|
Abstract
Nonalcoholic steatohepatitis (NASH) is considered a severe hepatic manifestation of the metabolic syndrome and has an alarming global prevalence. The ligand-activated transcription factors farnesoid X receptor (FXR) and peroxisome proliferator-activated receptor (PPAR) δ have been validated as molecular targets to counter NASH. To achieve robust therapeutic efficacy in this multifactorial pathology, combining peripheral PPARδ-mediated activity with the hepatic effects of FXR activation appears to be a promising multitarget approach. We have designed a minimal dual FXR/PPARδ activator scaffold by rational fusion of pharmacophores derived from selective agonists. Our dual agonist lead compound exhibited weak agonism on FXR and PPARδ and was structurally refined to a potent and balanced FXR/PPARδ activator in a computer-aided fashion. The resulting dual FXR/PPARδ modulator shows high selectivity over related nuclear receptors and activates the two target transcription factors in native cellular settings.
|
23
|
Systematic Modeling of log D7.4 Based on Ensemble Machine Learning, Group Contribution, and Matched Molecular Pair Analysis. J Chem Inf Model 2019; 60:63-76. [DOI: 10.1021/acs.jcim.9b00718] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Indexed: 12/18/2022]
|
24
|
Advancing Drug Discovery via Artificial Intelligence. Trends Pharmacol Sci 2019; 40:592-604. [DOI: 10.1016/j.tips.2019.06.004] [Citation(s) in RCA: 164] [Impact Index Per Article: 32.8] [Received: 03/07/2019] [Revised: 05/23/2019] [Accepted: 06/11/2019] [Indexed: 01/15/2023]
|
25
|
Application of Multivariate Adaptive Regression Splines (MARSplines) for Predicting Hansen Solubility Parameters Based on 1D and 2D Molecular Descriptors Computed from SMILES String. J CHEM-NY 2019. [DOI: 10.1155/2019/9858371] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 11/18/2022] Open
Abstract
A new method for predicting Hansen solubility parameters (HSPs) was developed by combining the multivariate adaptive regression splines (MARSplines) methodology with a simple multivariable regression involving 1D and 2D PaDEL molecular descriptors. In order to adapt the MARSplines approach to QSPR/QSAR problems, several optimization procedures were proposed and tested. The effectiveness of the obtained models was checked via standard QSPR/QSAR internal validation procedures provided by the QSARINS software and by predicting the solubility classification of polymers and drug-like solid solutes in collections of solvents. By utilizing information derived only from SMILES strings, the obtained models allow for computing all three Hansen solubility parameters: dispersion, polarization, and hydrogen bonding. Although several descriptors are required for proper parameter estimation, the proposed procedure is simple and straightforward and does not require a molecular geometry optimization. The obtained HSP values are highly correlated with experimental data, and their application to solubility problems gives essentially the same quality as the original parameters. Based on the provided models, it is possible to characterize any solvent or liquid solute for which HSP data are unavailable.
|
26
|
Predicting Off-Target Binding Profiles With Confidence Using Conformal Prediction. Front Pharmacol 2018; 9:1256. [PMID: 30459617 PMCID: PMC6233526 DOI: 10.3389/fphar.2018.01256] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/02/2018] [Accepted: 10/15/2018] [Indexed: 01/04/2023] Open
Abstract
Ligand-based models can be used in drug discovery to obtain an early indication of potential off-target interactions that could be linked to adverse effects. Another application is to combine such models into a panel, making it possible to compare and search for compounds with similar profiles. Most contemporary methods and implementations, however, lack valid measures of confidence in their predictions and provide only point predictions. We here describe a methodology that uses Conformal Prediction for predicting off-target interactions, with models trained on data from 31 targets in the ExCAPE-DB dataset selected for their utility in broad early hazard assessment. Chemicals were represented by the signature molecular descriptor, and support vector machines were used as the underlying machine learning method. By using conformal prediction, the results from predictions come in the form of confidence p-values for each class. The full pre-processing and model training process is openly available as scientific workflows on GitHub, rendering it fully reproducible. We illustrate the usefulness of the developed methodology on a set of compounds extracted from DrugBank. The resulting models are published online and are available via a graphical web interface and an OpenAPI interface for programmatic access.
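To make the phrase "confidence p-values for each class" concrete, the following is a minimal sketch of how an inductive conformal predictor can turn nonconformity scores into per-class p-values. It is not the authors' workflow: the function names, the class-conditional (Mondrian) calibration, and the toy scores are assumptions for illustration, and in practice the nonconformity scores would come from the underlying model (for example, the signed distance to an SVM decision boundary).

```python
from bisect import bisect_left


def conformal_p_values(calibration_scores, test_scores):
    """Return one conformal p-value per class for a single query compound.

    calibration_scores: dict mapping class label -> nonconformity scores of
        held-out calibration examples of that class (assumed layout).
    test_scores: dict mapping class label -> nonconformity score of the query
        compound when it is tentatively assigned that label.
    """
    p_values = {}
    for label, alpha in test_scores.items():
        cal = sorted(calibration_scores[label])
        # Count calibration examples at least as nonconforming as the query.
        n_at_least = len(cal) - bisect_left(cal, alpha)
        # The +1 terms account for the query example itself.
        p_values[label] = (n_at_least + 1) / (len(cal) + 1)
    return p_values


def prediction_set(p_values, significance):
    """Keep every label whose p-value exceeds the chosen significance level."""
    return {label for label, p in p_values.items() if p > significance}


# Toy numbers, invented purely for illustration.
calibration = {
    "active": [0.1, 0.4, 0.7, 0.9],
    "inactive": [0.2, 0.3, 0.5, 0.8],
}
query = {"active": 0.6, "inactive": 0.85}

p = conformal_p_values(calibration, query)
labels = prediction_set(p, significance=0.25)
```

With these toy scores the query gets a p-value of 0.6 for "active" and 0.2 for "inactive", so at a significance level of 0.25 only "active" stays in the prediction set. Lowering the significance level enlarges the set, which is how the method trades specificity for confidence.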
|
27
|
Evaluating parameters for ligand-based modeling with random forest on sparse data sets. J Cheminform 2018; 10:49. [PMID: 30306349 PMCID: PMC6755600 DOI: 10.1186/s13321-018-0304-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 05/17/2018] [Accepted: 10/03/2018] [Indexed: 11/10/2022] Open
Abstract
Ligand-based predictive modeling is widely used to generate predictive models that aid decision making in, e.g., drug discovery projects. With growing data sets and requirements for low modeling time comes the necessity to analyze data sets efficiently to support rapid and robust modeling. In this study we analyzed four data sets and studied the efficiency of machine learning methods on sparse data structures, utilizing Morgan fingerprints of different radii and hash sizes, and compared them with the molecular signatures descriptor at different heights. We specifically evaluated the effect these parameters had on modeling time, predictive performance, and memory requirements using two implementations of random forest: Scikit-learn and FEST. We also compared with a support vector machine implementation. Our results showed that unhashed fingerprints yield significantly better accuracy than hashed fingerprints ([Formula: see text]), with no pronounced deterioration in modeling time and memory usage. Furthermore, the fast execution and low memory usage of the FEST algorithm suggest that it is a good alternative for large, high-dimensional sparse data. Support vector machines and random forest performed equally well, but the results indicate that the support vector machine was better at using the extra information from larger values of the Morgan fingerprint's radius.
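The difference between hashed and unhashed fingerprints can be sketched in a few lines. The example below is only an illustration under stated assumptions: the integer feature identifiers are invented, the helper names are hypothetical, and folding by modulo is one common way to hash identifiers into a fixed-length vector, not necessarily the scheme used in the study. Real Morgan feature identifiers would come from a cheminformatics toolkit.

```python
from collections import Counter


def unhashed_features(feature_ids):
    """Sparse representation: one dimension per distinct feature identifier."""
    return dict(Counter(feature_ids))


def hashed_features(feature_ids, n_bits):
    """Fold identifiers into n_bits dimensions; distinct features may merge."""
    return dict(Counter(fid % n_bits for fid in feature_ids))


def count_collisions(feature_ids, n_bits):
    """Number of distinct features lost because they share a folded index."""
    distinct = set(feature_ids)
    folded = {fid % n_bits for fid in distinct}
    return len(distinct) - len(folded)


# Invented substructure identifiers for a single molecule (assumption).
ids = [5, 13, 5, 21, 42]

sparse = unhashed_features(ids)
folded = hashed_features(ids, n_bits=8)
lost = count_collisions(ids, n_bits=8)
```

Here the unhashed form keeps four separate dimensions, while folding into 8 positions merges identifiers 5, 13, and 21 into one index, so two distinct features are lost. A smaller hash size makes such collisions more likely, which is one plausible reason hashing could cost accuracy.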
|
28
|
How Precise Are Our Quantitative Structure-Activity Relationship Derived Predictions for New Query Chemicals? ACS OMEGA 2018; 3:11392-11406. [PMID: 31459245 PMCID: PMC6645132 DOI: 10.1021/acsomega.8b01647] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Received: 07/13/2018] [Accepted: 09/06/2018] [Indexed: 05/03/2023]
Abstract
Quantitative structure-activity relationship (QSAR) models have long been used for making predictions and filling data gaps in diverse fields including medicinal chemistry, predictive toxicology, environmental fate modeling, materials science, agricultural science, nanoscience, food science, and so forth. Usually a QSAR model is developed based on chemical information of a properly designed training set and the corresponding experimental response data, while the model is validated using one or more test sets for which the experimental response data are available. However, it is of interest to estimate the reliability of predictions when the model is applied to a completely new data set (a true external set), even when the new data points are within the applicability domain (AD) of the developed model. In the present study, we have categorized the quality of predictions for the test set or true external set into three groups (good, moderate, and bad) based on absolute prediction errors. Then, we have used three criteria [(a) mean absolute error of leave-one-out predictions for the 10 closest training compounds for each query molecule; (b) AD in terms of similarity based on the standardization approach; and (c) proximity of the predicted value of the query compound to the mean training response] in different weighting schemes to make a composite score of predictions. It was found that, using the most frequently appearing weighting scheme 0.5-0-0.5, the composite score-based categorization showed concordance with the absolute prediction error-based categorization for more than 80% of the test data points, working with 5 different datasets and 15 models for each set derived using three different splitting techniques. These observations were also confirmed with true external sets for another four endpoints, suggesting the applicability of the scheme for judging the reliability of predictions for new datasets. 
The scheme has been implemented in a tool "Prediction Reliability Indicator" available at http://dtclab.webs.com/software-tools and http://teqip.jdvu.ac.in/QSAR_Tools/DTCLab/, and the tool is presently valid for multiple linear regression models only.
|