1. Adrian M, Chung Y, Cheng AC. Denoising Drug Discovery Data for Improved Absorption, Distribution, Metabolism, Excretion, and Toxicity Property Prediction. J Chem Inf Model 2024; 64:6324-6337. [PMID: 39108185 DOI: 10.1021/acs.jcim.4c00639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/27/2024]
Abstract
Predicting absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of small molecules is a key task in drug discovery. A major challenge in building better ADMET models is the experimental error inherent in the data. Furthermore, ADMET predictors are typically regression tasks due to the continuous nature of the data, which makes it difficult to apply existing denoising methods from other domains as they largely focus on classification tasks. Here, we develop denoising schemes based on deep learning to address this. We find that the training error (TE) can be used to identify the noise in regression tasks while ensemble-based and forgotten event-based metrics fail to detect the noise. The most significant performance increase occurs when the original model is finetuned with the denoised data using TE as the noise detection metric. Our method has the ability to improve models with medium noise and does not degrade the performance of models with noise outside this range (low noise and high noise regimes). To our knowledge, our denoising scheme is the first to improve model performance for ADMET data and has implications for improving models for experimental assay data in general.
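As a rough illustration of the idea in this abstract, the sketch below flags the samples with the largest training error as likely noisy and refits on the rest. It is not the authors' implementation: the ordinary least-squares model, the synthetic data, the function names, and the 10% cutoff are all assumptions made for the example, and the refit only stands in for the fine-tuning step the abstract describes.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit with an intercept; a stand-in for the deep model."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply the fitted coefficients (last entry is the intercept)."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def flag_noisy_by_training_error(X, y, frac=0.1):
    """Return indices of the `frac` of samples with the largest training error."""
    coef = fit_ols(X, y)
    train_err = np.abs(predict(coef, X) - y)
    n_flag = int(round(frac * len(y)))
    return np.argsort(train_err)[::-1][:n_flag]

# Synthetic data (assumption): a linear signal with 20 corrupted labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
noisy_idx = rng.choice(200, size=20, replace=False)
y[noisy_idx] += rng.choice([-1.0, 1.0], size=20) * 5.0

flagged = flag_noisy_by_training_error(X, y, frac=0.1)
keep = np.setdiff1d(np.arange(len(y)), flagged)
coef_denoised = fit_ols(X[keep], y[keep])  # refit on the retained samples
```

In practice the cutoff fraction would have to be tuned, since the abstract reports that the benefit depends on the noise level of the data.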
Affiliation(s)
- Matthew Adrian
- Modeling and Informatics, Merck & Co., Inc., South San Francisco, California 94080, United States
- Yunsie Chung
- Modeling and Informatics, Merck & Co., Inc., South San Francisco, California 94080, United States
- Alan C Cheng
- Modeling and Informatics, Merck & Co., Inc., South San Francisco, California 94080, United States

2. Xu Y, Liaw A, Sheridan RP, Svetnik V. Development and Evaluation of Conformal Prediction Methods for Quantitative Structure-Activity Relationship. ACS Omega 2024; 9:29478-29490. [PMID: 39005801 PMCID: PMC11238240 DOI: 10.1021/acsomega.4c02017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Revised: 06/10/2024] [Accepted: 06/12/2024] [Indexed: 07/16/2024]
Abstract
The quantitative structure-activity relationship (QSAR) regression model is a commonly used technique for predicting the biological activities of compounds using their molecular descriptors. Besides accurate activity estimation, obtaining a prediction uncertainty metric like a prediction interval is highly desirable. Quantifying prediction uncertainty is an active research area in statistical and machine learning (ML), but the implementation for QSAR remains challenging. However, most ML algorithms with high predictive performance require add-on companions for estimating the uncertainty of their prediction. Conformal prediction (CP) is a promising approach as its main components are agnostic to the prediction modes, and it produces valid prediction intervals under weak assumptions on the data distribution. We proposed computationally efficient CP algorithms tailored to the most widely used ML models, including random forests, deep neural networks, and gradient boosting. The algorithms use a novel approach to the derivation of nonconformity scores from the estimates of prediction uncertainty generated by the ensembles of point predictions. The validity and efficiency of proposed algorithms are demonstrated on a diverse collection of QSAR data sets as well as simulation studies. The provided software implementing our algorithms can be used as stand-alone or easily incorporated into other ML software packages for QSAR modeling.
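For orientation, the sketch below shows plain split conformal prediction with absolute-residual nonconformity scores, which is the textbook baseline rather than the ensemble-derived scores this paper proposes. The function name `split_conformal_interval`, the synthetic data, and the use of the true mean as a stand-in "fitted model" are assumptions made for illustration.

```python
import math
import numpy as np

def split_conformal_interval(pred_cal, y_cal, pred_new, alpha=0.1):
    """Symmetric prediction intervals from held-out calibration residuals."""
    scores = np.abs(np.asarray(y_cal) - np.asarray(pred_cal))
    n = len(scores)
    # Rank of the conformal quantile; if it exceeds n the interval is unbounded.
    k = math.ceil((n + 1) * (1 - alpha))
    q = np.inf if k > n else np.sort(scores)[k - 1]
    pred_new = np.asarray(pred_new)
    return pred_new - q, pred_new + q

# Synthetic check (assumption): noisy sine data, predictions from the true mean.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=2500)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
pred = np.sin(x)

lo, hi = split_conformal_interval(pred[:500], y[:500], pred[500:], alpha=0.1)
coverage = np.mean((y[500:] >= lo) & (y[500:] <= hi))
```

With exchangeable calibration and test data, the empirical `coverage` should land near the nominal 90%, whatever point predictor produced `pred`.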
Affiliation(s)
- Yuting Xu
- Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Andy Liaw
- Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States
- Robert P. Sheridan
- Modeling and Informatics, Merck & Co., Inc., Rahway, New Jersey 07033, United States
- Vladimir Svetnik
- Early Development Statistics, Merck & Co., Inc., Rahway, New Jersey 07065, United States

3. Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
- LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK.
- Norbert Furtmann
- R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
- Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
- Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
- Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway

4. Fan Z, Yu J, Zhang X, Chen Y, Sun S, Zhang Y, Chen M, Xiao F, Wu W, Li X, Zheng M, Luo X, Wang D. Reducing overconfident errors in molecular property classification using Posterior Network. Patterns (N Y) 2024; 5:100991. [PMID: 39005492 PMCID: PMC11240180 DOI: 10.1016/j.patter.2024.100991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Revised: 12/20/2023] [Accepted: 04/15/2024] [Indexed: 07/16/2024]
Abstract
Deep-learning-based classification models are increasingly used for predicting molecular properties in drug development. However, traditional classification models using the Softmax function often give overconfident mispredictions for out-of-distribution samples, highlighting a critical lack of accurate uncertainty estimation. Such limitations can result in substantial costs and should be avoided during drug development. Inspired by advances in evidential deep learning and Posterior Network, we replaced the Softmax function with a normalizing flow to enhance the uncertainty estimation ability of the model in molecular property classification. The proposed strategy was evaluated across diverse scenarios, including simulated experiments based on a synthetic dataset, ADMET predictions, and ligand-based virtual screening. The results demonstrate that compared with the vanilla model, the proposed strategy effectively alleviates the problem of giving overconfident but incorrect predictions. Our findings support the promising application of evidential deep learning in drug development and offer a valuable framework for further research.
Affiliation(s)
- Zhehuan Fan
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Jie Yu
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Xiang Zhang
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Yijie Chen
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Shihui Sun
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Yuanyuan Zhang
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Mingan Chen
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- School of Physical Science and Technology, ShanghaiTech University, Shanghai 201210, China
- Lingang Laboratory, Shanghai 200031, China
- Fu Xiao
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Wenyong Wu
- Lingang Laboratory, Shanghai 200031, China
- Xutong Li
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- Mingyue Zheng
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China
- Xiaomin Luo
- Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, 555 Zuchongzhi Road, Shanghai 201203, China
- University of Chinese Academy of Sciences, 19A Yuquan Road, Beijing 100049, China
- School of Chinese Materia Medica, Nanjing University of Chinese Medicine, Nanjing 210023, China

5. Balraadjsing S, Peijnenburg WJGM, Vijver MG. Building species trait-specific nano-QSARs: Model stacking, navigating model uncertainties and limitations, and the effect of dataset size. Environ Int 2024; 188:108764. [PMID: 38788418 DOI: 10.1016/j.envint.2024.108764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 05/17/2024] [Accepted: 05/19/2024] [Indexed: 05/26/2024]
Abstract
A strong need exists for broadly applicable nano-QSARs, capable of predicting toxicological outcomes towards untested species and nanomaterials, under different environmental conditions. Existing nano-QSARs are generally limited to only a few species but the inclusion of species characteristics into models can aid in making them applicable to multiple species, even when toxicity data is not available for biological species. Species traits were used to create classification- and regression machine learning models to predict acute toxicity towards aquatic species for metallic nanomaterials. Afterwards, the individual classification- and regression models were stacked into a meta-model to improve performance. Additionally, the uncertainty and limitations of the models were assessed in detail (beyond the OECD principles) and it was investigated whether models would benefit from the addition of more data. Results showed a significant improvement in model performance following model stacking. Investigation of model uncertainties and limitations highlighted the discrepancy between the applicability domain and accuracy of predictions. Data points outside of the assessed chemical space did not have higher likelihoods of generating inadequate predictions or vice versa. It is therefore concluded that the applicability domain does not give complete insight into the uncertainty of predictions and instead the generation of prediction intervals can help in this regard. Furthermore, results indicated that an increase of the dataset size did not improve model performance. This implies that larger dataset sizes may not necessarily improve model performance while in turn also meaning that large datasets are not necessarily required for prediction of acute toxicity with nano-QSARs.
Affiliation(s)
- Surendra Balraadjsing
- Institute of Environmental Sciences (CML), Leiden University, PO Box 9518, 2300 RA Leiden, the Netherlands.
- Willie J G M Peijnenburg
- Institute of Environmental Sciences (CML), Leiden University, PO Box 9518, 2300 RA Leiden, the Netherlands; Centre for Safety of Substances and Products, National Institute of Public Health and the Environment (RIVM), PO Box 1, 3720 BA Bilthoven, the Netherlands
- Martina G Vijver
- Institute of Environmental Sciences (CML), Leiden University, PO Box 9518, 2300 RA Leiden, the Netherlands

6. Lovrić M, Wang T, Staffe MR, Šunić I, Časni K, Lasky-Su J, Chawes B, Rasmussen MA. A Chemical Structure and Machine Learning Approach to Assess the Potential Bioactivity of Endogenous Metabolites and Their Association with Early Childhood Systemic Inflammation. Metabolites 2024; 14:278. [PMID: 38786755 PMCID: PMC11122766 DOI: 10.3390/metabo14050278] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2024] [Revised: 04/29/2024] [Accepted: 05/08/2024] [Indexed: 05/25/2024]
Abstract
Metabolomics has gained much attention due to its potential to reveal molecular disease mechanisms and present viable biomarkers. This work uses a panel of untargeted serum metabolomes from 602 children from the COPSAC2010 mother-child cohort. The annotated part of the metabolome consists of 517 chemical compounds curated using automated procedures. We created a filtering method for the quantified metabolites using predicted quantitative structure-bioactivity relationships for the Tox21 database on nuclear receptors and stress response in cell lines. The metabolites measured in the children's serums are predicted to affect specific targeted models, known for their significance in inflammation, immune function, and health outcomes. The targets from Tox21 have been used as targets with quantitative structure-activity relationships (QSARs). They were trained for ~7000 structures, saved as models, and then applied to the annotated metabolites to predict their potential bioactivities. The models were selected based on strict accuracy criteria surpassing random effects. After application, 52 metabolites showed potential bioactivity based on structural similarity with known active compounds from the Tox21 set. The filtered compounds were subsequently used and weighted by their bioactive potential to show an association with early childhood hs-CRP levels at six months in a linear model supporting a physiological adverse effect on systemic low-grade inflammation.
Affiliation(s)
- Mario Lovrić
- COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
- Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
- The Lisbon Council, 1040 Brussels, Belgium
- Tingting Wang
- COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
- Mads Rønnow Staffe
- Department of Food Science, University of Copenhagen, 1958 Frederiksberg, Denmark
- Iva Šunić
- Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
- Jessica Lasky-Su
- Department of Medicine, Boston, MA 02115, USA
- Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Bo Chawes
- COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
- Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen, 2300 Copenhagen, Denmark
- Morten Arendt Rasmussen
- COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, 2820 Gentofte, Denmark
- Department of Food Science, University of Copenhagen, 1958 Frederiksberg, Denmark

7. Rodríguez-Belenguer P, Mangas-Sanjuan V, Soria-Olivas E, Pastor M. Integrating Mechanistic and Toxicokinetic Information in Predictive Models of Cholestasis. J Chem Inf Model 2024; 64:2775-2788. [PMID: 37660324 PMCID: PMC11005038 DOI: 10.1021/acs.jcim.3c00945] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 06/22/2023] [Indexed: 09/05/2023]
Abstract
Drug development involves the thorough assessment of the candidate's safety and efficacy. In silico toxicology (IST) methods can contribute to the assessment, complementing in vitro and in vivo experimental methods, since they have many advantages in terms of cost and time. Also, they are less demanding concerning the requirements of product and experimental animals. One of these methods, Quantitative Structure-Activity Relationships (QSAR), has been proven successful in predicting simple toxicity end points but has more difficulties in predicting end points involving more complex phenomena. We hypothesize that QSAR models can produce better predictions of these end points by combining multiple QSAR models describing simpler biological phenomena and incorporating pharmacokinetic (PK) information, using quantitative in vitro to in vivo extrapolation (QIVIVE) models. In this study, we applied our methodology to the prediction of cholestasis and compared it with direct QSAR models. Our results show a clear increase in sensitivity. The predictive quality of the models was further assessed to mimic realistic conditions where the query compounds show low similarity with the training series. Again, our methodology shows clear advantages over direct QSAR models in these situations. We conclude that the proposed methodology could improve existing methodologies and could be suitable for being applied to other toxicity end points.
Affiliation(s)
- Pablo Rodríguez-Belenguer
- Research Programme on Biomedical Informatics (GRIB), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain
- Department of Pharmacy and Pharmaceutical Technology and Parasitology, Universitat de València, 46100 Valencia, Spain
- Victor Mangas-Sanjuan
- Department of Pharmacy and Pharmaceutical Technology and Parasitology, Universitat de València, 46100 Valencia, Spain
- Interuniversity Research Institute for Molecular Recognition and Technological Development, Universitat Politècnica de València, 46100 Valencia, Spain
- Emilio Soria-Olivas
- IDAL, Intelligent Data Analysis Laboratory, ETSE, Universitat de València, 46100 Valencia, Spain
- Manuel Pastor
- Research Programme on Biomedical Informatics (GRIB), Department of Medicine and Life Sciences, Universitat Pompeu Fabra, Hospital del Mar Medical Research Institute, 08003 Barcelona, Spain

8. Mansouri K, Moreira-Filho JT, Lowe CN, Charest N, Martin T, Tkachenko V, Judson R, Conway M, Kleinstreuer NC, Williams AJ. Free and open-source QSAR-ready workflow for automated standardization of chemical structures in support of QSAR modeling. J Cheminform 2024; 16:19. [PMID: 38378618 PMCID: PMC10880251 DOI: 10.1186/s13321-024-00814-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Accepted: 02/10/2024] [Indexed: 02/22/2024]
Abstract
The rapid increase of publicly available chemical structures and associated experimental data presents a valuable opportunity to build robust QSAR models for applications in different fields. However, the common concern is the quality of both the chemical structure information and associated experimental data. This is especially true when those data are collected from multiple sources as chemical substance mappings can contain many duplicate structures and molecular inconsistencies. Such issues can impact the resulting molecular descriptors and their mappings to experimental data and, subsequently, the quality of the derived models in terms of accuracy, repeatability, and reliability. Herein we describe the development of an automated workflow to standardize chemical structures according to a set of standard rules and generate two and/or three-dimensional "QSAR-ready" forms prior to the calculation of molecular descriptors. The workflow was designed in the KNIME workflow environment and consists of three high-level steps. First, a structure encoding is read, and then the resulting in-memory representation is cross-referenced with any existing identifiers for consistency. Finally, the structure is standardized using a series of operations including desalting, stripping of stereochemistry (for two-dimensional structures), standardization of tautomers and nitro groups, valence correction, neutralization when possible, and then removal of duplicates. This workflow was initially developed to support collaborative modeling QSAR projects to ensure consistency of the results from the different participants. It was then updated and generalized for other modeling applications. This included modification of the "QSAR-ready" workflow to generate "MS-ready structures" to support the generation of substance mappings and searches for software applications related to non-targeted analysis mass spectrometry. 
Both QSAR and MS-ready workflows are freely available in KNIME, via standalone versions on GitHub, and as docker container resources for the scientific community. Scientific contribution: This work pioneers an automated workflow in KNIME, systematically standardizing chemical structures to ensure their readiness for QSAR modeling and broader scientific applications. By addressing data quality concerns through desalting, stereochemistry stripping, and normalization, it optimizes molecular descriptors' accuracy and reliability. The freely available resources in KNIME, GitHub, and docker containers democratize access, benefiting collaborative research and advancing diverse modeling endeavors in chemistry and mass spectrometry.
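Two of the steps named in this abstract, desalting and duplicate removal, can be pictured with a deliberately crude, string-only sketch. It is not the KNIME workflow: keeping the longest dot-separated SMILES fragment is only a rough proxy for the parent structure, exact string matching does not canonicalize structures, and the function names and example records are hypothetical. A real pipeline would rely on a cheminformatics toolkit for these operations and for the tautomer, valence, and charge handling that is omitted here.

```python
def desalt(smiles: str) -> str:
    """Keep the longest dot-separated fragment (crude proxy for the parent)."""
    return max(smiles.split("."), key=len)

def standardize(smiles_list):
    """Desalt each record, then drop exact-string duplicates in input order."""
    seen, out = set(), []
    for smi in smiles_list:
        parent = desalt(smi.strip())
        if parent and parent not in seen:
            seen.add(parent)
            out.append(parent)
    return out

# Hypothetical input: ethanol, its hydrochloride record, and two acetate records.
records = ["CCO", "CCO.Cl", "[Na+].CC(=O)[O-]", "CC(=O)[O-]"]
cleaned = standardize(records)
```

Here the four input records collapse to two parent structures, which is the kind of many-to-one mapping the abstract warns about when data are merged from multiple sources.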
Affiliation(s)
- Kamel Mansouri
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA.
- José T Moreira-Filho
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Charles N Lowe
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Nathaniel Charest
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Todd Martin
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Richard Judson
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA
- Mike Conway
- National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Nicole C Kleinstreuer
- National Toxicology Program Interagency Center for the Evaluation of Alternative Toxicological Methods, National Institute of Environmental Health Sciences, Research Triangle Park, NC, 27709, USA
- Antony J Williams
- Center for Computational Toxicology and Exposure, Office of Research and Development, U.S. Environmental Protection Agency, Research Triangle Park, NC, 27711, USA

9. Kim Y, Jung H, Kumar S, Paton RS, Kim S. Designing solvent systems using self-evolving solubility databases and graph neural networks. Chem Sci 2024; 15:923-939. [PMID: 38239675 PMCID: PMC10793204 DOI: 10.1039/d3sc03468b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2023] [Accepted: 12/04/2023] [Indexed: 01/22/2024]
Abstract
Designing solvent systems is key to achieving the facile synthesis and separation of desired products from chemical processes, so many machine learning models have been developed to predict solubilities. However, breakthroughs are needed to address deficiencies in the model's predictive accuracy and generalizability; this can be addressed by expanding and integrating experimental and computational solubility databases. To maximize predictive accuracy, these two databases should not be trained separately, and they should not be simply combined without reconciling the discrepancies from different magnitudes of errors and uncertainties. Here, we introduce self-evolving solubility databases and graph neural networks developed through semi-supervised self-training approaches. Solubilities from quantum-mechanical calculations are referred to during semi-supervised learning, but they are not directly added to the experimental database. Dataset augmentation is performed from 11 637 experimental solubilities to >900 000 data points in the integrated database, while correcting for the discrepancies between experiment and computation. Our model was successfully applied to study solvent selection in organic reactions and separation processes. The accuracy (mean absolute error around 0.2 kcal mol⁻¹ for the test set) is quantitatively useful in exploring Linear Free Energy Relationships between reaction rates and solvation free energies for 11 organic reactions. Our model also accurately predicted the partition coefficients of lignin-derived monomers and drug-like molecules. While there is room for expanding solubility predictions to transition states, radicals, charged species, and organometallic complexes, this approach will be attractive to predictive chemistry areas where experimental, computational, and other heterogeneous data should be combined.
Affiliation(s)
- Yeonjoon Kim
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Department of Chemistry, Pukyong National University Busan 48513 Republic of Korea
- Hojin Jung
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Sabari Kumar
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Robert S Paton
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA
- Seonah Kim
- Department of Chemistry, Colorado State University Fort Collins CO 80523 USA

10. Grandits M, Ecker GF. Ligand- and Structure-based Approaches for Transmembrane Transporter Modeling. Curr Drug Res Rev 2024; 16:81-93. [PMID: 37157206 PMCID: PMC11340286 DOI: 10.2174/2589977515666230508123041] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/19/2022] [Revised: 03/15/2023] [Accepted: 03/28/2023] [Indexed: 05/10/2023]
Abstract
The study of transporter proteins is key to understanding the mechanism behind multidrug resistance and drug-drug interactions causing severe side effects. While ATP-binding transporters are well-studied, solute carriers illustrate an understudied family with a high number of orphan proteins. To study these transporters, in silico methods can be used to shed light on the basic molecular machinery by studying protein-ligand interactions. Nowadays, computational methods are an integral part of the drug discovery and development process. In this short review, computational approaches, such as machine learning, are discussed, which try to tackle interactions between transport proteins and certain compounds to locate target proteins. Furthermore, a few cases of selected members of the ATP binding transporter and solute carrier family are covered, which are of high interest in clinical drug interaction studies, especially for regulatory agencies. The strengths and limitations of ligand-based and structure-based methods are discussed to highlight their applicability for different studies. Furthermore, the combination of multiple approaches can improve the information obtained to find crucial amino acids that explain important interactions of protein-ligand complexes in more detail. This allows the design of drug candidates with increased activity towards a target protein, which further helps to support future synthetic efforts.
Affiliation(s)
- Melanie Grandits
- Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria
- Gerhard F. Ecker
- Department of Pharmaceutical Sciences, University of Vienna, Vienna, Austria

11. Lovrić M, Wang T, Staffe MR, Šunić I, Časni K, Lasky-Su J, Chawes B, Rasmussen MA. A chemical structure and machine learning approach to assess the potential bioactivity of endogenous metabolites and their association with early-childhood hs-CRP levels. bioRxiv 2023:2023.11.15.567095. [PMID: 38014335 PMCID: PMC10680762 DOI: 10.1101/2023.11.15.567095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Metabolomics has gained much attraction due to its potential to reveal molecular disease mechanisms and present viable biomarkers. In this work we used a panel of untargeted serum metabolomes in 602 childhood patients of the COPSAC2010 mother-child cohort. The annotated part of the metabolome consists of 493 chemical compounds curated using automated procedures. Using predicted quantitative-structure-bioactivity relationships for the Tox21 database on nuclear receptors and stress response in cell lines, we created a filtering method for the vast number of quantified metabolites. The metabolites measured in children's serums used here have predicted potential against the chosen target modelled targets. The targets from Tox21 have been used with quantitative structure-activity relationships (QSARs) and were trained for ~7000 structures, saved as models, and then applied to 493 metabolites to predict their potential bioactivities. The models were selected based on strict accuracy criteria surpassing random effects. After application, 52 metabolites showed potential bioactivity based on structural similarity with known active compounds from the Tox21 set. The filtered compounds were subsequently used and weighted by their bioactive potential to show an association with early childhood hs-CRP levels at six months in a linear model supporting a physiological adverse effect on systemic low-grade inflammation. The significant metabolites were reported.
Affiliation(s)
- Mario Lovrić
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
  - Faculty of Electrical Engineering, Computer Science and Information Technology Osijek, Josip Juraj Strossmayer University of Osijek, Kneza Trpimira 2b, HR-31000 Osijek, Croatia
- Tingting Wang
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Mads Rønnow Staffe
  - University of Copenhagen, Department of Food Science, Rolighedsvej 26, 1958 Frb. C., Denmark
- Iva Šunić
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
- Jessica Lasky-Su
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
  - Centre for Applied Bioanthropology, Institute for Anthropological Research, 10000 Zagreb, Croatia
  - Faculty of Electrical Engineering, Computer Science and Information Technology Osijek, Josip Juraj Strossmayer University of Osijek, Kneza Trpimira 2b, HR-31000 Osijek, Croatia
  - University of Copenhagen, Department of Food Science, Rolighedsvej 26, 1958 Frb. C., Denmark
  - Know-Center, Inffeldgasse 13, AT-8010 Graz
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
- Bo Chawes
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Morten Arendt Rasmussen
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
  - University of Copenhagen, Department of Food Science, Rolighedsvej 26, 1958 Frb. C., Denmark
12
Deng J, Yang Z, Wang H, Ojima I, Samaras D, Wang F. A systematic study of key elements underlying molecular property prediction. Nat Commun 2023; 14:6395. [PMID: 37833262 PMCID: PMC10575948 DOI: 10.1038/s41467-023-41948-6] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 10/24/2022] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
Artificial intelligence (AI) has been widely applied in drug discovery, with molecular property prediction as a major task. Despite booming techniques in molecular representation learning, key elements underlying molecular property prediction remain largely unexplored, which impedes further advancements in this field. Herein, we conduct an extensive evaluation of representative models using various representations on the MoleculeNet datasets, a suite of opioids-related datasets and two additional activity datasets from the literature. To investigate the predictive power in low-data and high-data space, a series of descriptor datasets of varying sizes are also assembled to evaluate the models. In total, we have trained 62,820 models, including 50,220 models on fixed representations, 4200 models on SMILES sequences and 8400 models on molecular graphs. Based on extensive experimentation and rigorous comparison, we show that representation learning models exhibit limited performance in molecular property prediction on most datasets. In addition, multiple key elements underlying molecular property prediction can affect the evaluation results. Furthermore, we show that activity cliffs can significantly impact model prediction. Finally, we explore potential reasons why representation learning models can fail and show that dataset size is essential for representation learning models to excel.
Affiliation(s)
- Jianyuan Deng
  - Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA
- Zhibo Yang
  - Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
- Hehe Wang
  - Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
- Iwao Ojima
  - Stony Brook University, Department of Chemistry, Stony Brook, NY, 11794, USA
- Dimitris Samaras
  - Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
- Fusheng Wang
  - Stony Brook University, Department of Biomedical Informatics, Stony Brook, NY, 11794, USA
  - Stony Brook University, Department of Computer Science, Stony Brook, NY, 11794, USA
13
Sinha K, Ghosh N, Sil PC. A Review on the Recent Applications of Deep Learning in Predictive Drug Toxicological Studies. Chem Res Toxicol 2023; 36:1174-1205. [PMID: 37561655 DOI: 10.1021/acs.chemrestox.2c00375] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 08/12/2023]
Abstract
Drug toxicity prediction is an important step in ensuring patient safety during drug design studies. While traditional preclinical studies have historically relied on animal models to evaluate toxicity, recent advances in deep-learning approaches have shown great promise in advancing drug safety science and reducing animal use in preclinical studies. However, deep-learning-based approaches also face challenges in handling large biological data sets, model interpretability, and regulatory acceptance. In this review, we provide an overview of recent developments in deep-learning-based approaches for predicting drug toxicity, highlighting their potential advantages over traditional methods and the need to address their limitations. Deep-learning models have demonstrated excellent performance in predicting toxicity outcomes from various data sources such as chemical structures, genomic data, and high-throughput screening assays. The potential of deep learning for automated feature engineering is also discussed. This review emphasizes the need to address ethical concerns related to the use of deep learning in drug toxicity studies, including the reduction of animal use and ensuring regulatory acceptance. Furthermore, emerging applications of deep learning in drug toxicity prediction, such as predicting drug-drug interactions and toxicity in rare subpopulations, are highlighted. The integration of deep-learning-based approaches with traditional methods is discussed as a way to develop more reliable and efficient predictive models for drug safety assessment, paving the way for safer and more effective drug discovery and development. Overall, this review highlights the critical role of deep learning in predictive toxicology and drug safety evaluation, emphasizing the need for continued research and development in this rapidly evolving field. By addressing the limitations of traditional methods, leveraging the potential of deep learning for automated feature engineering, and addressing ethical concerns, deep-learning-based approaches have the potential to revolutionize drug toxicity prediction and improve patient safety in drug discovery and development.
Affiliation(s)
- Krishnendu Sinha
  - Department of Zoology, Jhargram Raj College, Jhargram 721507, West Bengal, India
- Nabanita Ghosh
  - Department of Zoology, Maulana Azad College, Kolkata 700013, West Bengal, India
- Parames C Sil
  - Division of Molecular Medicine, Bose Institute, Kolkata 700054, West Bengal, India
14
Lunghini F, Fava A, Pisapia V, Sacco F, Iaconis D, Beccari AR. ProfhEX: AI-based platform for small molecules liability profiling. J Cheminform 2023; 15:60. [PMID: 37296454 DOI: 10.1186/s13321-023-00728-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2022] [Accepted: 05/28/2023] [Indexed: 06/12/2023] Open
Abstract
Off-target drug interactions are a major reason for candidate failure in the drug discovery process. Anticipating a drug's potential adverse effects in the early stages is necessary to minimize health risks to patients, animal testing, and economic costs. With the constantly increasing size of virtual screening libraries, AI-driven methods can be exploited as first-tier screening tools to provide liability estimation for drug candidates. In this work we present ProfhEX, an AI-driven suite of 46 OECD-compliant machine learning models that can profile small molecules on 7 relevant liability groups: cardiovascular, central nervous system, gastrointestinal, endocrine, renal, pulmonary and immune system toxicities. Experimental affinity data were collected from public and commercial data sources. The entire chemical space comprised 289,202 activity data points for a total of 210,116 unique compounds, spanning 46 targets with dataset sizes ranging from 819 to 18,896. Gradient boosting and random forest algorithms were initially employed and ensembled for the selection of a champion model. Models were validated according to the OECD principles, including robust internal (cross validation, bootstrap, y-scrambling) and external validation. Champion models achieved an average Pearson correlation coefficient of 0.84 (SD of 0.05), a coefficient of determination (R²) of 0.68 (SD of 0.1) and a root mean squared error of 0.69 (SD of 0.08). All liability groups showed good hit-detection power with an average enrichment factor at 5% of 13.1 (SD of 4.5) and AUC of 0.92 (SD of 0.05). Benchmarking against existing tools demonstrated the predictive power of ProfhEX models for large-scale liability profiling. This platform will be further expanded with the inclusion of new targets and through complementary modelling approaches, such as structure- and pharmacophore-based models. ProfhEX is freely accessible at the following address: https://profhex.exscalate.eu/ .
Affiliation(s)
- Filippo Lunghini
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Naples, Italy
- Anna Fava
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Naples, Italy
- Vincenzo Pisapia
  - Professional Service Department, SAS Institute, Via Darwin 20/22, 20143, Milan, Italy
- Francesco Sacco
  - Professional Service Department, SAS Institute, Via Darwin 20/22, 20143, Milan, Italy
- Daniela Iaconis
  - EXSCALATE, Dompé Farmaceutici SpA, Via Tommaso de Amicis 95, 80123, Naples, Italy
15
Lazare J, Tebes-Stevens C, Weber EJ. A multiple linear regression approach to the estimation of carboxylic acid ester and lactone alkaline hydrolysis rate constants. SAR QSAR Environ Res 2023; 34:183-210. [PMID: 36951517 PMCID: PMC10547131 DOI: 10.1080/1062936x.2023.2188608] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 02/25/2023] [Indexed: 05/03/2023]
Abstract
Pesticides, pharmaceuticals, and other organic contaminants often undergo hydrolysis when released into the environment; therefore, measured or estimated hydrolysis rates are needed to assess their environmental persistence. An intuitive multiple linear regression (MLR) approach was used to develop robust QSARs for predicting base-catalyzed rate constants of carboxylic acid esters (CAEs) and lactones. We explored various combinations of independent descriptors, resulting in four primary models (two for lactones and two for CAEs), with a total of 15 and 11 parameters included in the CAE and lactone QSAR models, respectively. The most significant descriptors include pKa, electronegativity, charge density, and steric parameters. Model performance is assessed using Drug Theoretics and Cheminformatics Laboratory's DTC-QSAR tool, demonstrating high accuracy for both internal validation (r² = 0.93 and RMSE = 0.41-0.43 for CAEs; r² = 0.90-0.93 and RMSE = 0.38-0.46 for lactones) and external validation (r² = 0.93 and RMSE = 0.43-0.45 for CAEs; r² = 0.94-0.98 and RMSE = 0.33-0.41 for lactones). The developed models require only low-cost computational resources and have substantially improved performance compared to existing hydrolysis rate prediction models (HYDROWIN and SPARC).
Affiliation(s)
- Jovian Lazare
  - Oak Ridge Institute for Science and Education (ORISE), hosted at U.S. Environmental Protection Agency, Athens, Georgia 30605, United States
- Caroline Tebes-Stevens
  - Center for Environmental Measurement and Modeling, United States Environmental Protection Agency, Athens, Georgia 30605, United States
- Eric J. Weber
  - Center for Environmental Measurement and Modeling, United States Environmental Protection Agency, Athens, Georgia 30605, United States
16
Bernau CR, Knödler M, Emonts J, Jäpel RC, Buyel JF. The use of predictive models to develop chromatography-based purification processes. Front Bioeng Biotechnol 2022; 10:1009102. [PMID: 36312533 PMCID: PMC9605695 DOI: 10.3389/fbioe.2022.1009102] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/01/2022] [Accepted: 09/23/2022] [Indexed: 11/13/2022] Open
Abstract
Chromatography is the workhorse of biopharmaceutical downstream processing because it can selectively enrich a target product while removing impurities from complex feed streams. This is achieved by exploiting differences in molecular properties, such as size, charge and hydrophobicity (alone or in different combinations). Accordingly, many parameters must be tested during process development in order to maximize product purity and recovery, including resin and ligand types, conductivity, pH, gradient profiles, and the sequence of separation operations. The number of possible experimental conditions quickly becomes unmanageable. Although the range of suitable conditions can be narrowed based on experience, the time and cost of the work remain high even when using high-throughput laboratory automation. In contrast, chromatography modeling using inexpensive, parallelized computer hardware can provide expert knowledge, predicting conditions that achieve high purity and efficient recovery. The prediction of suitable conditions in silico reduces the number of empirical tests required and provides in-depth process understanding, which is recommended by regulatory authorities. In this article, we discuss the benefits and specific challenges of chromatography modeling. We describe the experimental characterization of chromatography devices and settings prior to modeling, such as the determination of column porosity. We also consider the challenges that must be overcome when models are set up and calibrated, including the cross-validation and verification of data-driven and hybrid (combined data-driven and mechanistic) models. This review will therefore support researchers intending to establish a chromatography modeling workflow in their laboratory.
Affiliation(s)
- C. R. Bernau
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Aachen, Germany
- M. Knödler
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Aachen, Germany
  - Institute for Molecular Biotechnology, RWTH Aachen University, Aachen, Germany
- J. Emonts
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Aachen, Germany
- R. C. Jäpel
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Aachen, Germany
  - Institute for Molecular Biotechnology, RWTH Aachen University, Aachen, Germany
- J. F. Buyel
  - Fraunhofer Institute for Molecular Biology and Applied Ecology IME, Aachen, Germany
  - Institute for Molecular Biotechnology, RWTH Aachen University, Aachen, Germany
  - University of Natural Resources and Life Sciences, Vienna (BOKU), Department of Biotechnology (DBT), Institute of Bioprocess Science and Engineering (IBSE), Vienna, Austria
  - *Correspondence: J. F. Buyel,
17
Bellamy H, Rehim AA, Orhobor OI, King R. Batched Bayesian Optimization for Drug Design in Noisy Environments. J Chem Inf Model 2022; 62:3970-3981. [PMID: 36044048 PMCID: PMC9472273 DOI: 10.1021/acs.jcim.2c00602] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/29/2022]
Abstract
The early stages of the drug design process involve identifying compounds with suitable bioactivities via noisy assays. As databases of possible drugs are often very large, assays can only be performed on a subset of the candidates. Selecting which assays to perform is best done within an active learning process, such as batched Bayesian optimization, and aims to reduce the number of assays that must be performed. We compare how noise affects different batched Bayesian optimization techniques and introduce a retest policy to mitigate the effect of noise. Our experiments show that batched Bayesian optimization remains effective, even when large amounts of noise are present, and that the retest policy enables more active compounds to be identified in the same number of experiments.
Affiliation(s)
- Hugo Bellamy
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
- Abbi Abdel Rehim
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
- Oghenejokpeme I Orhobor
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, UK
- Ross King
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge CB3 0AS, UK