1
Horne RI, Andrzejewska EA, Alam P, Brotzakis ZF, Srivastava A, Aubert A, Nowinska M, Gregory RC, Staats R, Possenti A, Chia S, Sormanni P, Ghetti B, Caughey B, Knowles TPJ, Vendruscolo M. Discovery of potent inhibitors of α-synuclein aggregation using structure-based iterative learning. Nat Chem Biol 2024; 20:634-645. [PMID: 38632492 PMCID: PMC11062903 DOI: 10.1038/s41589-024-01580-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2023] [Accepted: 02/12/2024] [Indexed: 04/19/2024]
Abstract
Machine learning methods hold the promise to reduce the costs and the failure rates of conventional drug discovery pipelines. This issue is especially pressing for neurodegenerative diseases, where the development of disease-modifying drugs has been particularly challenging. To address this problem, we describe here a machine learning approach to identify small molecule inhibitors of α-synuclein aggregation, a process implicated in Parkinson's disease and other synucleinopathies. Because the proliferation of α-synuclein aggregates takes place through autocatalytic secondary nucleation, we aim to identify compounds that bind the catalytic sites on the surface of the aggregates. To achieve this goal, we use structure-based machine learning in an iterative manner to first identify and then progressively optimize secondary nucleation inhibitors. Our results demonstrate that this approach leads to the facile identification of compounds two orders of magnitude more potent than previously reported ones.
Affiliation(s)
- Robert I Horne
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Ewa A Andrzejewska
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Parvez Alam
- Laboratory of Neurological Infections and Immunity, Rocky Mountain Laboratories, National Institute for Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Z Faidon Brotzakis
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Ankit Srivastava
- Laboratory of Neurological Infections and Immunity, Rocky Mountain Laboratories, National Institute for Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Alice Aubert
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Magdalena Nowinska
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Rebecca C Gregory
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Roxine Staats
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Andrea Possenti
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Sean Chia
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Bioprocessing Technology Institute, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Pietro Sormanni
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Bernardino Ghetti
- Department of Pathology and Laboratory Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Byron Caughey
- Laboratory of Neurological Infections and Immunity, Rocky Mountain Laboratories, National Institute for Allergy and Infectious Diseases, National Institutes of Health, Hamilton, MT, USA
- Tuomas P J Knowles
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Michele Vendruscolo
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
2
Seal S, Yang H, Trapotsi MA, Singh S, Carreras-Puigvert J, Spjuth O, Bender A. Merging bioactivity predictions from cell morphology and chemical fingerprint models using similarity to training data. J Cheminform 2023; 15:56. [PMID: 37268960 DOI: 10.1186/s13321-023-00723-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 02/05/2023] [Accepted: 04/20/2023] [Indexed: 06/04/2023]
Abstract
The applicability domain of machine learning models trained on structural fingerprints for the prediction of biological endpoints is often limited by the lack of diversity of chemical space of the training data. In this work, we developed similarity-based merger models which combined the outputs of individual models trained on cell morphology (based on Cell Painting) and chemical structure (based on chemical fingerprints) and the structural and morphological similarities of the compounds in the test dataset to compounds in the training dataset. We applied these similarity-based merger models using logistic regression models on the predictions and similarities as features and predicted assay hit calls of 177 assays from ChEMBL, PubChem and the Broad Institute (where the required Cell Painting annotations were available). We found that the similarity-based merger models outperformed other models with an additional 20% assays (79 out of 177 assays) with an AUC > 0.70 compared with 65 out of 177 assays using structural models and 50 out of 177 assays using Cell Painting models. Our results demonstrated that similarity-based merger models combining structure and cell morphology models can more accurately predict a wide range of biological assay outcomes and further expanded the applicability domain by better extrapolating to new structural and morphology spaces.
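The merger idea described in this abstract (a logistic regression fed with the two base-model predictions plus the two similarity-to-training-data values) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy feature values, the labels, and the plain gradient-descent fit are all hypothetical, and the sketch uses only the Python standard library.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def merger_features(p_struct, p_morph, sim_struct, sim_morph):
    """Assemble the four merger inputs for one compound (hypothetical layout):
    structure-model probability, morphology-model probability, and the
    compound's structural and morphological similarity to the training set."""
    return [p_struct, p_morph, sim_struct, sim_morph]

def log_loss(w, b, X, y):
    """Mean binary cross-entropy of the logistic model on (X, y)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total / len(X)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit logistic regression by batch gradient descent."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy, made-up training rows: one row per compound, label 1 = assay hit.
X = [
    merger_features(0.9, 0.4, 0.8, 0.3),
    merger_features(0.2, 0.7, 0.2, 0.9),
    merger_features(0.1, 0.2, 0.7, 0.6),
    merger_features(0.8, 0.9, 0.6, 0.7),
    merger_features(0.3, 0.1, 0.3, 0.2),
    merger_features(0.6, 0.8, 0.1, 0.8),
]
y = [1, 1, 0, 1, 0, 1]

w, b = fit_logistic(X, y)
merged = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]
```

Feeding the similarities in as features lets the fitted weights decide how far to trust each base model for a given compound, which is one plausible reading of how such a merger could extend the applicability domain.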
Affiliation(s)
- Srijit Seal
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Hongbin Yang
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Maria-Anna Trapotsi
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
- Satvik Singh
- Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge, Cambridge, UK
- Jordi Carreras-Puigvert
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Ola Spjuth
- Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Uppsala, Sweden
- Andreas Bender
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge, UK
3
Seal S, Yang H, Vollmers L, Bender A. Comparison of Cellular Morphological Descriptors and Molecular Fingerprints for the Prediction of Cytotoxicity- and Proliferation-Related Assays. Chem Res Toxicol 2021; 34:422-437. [PMID: 33522793 DOI: 10.1021/acs.chemrestox.0c00303] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Indexed: 12/22/2022]
Abstract
Cell morphology features, such as those from the Cell Painting assay, can be generated at relatively low costs and represent versatile biological descriptors of a system and thereby compound response. In this study, we explored cell morphology descriptors and molecular fingerprints, separately and in combination, for the prediction of cytotoxicity- and proliferation-related in vitro assay endpoints. We selected 135 compounds from the MoleculeNet ToxCast benchmark data set which were annotated with Cell Painting readouts, where the relatively small size of the data set is due to the overlap of required annotations. We trained Random Forest classification models using nested cross-validation and Cell Painting descriptors, Morgan and ErG fingerprints, and their combinations. While using leave-one-cluster-out cross-validation (with clusters based on physicochemical descriptors), models using Cell Painting descriptors achieved higher average performance over all assays (Balanced Accuracy of 0.65, Matthews Correlation Coefficient of 0.28, and AUC-ROC of 0.71) compared to models using ErG fingerprints (BA 0.55, MCC 0.09, and AUC-ROC 0.60) and Morgan fingerprints alone (BA 0.54, MCC 0.06, and AUC-ROC 0.56). While using random shuffle splits, the combination of Cell Painting descriptors with ErG and Morgan fingerprints further improved balanced accuracy on average by 8.9% (in 9 out of 12 assays) and 23.4% (in 8 out of 12 assays) compared to using only ErG and Morgan fingerprints, respectively. Regarding feature importance, Cell Painting descriptors related to nuclei texture, granularity of cells, and cytoplasm as well as cell neighbors and radial distributions were identified to be most contributing, which is plausible given the endpoint considered. 
We conclude that cell morphological descriptors contain complementary information to molecular fingerprints which can be used to improve the performance of predictive cytotoxicity models, in particular in areas of novel structural space.
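For readers unfamiliar with two of the metrics quoted in this abstract, the following minimal sketch computes balanced accuracy and the Matthews correlation coefficient from confusion-matrix counts using their standard definitions. The function names and the example counts are illustrative assumptions and are not taken from the study.

```python
import math

def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0

def matthews_corrcoef(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Made-up counts for a hypothetical binary assay classifier.
ba = balanced_accuracy(tp=8, fn=2, tn=6, fp=4)
mcc = matthews_corrcoef(tp=8, fn=2, tn=6, fp=4)
print(f"BA = {ba:.2f}, MCC = {mcc:.2f}")
```

Both metrics are less sensitive to class imbalance than plain accuracy, which is presumably why they are reported alongside AUC-ROC for small, unevenly labelled assay sets.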
Affiliation(s)
- Srijit Seal
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Hongbin Yang
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Luis Vollmers
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, United Kingdom
4
Nuñez JR, Mcgrady M, Yesiltepe Y, Renslow RS, Metz TO. Chespa: Streamlining Expansive Chemical Space Evaluation of Molecular Sets. J Chem Inf Model 2020; 60:6251-6257. [PMID: 33283505 DOI: 10.1021/acs.jcim.0c00899] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/29/2022]
Abstract
Thousands of chemical properties can be calculated for small molecules, which can be used to place the molecules within the context of a broader "chemical space." These definitions vary based on compounds of interest and the goals for the given chemical space definition. Here, we introduce a customizable Python module, chespa, built to easily assess different chemical space definitions through clustering of compounds in these spaces and visualizing trends of these clusters. To demonstrate this, chespa currently streamlines prediction of various molecular descriptors (predicted chemical properties, molecular substructures, AI-based chemical space, and chemical class ontology) in order to test six different chemical space definitions. Furthermore, we investigated how these varying definitions trend with mass spectrometry (MS)-based observability, that is, the ability of a molecule to be observed with MS (e.g., as a function of the molecule ionizability), using an example data set from the U.S. EPA's nontargeted analysis collaborative trial, where blinded samples had been analyzed previously, providing 1398 data points. Improved understanding of observability would offer many advantages in small-molecule identification, such as (i) a priori selection of experimental conditions based on suspected sample composition, (ii) the ability to reduce the number of candidate structures during compound identification by removing those less likely to ionize, and, in turn, (iii) a reduced false discovery rate and increased confidence in identifications. Factors controlling observability are not fully understood, making prediction of this property nontrivial and a prime candidate for chemical space analysis. Chespa is available at github.com/pnnl/chespa.
Affiliation(s)
- Jamie R Nuñez
- Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, United States; The Gene and Linda Voiland School of Chemical Engineering and Bioengineering, Washington State University, Pullman, Washington 99164, United States
- Monee Mcgrady
- Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
- Yasemin Yesiltepe
- Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, United States; The Gene and Linda Voiland School of Chemical Engineering and Bioengineering, Washington State University, Pullman, Washington 99164, United States
- Ryan S Renslow
- Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, United States; The Gene and Linda Voiland School of Chemical Engineering and Bioengineering, Washington State University, Pullman, Washington 99164, United States
- Thomas O Metz
- Earth and Biological Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352, United States
5
Abstract
The prediction of toxicological endpoints has gained broad acceptance; it is widely applied in the early stages of drug discovery as well as to impurities arising in the production of generic or equivalent products. In this work, we describe methodologies for predicting the toxicological endpoints of compounds, with a particular focus on secondary metabolites. Case studies include toxicity prediction for natural compound databases with anti-diabetic, anti-malarial and anti-HIV properties.
6
Allen CHG, Mervin LH, Mahmoud SY, Bender A. Leveraging heterogeneous data from GHS toxicity annotations, molecular and protein target descriptors and Tox21 assay readouts to predict and rationalise acute toxicity. J Cheminform 2019; 11:36. [PMID: 31152262 PMCID: PMC6544914 DOI: 10.1186/s13321-019-0356-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 12/03/2018] [Accepted: 05/15/2019] [Indexed: 01/06/2023]
Abstract
Despite the increasing knowledge in both the chemical and biological domains, the assimilation and exploration of heterogeneous datasets, encoding information about the chemical, bioactivity and phenotypic properties of compounds, remains a challenge due to the requirement for overlap between chemicals assayed across the spaces. Here, we have constructed a novel dataset, larger than we have used in prior work, comprising 579 acute oral toxic compounds and 1427 non-toxic compounds derived from regulatory GHS information, along with their corresponding molecular and protein target descriptors and qHTS in vitro assay readouts from the Tox21 project. We found no clear association between the results of a FAFDrugs4 toxicophore screen and the acute oral toxicity classifications for our compound set; and a screen using a subset of the ToxAlerts toxicophores was also of limited utility, with only slight enrichment toward the toxic set (odds ratio of 1.48). We then investigated to what degree toxic and non-toxic compounds could be separated in each of the spaces, to compare their potential contribution to further analyses. Using an LDA projection, we found the largest degree of separation using chemical descriptors (Cohen’s d of 1.95) and the lowest degree of separation between toxicity classes using qHTS descriptors (Cohen’s d of 0.67). To compare the predictivity of the feature spaces for the toxicity endpoint, we next trained Random Forest (RF) acute oral toxicity classifiers on either molecular, protein target or qHTS descriptors. RFs trained on molecular and protein target descriptors were most predictive, with ROC AUC values of 0.80–0.92 and 0.70–0.85, respectively, across three test sets. RFs trained on both chemical and protein target descriptors combined exhibited similar predictive performance to the single-domain models (ROC AUC of 0.80–0.91). 
Model interpretability was improved by the inclusion of protein target descriptors, which allow the identification of specific targets (e.g. Retinal dehydrogenase) with literature links to toxic modes of action (e.g. oxidative stress). The dataset compiled in this study has been made available for future application.
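The class-separation figures quoted in this abstract are Cohen's d values. As a minimal sketch, assuming each compound has already been reduced to a single projected score (for example along an LDA axis, a step not shown here), the effect size can be computed with the pooled standard deviation as below. The function name and the sample scores are hypothetical and are not data from the study.

```python
import math
import statistics

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd

# Made-up one-dimensional projected scores for two toxicity classes.
toxic_scores = [2.0, 4.0, 6.0]
nontoxic_scores = [1.0, 2.0, 3.0]
d = cohens_d(toxic_scores, nontoxic_scores)
print(f"Cohen's d = {d:.2f}")
```

A larger d means the two class distributions overlap less along the projected axis, which is how the abstract ranks the chemical and qHTS descriptor spaces.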
Affiliation(s)
- Chad H G Allen
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, CB2 1EW, UK
- Lewis H Mervin
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, CB2 1EW, UK
- Samar Y Mahmoud
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, CB2 1EW, UK
- Andreas Bender
- Department of Chemistry, Centre for Molecular Informatics, Lensfield Road, Cambridge, CB2 1EW, UK
7
Yin Z, Ai H, Zhang L, Ren G, Wang Y, Zhao Q, Liu H. Predicting the cytotoxicity of chemicals using ensemble learning methods and molecular fingerprints. J Appl Toxicol 2019; 39:1366-1377. [PMID: 30763981 DOI: 10.1002/jat.3785] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 08/22/2018] [Revised: 01/14/2019] [Accepted: 01/14/2019] [Indexed: 12/12/2022]
Abstract
The prediction of compound cytotoxicity is an important part of the drug discovery process. However, predictive performance is usually poor because the datasets come from high-throughput screens and suffer from class imbalance. In this study, several strategies for performing a structure-activity relationship study for a cytotoxic endpoint in the AID364 dataset were explored to solve the class-imbalance problem. Random forest adaboost was used as the base learners for 10 types of molecular fingerprints, and an ensemble method and six data-balancing methods were applied to balance the classes. As a result, the ensemble model using the MACCS fingerprint was found to be the best, giving an area under the curve of 85.2% ± 0.35%, sensitivity of 81.8% ± 0.65%, and specificity of 76.0% ± 0.12% in fivefold cross-validation, and an area under the curve of 78.8%, sensitivity of 55.5% and specificity of 78.5% in external validation. Good performance was also obtained on other datasets with different sizes and degrees of imbalance. To explore the structural commonality of cytotoxic compounds, several substructures were identified as an important reference for substructure alerts. The results indicate that the proposed models are helpful in predicting the cytotoxicity of chemicals.
Affiliation(s)
- Zimo Yin
- School of Information, Liaoning University, Shenyang, 110036, China
- Haixin Ai
- School of Life Science, Liaoning University, Shenyang, 110036, China; Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province, Shenyang, 110036, China; Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
- Li Zhang
- School of Life Science, Liaoning University, Shenyang, 110036, China; Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province, Shenyang, 110036, China; Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
- Guofei Ren
- School of Information, Liaoning University, Shenyang, 110036, China
- Yuming Wang
- Department of Breast Surgery, The First Hospital of China Medical University, Shenyang, Liaoning, 110001, China
- Qi Zhao
- School of Mathematics, Liaoning University, Shenyang, 110036, China
- Hongsheng Liu
- School of Life Science, Liaoning University, Shenyang, 110036, China; Research Center for Computer Simulating and Information Processing of Bio-macromolecules of Liaoning Province, Shenyang, 110036, China; Engineering Laboratory for Molecular Simulation and Designing of Drug Molecules of Liaoning, Shenyang, 110036, China
8
Svensson F, Norinder U, Bender A. Modelling compound cytotoxicity using conformal prediction and PubChem HTS data. Toxicol Res (Camb) 2017; 6:73-80. [PMID: 30090478 PMCID: PMC6061930 DOI: 10.1039/c6tx00252h] [Citation(s) in RCA: 37] [Impact Index Per Article: 4.6] [Received: 06/10/2016] [Accepted: 10/28/2016] [Indexed: 12/28/2022]
Abstract
The assessment of compound cytotoxicity is an important part of the drug discovery process. Accurate predictions of cytotoxicity have the potential to expedite decision making and save considerable time and effort. In this work we apply class conditional conformal prediction to model the cytotoxicity of compounds based on 16 high throughput cytotoxicity assays from PubChem. The data span 16 cell lines and comprise more than 440 000 unique compounds. The data sets are heavily imbalanced with only 0.8% of the tested compounds being cytotoxic. We trained one classification model for each cell line and validated the performance with respect to validity and accuracy. The generated models deliver high quality predictions for both toxic and non-toxic compounds despite the imbalance between the two classes. On external data collected from the same assay provider as one of the investigated cell lines the model had a sensitivity of 74% and a specificity of 65% at the 80% confidence level among the compounds assigned to a single class. Compared to previous approaches for large scale cytotoxicity modelling, this represents a balanced performance in the prediction of the toxic and non-toxic classes. The conformal prediction framework also allows the modeller to control the error frequency of the predictions, allowing predictions of cytotoxicity outcomes with confidence.
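Class-conditional (Mondrian) conformal prediction, the framework named in this abstract, can be illustrated with a small inductive sketch. It assumes an underlying classifier that outputs a probability per class and uses `1 - probability` as the nonconformity score; that choice of score, the function names, and the calibration values are illustrative assumptions rather than details from the paper.

```python
def p_value(calibration_scores, test_score):
    """Fraction of calibration nonconformity scores at least as large as the
    test score, with the usual +1 correction for the test example itself."""
    n_greater_equal = sum(1 for s in calibration_scores if s >= test_score)
    return (n_greater_equal + 1) / (len(calibration_scores) + 1)

def prediction_set(calibration_by_class, class_probs, significance):
    """Return every class whose class-specific p-value exceeds the significance level.

    calibration_by_class: {label: nonconformity scores of calibration
                           examples whose true label is `label`}
    class_probs:          {label: model probability for the test compound}
    """
    labels = set()
    for label, prob in class_probs.items():
        score = 1.0 - prob  # assumed nonconformity measure
        if p_value(calibration_by_class[label], score) > significance:
            labels.add(label)
    return labels

# Made-up calibration scores, kept separate per class (0 = non-toxic, 1 = toxic).
calibration = {
    0: [0.1, 0.2, 0.3, 0.4],
    1: [0.2, 0.5, 0.6, 0.9],
}
# Hypothetical model output for one test compound.
probs = {0: 0.3, 1: 0.7}

at_80 = prediction_set(calibration, probs, significance=0.2)   # 80% confidence
at_95 = prediction_set(calibration, probs, significance=0.05)  # 95% confidence
```

Because each class is calibrated only against its own examples, the rare toxic class gets its own error control, which is the property the abstract relies on for heavily imbalanced data. A compound can receive one label, both labels, or none; the reported sensitivity and specificity refer to the single-label predictions.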
Affiliation(s)
- Fredrik Svensson
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK
- Ulf Norinder
- Swedish Toxicology Sciences Research Center, SE-151 36 Södertälje, Sweden
- Dept. Computer and Systems Sciences, Stockholm Univ., Box 7003, SE-164 07 Kista, Sweden
- Andreas Bender
- Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK