Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Škuta C, Cortés-Ciriano I, Dehaen W, Kříž P, van Westen GJP, Tetko IV, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping. J Cheminform 2020;12:39. [PMID: 33431038 PMCID: PMC7260783 DOI: 10.1186/s13321-020-00443-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 02/11/2023] Open

Cited by Other Article(s)

Snyder SH, Vignaux PA, Ozalp MK, Gerlach J, Puhl AC, Lane TR, Corbett J, Urbina F, Ekins S. The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications. Commun Chem 2024;7:134. [PMID: 38866916 PMCID: PMC11169557 DOI: 10.1038/s42004-024-01220-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 06/04/2024] [Indexed: 06/14/2024] Open

Vogt M. Chemoinformatic approaches for navigating large chemical spaces. Expert Opin Drug Discov 2024;19:403-414. [PMID: 38300511 DOI: 10.1080/17460441.2024.2313475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Accepted: 01/30/2024] [Indexed: 02/02/2024]

Carracedo-Reboredo P, Aranzamendi E, He S, Arrasate S, Munteanu CR, Fernandez-Lozano C, Sotomayor N, Lete E, González-Díaz H. MATEO: intermolecular α-amidoalkylation theoretical enantioselectivity optimization. Online tool for selection and design of chiral catalysts and products. J Cheminform 2024;16:9. [PMID: 38254200 PMCID: PMC10804835 DOI: 10.1186/s13321-024-00802-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2023] [Accepted: 01/11/2024] [Indexed: 01/24/2024] Open

Affiliation(s)

Paula Carracedo-Reboredo: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain; Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain.
Eider Aranzamendi: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
Shan He: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain; IKERDATA S.L., ZITEK, University of Basque Country UPVEHU, Rectorate Building, 48940, Leioa, Spain.
Sonia Arrasate: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
Cristian R Munteanu: Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain.
Carlos Fernandez-Lozano: Department of Computer Science and Information Technologies, Faculty of Computer Science, CITIC-Research Center of Information and Communication Technologies, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain.
Nuria Sotomayor: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
Esther Lete: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain.
Humberto González-Díaz: Department of Organic and Inorganic Chemistry, Faculty of Science and Technology, University of The Basque Country (UPV/EHU), P.O. Box 644, 48080, Bilbao, Spain; IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain.

Karabulut S, Kaur H, Gauld JW. Applications and Potential of In Silico Approaches for Psychedelic Chemistry. Molecules 2023;28:5966. [PMID: 37630218 PMCID: PMC10459288 DOI: 10.3390/molecules28165966] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 08/01/2023] [Accepted: 08/07/2023] [Indexed: 08/27/2023] Open

Srisongkram T, Khamtang P, Weerapreeyakul N. Prediction of KRAS^G12C inhibitors using conjoint fingerprint and machine learning-based QSAR models. J Mol Graph Model 2023;122:108466. [PMID: 37058997 DOI: 10.1016/j.jmgm.2023.108466] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/30/2022] [Revised: 03/19/2023] [Accepted: 03/29/2023] [Indexed: 04/16/2023]

Abstract

Kirsten rat sarcoma virus G12C (KRAS^G12C) is the major protein mutation associated with non-small cell lung cancer (NSCLC) severity. Inhibiting KRAS^G12C is therefore one of the key therapeutic strategies for NSCLC patients. In this paper, a cost-effective, data-driven drug design approach employing machine learning-based quantitative structure-activity relationship (QSAR) analysis was built for predicting ligand affinities against the KRAS^G12C protein. A curated and non-redundant dataset of 1033 compounds with KRAS^G12C inhibitory activity (pIC₅₀) was used to build and test the models. The PubChem fingerprint, Substructure fingerprint, Substructure fingerprint count, and the conjoint fingerprint (a combination of PubChem fingerprint and Substructure fingerprint count) were used to train the models. Using comprehensive validation methods and various machine learning algorithms, the results clearly showed that the XGBoost regression (XGBoost) achieved the highest performance in terms of goodness of fit, predictivity, generalizability and model robustness (R² = 0.81, Q²_CV = 0.60, Q²_Ext = 0.62, R² - Q²_Ext = 0.19, R²_Y-Random = 0.31 ± 0.03, Q²_Y-Random = -0.09 ± 0.04). The top 13 molecular fingerprints that correlated with the predicted pIC₅₀ values were SubFPC274 (aromatic atoms), SubFPC307 (number of chiral centers), PubChemFP37 (≥1 chlorine), SubFPC18 (number of alkylarylethers), SubFPC1 (number of primary carbons), SubFPC300 (number of 1,3-tautomerizables), PubChemFP621 (N-C:C:C:N structure), PubChemFP23 (≥1 fluorine), SubFPC2 (number of secondary carbons), SubFPC295 (number of C-ONS bonds), PubChemFP199 (≥4 6-membered rings), PubChemFP180 (≥1 nitrogen-containing 6-membered ring), and SubFPC180 (number of tertiary amines). These molecular fingerprints were virtualized and validated using molecular docking experiments. In conclusion, this conjoint fingerprint and XGBoost-QSAR model proved useful as a high-throughput screening tool for KRAS^G12C inhibitor identification and drug design.

McNair D. Artificial Intelligence and Machine Learning for Lead-to-Candidate Decision-Making and Beyond. Annu Rev Pharmacol Toxicol 2023;63:77-97. [PMID: 35679624 DOI: 10.1146/annurev-pharmtox-051921-023255] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Indexed: 01/25/2023]

Muegge I, Hu Y. How do we further enhance 2D fingerprint similarity searching for novel drug discovery? Expert Opin Drug Discov 2022;17:1173-1176. [PMID: 36150044 DOI: 10.1080/17460441.2022.2128332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/11/2023]

Yang J, Cai Y, Zhao K, Xie H, Chen X. Concepts and applications of chemical fingerprint for hit and lead screening. Drug Discov Today 2022;27:103356. [PMID: 36113834 DOI: 10.1016/j.drudis.2022.103356] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 12/07/2021] [Revised: 07/28/2022] [Accepted: 09/08/2022] [Indexed: 11/22/2022]

Weyrich A, Joel M, Lewin G, Hofmann T, Frericks M. Review of the state of science and evaluation of currently available in silico prediction models for reproductive and developmental toxicity: A case study on pesticides. Birth Defects Res 2022;114:812-842. [PMID: 35748219 PMCID: PMC9545887 DOI: 10.1002/bdr2.2062] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/11/2022] [Revised: 05/10/2022] [Accepted: 05/28/2022] [Indexed: 11/06/2022]

Morger A, Garcia de Lomana M, Norinder U, Svensson F, Kirchmair J, Mathea M, Volkamer A. Studying and mitigating the effects of data drifts on ML model performance at the example of chemical toxicity data. Sci Rep 2022;12:7244. [PMID: 35508546 PMCID: PMC9068909 DOI: 10.1038/s41598-022-09309-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/28/2021] [Accepted: 03/17/2022] [Indexed: 11/09/2022] Open

Abstract

Machine learning models are widely applied to predict molecular properties or the biological activity of small molecules on a specific protein. Models can be integrated in a conformal prediction (CP) framework which adds a calibration step to estimate the confidence of the predictions. CP models present the advantage of ensuring a predefined error rate under the assumption that test and calibration set are exchangeable. In cases where the test data have drifted away from the descriptor space of the training data, or where assay setups have changed, this assumption might not be fulfilled and the models are not guaranteed to be valid. In this study, the performance of internally valid CP models when applied to either newer time-split data or to external data was evaluated. In detail, temporal data drifts were analysed based on twelve datasets from the ChEMBL database. In addition, discrepancies between models trained on publicly-available data and applied to proprietary data for the liver toxicity and MNT in vivo endpoints were investigated. In most cases, a drastic decrease in the validity of the models was observed when applied to the time-split or external (holdout) test sets. To overcome the decrease in model validity, a strategy for updating the calibration set with data more similar to the holdout set was investigated. Updating the calibration set generally improved the validity, restoring it completely to its expected value in many cases. The restored validity is the first requisite for applying the CP models with confidence. However, the increased validity comes at the cost of a decrease in model efficiency, as more predictions are identified as inconclusive. This study presents a strategy to recalibrate CP models to mitigate the effects of data drifts. Updating the calibration sets without having to retrain the model has proven to be a useful approach to restore the validity of most models.
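
The recalibration idea in this abstract can be illustrated with a minimal sketch of class-conditional inductive conformal prediction. This is not the authors' code: the nonconformity score (1 minus the model's class probability), the function names, and all numbers are assumptions chosen for illustration. The point it shows is that the underlying model stays fixed while only the calibration scores are swapped.

```python
def p_value(cal_scores, test_score):
    """Conformal p-value: share of calibration nonconformity scores
    that are at least as large as the test score (with +1 smoothing)."""
    at_least = sum(1 for s in cal_scores if s >= test_score)
    return (at_least + 1) / (len(cal_scores) + 1)


def prediction_set(probs, cal_scores_by_class, epsilon):
    """Return every label that cannot be rejected at significance epsilon.

    probs: dict mapping label -> model probability for one test compound.
    cal_scores_by_class: dict mapping label -> calibration nonconformity scores.
    Nonconformity is assumed to be 1 - probability (an illustrative choice).
    """
    labels = set()
    for label, prob in probs.items():
        score = 1.0 - prob
        if p_value(cal_scores_by_class[label], score) > epsilon:
            labels.add(label)
    return labels


# Hypothetical calibration scores: the original set, and an updated set drawn
# from data closer to the drifted holdout set (the model is NOT retrained).
old_cal = {0: [0.1, 0.2, 0.3, 0.4], 1: [0.1, 0.2, 0.3, 0.4]}
new_cal = {0: [0.3, 0.5, 0.6, 0.7], 1: [0.3, 0.5, 0.6, 0.7]}

probs = {0: 0.55, 1: 0.45}  # an uncertain prediction on drifted data
before = prediction_set(probs, old_cal, epsilon=0.2)
after = prediction_set(probs, new_cal, epsilon=0.2)
```

With these toy numbers the original calibration set rejects both labels, while the updated one keeps both, which mirrors the trade-off described in the abstract: validity can be restored, but more predictions become inconclusive.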

Ligand-based approaches to activity prediction for the early stage of structure–activity–relationship progression. J Comput Aided Mol Des 2022;36:237-252. [DOI: 10.1007/s10822-022-00449-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2021] [Accepted: 03/07/2022] [Indexed: 11/27/2022]

Prediction of the Neurotoxic Potential of Chemicals Based on Modelling of Molecular Initiating Events Upstream of the Adverse Outcome Pathways of (Developmental) Neurotoxicity. Int J Mol Sci 2022;23:ijms23063053. [PMID: 35328472 PMCID: PMC8954925 DOI: 10.3390/ijms23063053] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 02/08/2022] [Revised: 03/07/2022] [Accepted: 03/08/2022] [Indexed: 12/23/2022] Open

Lee SY, Lee ST, Suh S, Ko BJ, Oh HB. Revealing Unknown Controlled Substances and New Psychoactive Substances Using High-Resolution LC-MS/MS Machine Learning Models and the Hybrid Similarity Search Algorithm. J Anal Toxicol 2021;46:732-742. [PMID: 34498039 DOI: 10.1093/jat/bkab098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2021] [Revised: 08/11/2021] [Accepted: 09/08/2021] [Indexed: 11/12/2022] Open

Shen WX, Zeng X, Zhu F, Wang YL, Qin C, Tan Y, Jiang YY, Chen YZ. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nat Mach Intell 2021. [DOI: 10.1038/s42256-021-00301-6] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 01/04/2023]

Rácz A, Bajusz D, Héberger K. Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification. Molecules 2021;26:1111. [PMID: 33669834 PMCID: PMC7922354 DOI: 10.3390/molecules26041111] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 12/23/2020] [Revised: 02/04/2021] [Accepted: 02/16/2021] [Indexed: 01/04/2023] Open

Čmelo I, Voršilák M, Svozil D. Profiling and analysis of chemical compounds using pointwise mutual information. J Cheminform 2021;13:3. [PMID: 33423694 PMCID: PMC7798221 DOI: 10.1186/s13321-020-00483-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/05/2020] [Accepted: 12/24/2020] [Indexed: 12/21/2022] Open

Abstract

Pointwise mutual information (PMI) is a measure of association used in information theory. In this paper, PMI is used to characterize several publicly available databases (DrugBank, ChEMBL, PubChem and ZINC) in terms of association strength between compound structural features, resulting in database PMI interrelation profiles. As structural features, substructure fragments obtained by coding individual compounds as MACCS, PubChemKey and ECFP fingerprints are used. The analysis of publicly available databases reveals, in accord with other studies, unusual properties of DrugBank compounds, which further confirms the validity of the PMI profiling approach. Z-standardized relative feature tightness (ZRFT), a PMI-derived measure that quantifies how well the given compound's feature combinations fit those in a particular compound set, is applied for the analysis of compound synthetic accessibility (SA), as well as for the classification of compounds as easy (ES) and hard (HS) to synthesize. ZRFT value distributions are compared with those of SYBA and SAScore. The analysis of ZRFT values of structurally complex compounds in the SAVI database reveals oligopeptide structures that are mispredicted by SAScore as HS, while correctly predicted by ZRFT and SYBA as ES. Compared to SAScore, SYBA and random forest, ZRFT predictions are less accurate, though by a narrow margin (AccZRFT = 94.5%, AccSYBA = 98.8%, AccSAScore = 99.0%, AccRF = 97.3%). However, the ability of ZRFT to distinguish between ES and HS compounds is surprisingly high considering that while SYBA, SAScore and random forest are dedicated SA models, ZRFT is a generic measurement that merely quantifies the strength of interrelations between structural feature pairs. The results presented in the current work indicate that structural feature co-occurrence, quantified by PMI or ZRFT, contains a significant amount of information relevant to physico-chemical properties of organic compounds.
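
As a rough illustration of the quantity this abstract builds on, the sketch below computes PMI for co-occurring feature pairs, using the standard definition PMI(a, b) = log2(p(a, b) / (p(a) p(b))). It is not the paper's implementation: the function name, the representation of a fingerprint as a set of on-bit labels, and the toy data are assumptions, and ZRFT itself is not reproduced here.

```python
import math
from itertools import combinations


def pmi_table(fingerprints):
    """Pointwise mutual information for every feature pair that co-occurs.

    fingerprints: list of sets, each holding the features (on-bits) of one
    compound. Probabilities are estimated as simple frequencies over the set.
    """
    n = len(fingerprints)
    single = {}  # feature -> number of compounds containing it
    pair = {}    # (feature_a, feature_b) -> number of compounds containing both
    for fp in fingerprints:
        for feature in fp:
            single[feature] = single.get(feature, 0) + 1
        for a, b in combinations(sorted(fp), 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return {
        (a, b): math.log2((count / n) / ((single[a] / n) * (single[b] / n)))
        for (a, b), count in pair.items()
    }


# Toy "database" of four compounds with hypothetical feature labels.
toy_db = [{"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
pmi = pmi_table(toy_db)
```

Here features A and B co-occur more often than independence would predict, so their PMI is positive; pairs that never co-occur (such as A and C) are simply absent from the table rather than assigned negative infinity.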

Wan Z, Wang QD, Liu D, Liang J. Data-driven machine learning model for the prediction of oxygen vacancy formation energy of metal oxide materials. Phys Chem Chem Phys 2021;23:15675-15684. [PMID: 34269780 DOI: 10.1039/d1cp02066h] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/21/2022]

Tetko IV, Engkvist O. From Big Data to Artificial Intelligence: chemoinformatics meets new challenges. J Cheminform 2020;12:74. [PMID: 33339533 PMCID: PMC7747384 DOI: 10.1186/s13321-020-00475-y] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 11/18/2020] [Accepted: 11/18/2020] [Indexed: 12/17/2022] Open

Lane TR, Foil DH, Minerali E, Urbina F, Zorn KM, Ekins S. Bioactivity Comparison across Multiple Machine Learning Algorithms Using over 5000 Datasets for Drug Discovery. Mol Pharm 2020;18:403-415. [PMID: 33325717 DOI: 10.1021/acs.molpharmaceut.0c01013] [Citation(s) in RCA: 21] [Impact Index Per Article: 5.3] [Indexed: 12/14/2022]

Abstract

Machine learning methods are attracting considerable attention from the pharmaceutical industry for use in drug discovery and applications beyond. In recent studies, we and others have applied multiple machine learning algorithms and modeling metrics and, in some cases, compared molecular descriptors to build models for individual targets or properties on a relatively small scale. Several research groups have used large numbers of datasets from public databases such as ChEMBL in order to evaluate machine learning methods of interest to them. The largest of these types of studies used on the order of 1400 datasets. We have now extracted well over 5000 datasets from CHEMBL for use with the ECFP6 fingerprint and in comparison of our proprietary software Assay Central with random forest, k-nearest neighbors, support vector classification, naïve Bayesian, AdaBoosted decision trees, and deep neural networks (three layers). Model performance was assessed using an array of fivefold cross-validation metrics including area-under-the-curve, F1 score, Cohen's kappa, and Matthews correlation coefficient. Based on ranked normalized scores for the metrics or datasets, all methods appeared comparable, while the distance from the top indicated that Assay Central and support vector classification were comparable. Unlike prior studies which have placed considerable emphasis on deep neural networks (deep learning), no advantage was seen in this case. If anything, Assay Central may have been at a slight advantage as the activity cutoff for each of the over 5000 datasets representing over 570,000 unique compounds was based on Assay Central performance, although support vector classification seems to be a strong competitor. We also applied Assay Central to perform prospective predictions for the toxicity targets PXR and hERG to further validate these models. This work appears to be the largest scale comparison of these machine learning algorithms to date. 
Future studies will likely evaluate additional databases, descriptors, and machine learning algorithms and further refine the methods for evaluating and comparing such models.

Hsieh JH, Sedykh A, Mutlu E, Germolec DR, Auerbach SS, Rider CV. Harnessing In Silico, In Vitro, and In Vivo Data to Understand the Toxicity Landscape of Polycyclic Aromatic Compounds (PACs). Chem Res Toxicol 2020;34:268-285. [PMID: 33063992 DOI: 10.1021/acs.chemrestox.0c00213] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 01/13/2023]

Abstract

Polycyclic aromatic compounds (PACs) are compounds with a minimum of two six-atom aromatic fused rings. PACs arise from incomplete combustion or thermal decomposition of organic matter and are ubiquitous in the environment. Within PACs, carcinogenicity is generally regarded to be the most important public health concern. However, toxicity in other systems (reproductive and developmental toxicity, immunotoxicity) has also been reported. Despite the large number of PACs identified in the environment, research attention to understand exposure and health effects of PACs has focused on a relatively limited subset, namely polycyclic aromatic hydrocarbons (PAHs), the PACs with only carbon and hydrogen atoms. To triage the rest of the vast number of PACs for more resource-intensive testing, we developed a data-driven approach to contextualize hazard characterization of PACs, by leveraging the available data from various data streams (in silico toxicity, in vitro activity, structural fingerprints, and in vivo data availability). The PACs were clustered on the basis of their in silico toxicity profiles containing predictions from 8 different categories (carcinogenicity, cardiotoxicity, developmental toxicity, genotoxicity, hepatotoxicity, neurotoxicity, reproductive toxicity, and urinary toxicity). We found that PACs with the same parent structure (e.g., fluorene) could have diverse in silico toxicity profiles. In contrast, PACs with similar substituted groups (e.g., alkylated-PAHs) or heterocyclics (e.g., N-PACs) with varying ring sizes could have similar in silico toxicity profiles, suggesting that these groups are better candidates for toxicity read-across analysis. The clusters/regions associated with certain in silico toxicity, in vitro activity, and structural fingerprints were identified. 
We found that genotoxicity/carcinogenicity (in silico toxicity) and xenobiotic homeostasis and stress response (in vitro activity), respectively, dominate the toxicity/activity variation seen in the PACs. The "hot spots" with enriched toxicity/activity in conjunction with availability of in vivo carcinogenicity data revealed regions of either data-poor (hydroxylated-PAHs) or data-rich (unsubstituted, parent PAHs) PACs. These regions offer potential targets for prioritization of further in vivo assessment and for chemical read-across efforts. The analysis results are searchable through an interactive web application (https://ntp.niehs.nih.gov/go/pacs_tableau), allowing for alternative hypothesis generation.

Cortés-Ciriano I, Škuta C, Bender A, Svozil D. QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction. J Cheminform 2020;12:41. [PMID: 33431016 PMCID: PMC7339533 DOI: 10.1186/s13321-020-00444-5] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/07/2019] [Accepted: 05/16/2020] [Indexed: 01/22/2023] Open

Abstract

Affinity fingerprints report the activity of small molecules across a set of assays, and thus permit to gather information about the bioactivities of structurally dissimilar compounds, where models based on chemical structure alone are often limited, and model complex biological endpoints, such as human toxicity and in vitro cancer cell line sensitivity. Here, we propose to model in vitro compound activity using computationally predicted bioactivity profiles as compound descriptors. To this aim, we apply and validate a framework for the calculation of QSAR-derived affinity fingerprints (QAFFP) using a set of 1360 QSAR models generated using K_i, K_d, IC₅₀ and EC₅₀ data from ChEMBL database. QAFFP thus represent a method to encode and relate compounds on the basis of their similarity in bioactivity space. To benchmark the predictive power of QAFFP we assembled IC₅₀ data from ChEMBL database for 18 diverse cancer cell lines widely used in preclinical drug discovery, and 25 diverse protein target data sets. This study complements part 1 where the performance of QAFFP in similarity searching, scaffold hopping, and bioactivity classification is evaluated. Despite being inherently noisy, we show that using QAFFP as descriptors leads to errors in prediction on the test set in the ~ 0.65-0.95 pIC₅₀ units range, which are comparable to the estimated uncertainty of bioactivity data in ChEMBL (0.76-1.00 pIC₅₀ units). We find that the predictive power of QAFFP is slightly worse than that of Morgan2 fingerprints and 1D and 2D physicochemical descriptors, with an effect size in the 0.02-0.08 pIC₅₀ units range. Including QSAR models with low predictive power in the generation of QAFFP does not lead to improved predictive power. Given that the QSAR models we used to compute the QAFFP were selected on the basis of data availability alone, we anticipate better modeling results for QAFFP generated using more diverse and biologically meaningful targets. 
Data sets and Python code are publicly available at https://github.com/isidroc/QAFFP_regression.
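
The core idea of an affinity fingerprint, as described in this abstract, is that a compound is encoded by the vector of activities predicted for it by a panel of QSAR models, so that compounds can be compared in bioactivity space. The sketch below is only a toy illustration under stated assumptions: the three "models" are made-up functions standing in for trained QSAR models, the descriptor names are hypothetical, and cosine similarity is an illustrative choice rather than the measure used by the authors (their code is in the repository linked above).

```python
import math


def affinity_fingerprint(compound, models):
    """Encode a compound as the list of activities predicted by each model."""
    return [model(compound) for model in models]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for trained QSAR models: each maps a compound
# (here a dict of made-up descriptors) to a predicted pIC50-like value.
toy_models = [
    lambda c: 5.0 + 0.5 * c["rings"],
    lambda c: 6.0 - 0.2 * c["hbd"],
    lambda c: 4.0 + 0.1 * c["rings"] * c["hbd"],
]

compound_a = {"rings": 2, "hbd": 1}
compound_b = {"rings": 2, "hbd": 1}  # same descriptors as compound_a
compound_c = {"rings": 0, "hbd": 5}

fp_a = affinity_fingerprint(compound_a, toy_models)
fp_b = affinity_fingerprint(compound_b, toy_models)
fp_c = affinity_fingerprint(compound_c, toy_models)
sim_ab = cosine_similarity(fp_a, fp_b)
sim_ac = cosine_similarity(fp_a, fp_c)
```

In a realistic setting the panel would contain many more models (the abstract reports 1360), and the resulting vectors would be passed to a downstream regressor rather than compared pairwise.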