Reference Citation Analysis

Total Articles: 55 (from Reference Citation Analysis)

Article PDFs (24)

Cited by > 0 (48)

Searched Name: Pedro J Ballester

Number	Citation Analysis
1	Comprehensive machine learning boosts structure-based virtual screening for PARP1 inhibitors. J Cheminform 2024;16:40. [PMID: 38582911 PMCID: PMC10999096 DOI: 10.1186/s13321-024-00832-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2023] [Accepted: 03/23/2024] [Indexed: 04/08/2024]
	Abstract: Poly ADP-ribose polymerase 1 (PARP1) is an attractive therapeutic target for cancer treatment. Machine-learning scoring functions constitute a promising approach to discovering novel PARP1 inhibitors. Cutting-edge PARP1-specific machine-learning scoring functions were investigated using semi-synthetic training data from docking activity-labelled molecules: known PARP1 inhibitors, hard-to-discriminate decoys property-matched to them with generative graph neural networks, and confirmed inactives. We further made test sets harder by including only molecules dissimilar to those in the training set. Comprehensive analysis of these datasets using five supervised learning algorithms, protein-ligand fingerprints extracted from docking poses, and ligand-only features revealed one highly predictive scoring function. This is the PARP1-specific support vector machine-based regressor, when employing PLEC fingerprints, which achieved a high Normalized Enrichment Factor at the top 1% on the hardest test set (NEF1% = 0.588, median of 10 repetitions), and was more predictive than any other investigated scoring function, especially the classical scoring function employed as baseline.
	Key Words: Machine learning scoring functions; Molecular docking; PARP1 inhibitors; Structure-based virtual screening; Target-specific scoring functions
2	Inactive-enriched machine-learning models exploiting patent data improve structure-based virtual screening for PDL1 dimerizers. J Adv Res 2024:S2090-1232(24)00037-7. [PMID: 38280715 DOI: 10.1016/j.jare.2024.01.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Revised: 12/01/2023] [Accepted: 01/21/2024] [Indexed: 01/29/2024]
	Abstract: INTRODUCTION: Small-molecule Programmed Cell Death Protein 1/Programmed Death-Ligand 1 (PD1/PDL1) inhibition via PDL1 dimerization has the potential to lead to inexpensive drugs with better cancer patient outcomes and milder side effects. However, this therapeutic approach has proven challenging, with only one PDL1 dimerizer reaching early clinical trials so far. There is hence a need for fast and accurate methods to develop alternative PDL1 dimerizers. OBJECTIVES: We aim to show that structure-based virtual screening (SBVS) based on PDL1-specific machine-learning (ML) scoring functions (SFs) is a powerful drug design tool for detecting PD1/PDL1 inhibitors via PDL1 dimerization. METHODS: By incorporating the latest MLSF advances, we generated and evaluated PDL1-specific MLSFs (classifiers and inactive-enriched regressors) on two demanding test sets. RESULTS: 60 PDL1-specific MLSFs (30 classifiers and 30 regressors) were generated. Our large-scale analysis provides highly predictive PDL1-specific MLSFs that benefitted from training with large volumes of docked inactives and enabling inactive-enriched regression. CONCLUSION: PDL1-specific MLSFs strongly outperformed generic SFs of various types on this target and are released here without restrictions.
	Key Words: Artificial intelligence; Docking; Immunotherapy; Machine learning; PD1; PDL1; Virtual screening
3	Large-Scale Machine Learning Analysis Reveals DNA Methylation and Gene Expression Response Signatures for Gemcitabine-Treated Pancreatic Cancer. Health Data Science 2024;4:0108. [PMID: 38486621 PMCID: PMC10904073 DOI: 10.34133/hds.0108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2023] [Accepted: 12/08/2023] [Indexed: 03/17/2024]
	Abstract: Background: Gemcitabine is a first-line chemotherapy for pancreatic adenocarcinoma (PAAD), but many PAAD patients do not respond to gemcitabine-containing treatments. Being able to predict such nonresponders would hence permit the undelayed administration of more promising treatments while sparing those patients the life-threatening side effects of gemcitabine. Unfortunately, the few predictors of PAAD patient response to this drug are weak, none of them yet exploiting the power of machine learning (ML). Methods: Here, we applied ML to predict the response of PAAD patients to gemcitabine from the molecular profiles of their tumors. More concretely, we collected diverse molecular profiles of PAAD patient tumors along with the corresponding clinical data (gemcitabine responses and clinical features) from the Genomic Data Commons resource. From systematically combining 8 tumor profiles with 16 classification algorithms, each of the resulting 128 ML models was evaluated by multiple 10-fold cross-validations. Results: Only 7 of these 128 models were predictive, which underlines the importance of carrying out such a large-scale analysis to avoid missing the most predictive models. These were here random forest using 4 selected mRNAs [0.44 Matthews correlation coefficient (MCC), 0.785 receiver operating characteristic-area under the curve (ROC-AUC)] and XGBoost combining 12 DNA methylation probes (0.32 MCC, 0.697 ROC-AUC). By contrast, the hENT1 marker obtained much worse, random-level performance (practically 0 MCC, 0.5 ROC-AUC). Despite not being trained to predict prognosis (overall and progression-free survival), these ML models were also able to anticipate this patient outcome. Conclusions: We release these promising ML models so that they can be evaluated prospectively on other gemcitabine-treated PAAD patients.
4	The AI revolution in chemistry is not that far away. Nature 2023;624:252. [PMID: 38086935 DOI: 10.1038/d41586-023-03948-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
	Key Words: Chemistry; Machine learning
5	A practical guide to machine-learning scoring for structure-based virtual screening. Nat Protoc 2023;18:3460-3511. [PMID: 37845361 DOI: 10.1038/s41596-023-00885-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 02/08/2022] [Accepted: 07/03/2023] [Indexed: 10/18/2023]
	Abstract: Structure-based virtual screening (SBVS) via docking has been used to discover active molecules for a range of therapeutic targets. Chemical and protein data sets that contain integrated bioactivity information have increased both in number and in size. Artificial intelligence and, more concretely, its machine-learning (ML) branch, including deep learning, have effectively exploited these data sets to build scoring functions (SFs) for SBVS against targets with an atomic-resolution 3D model (e.g., generated by X-ray crystallography or predicted by AlphaFold2). Often outperforming their generic and non-ML counterparts, target-specific ML-based SFs represent the state of the art for SBVS. Here, we present a comprehensive and user-friendly protocol to build and rigorously evaluate these new SFs for SBVS. This protocol is organized into four sections: (i) using a public benchmark of a given target to evaluate an existing generic SF; (ii) preparing experimental data for a target from public repositories; (iii) partitioning data into a training set and a test set for subsequent target-specific ML modeling; and (iv) generating and evaluating target-specific ML SFs by using the prepared training-test partitions. All necessary code and input/output data related to three example targets (acetylcholinesterase, HMG-CoA reductase, and peroxisome proliferator-activated receptor-α) are available at https://github.com/vktrannguyen/MLSF-protocol, can be run by using a single computer within 1 week and make use of easily accessible software/programs (e.g., Smina, CNN-Score, RF-Score-VS and DeepCoy) and web resources. Our aim is to provide practical guidance on how to augment training data to enhance SBVS performance, how to identify the most suitable supervised learning algorithm for a data set, and how to build an SF with the highest likelihood of discovering target-active molecules within a given compound library.
	MeSH Headings: Artificial Intelligence; Acetylcholinesterase; Ligands; Machine Learning; Algorithms; Molecular Docking Simulation
6	A machine learning approach to predict cellular uptake of pBAE polyplexes. Biomater Sci 2023;11:5797-5808. [PMID: 37401742 DOI: 10.1039/d3bm00741c] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 07/05/2023]
	Abstract: The delivery of genetic material (DNA and RNA) to cells can cure a wide range of diseases but is limited by the delivery efficiency of the carrier system. Poly β-amino esters (pBAEs) are promising polymer-based vectors that form polyplexes with negatively charged oligonucleotides, enabling cell membrane uptake and gene delivery. pBAE backbone polymer chemistry, as well as terminal oligopeptide modifications, define cellular uptake and transfection efficiency in a given cell line, along with nanoparticle size and polydispersity. Moreover, uptake and transfection efficiency of a given polyplex formulation also vary from cell type to cell type. Therefore, finding the optimal formulation leading to high uptake in a new cell line is dictated by trial and error, and requires time and resources. Machine learning (ML) is an ideal in silico screening tool to learn the non-linearities of complex data sets, like the one presented herein, with the aim of predicting cellular internalisation of pBAE polyplexes. A library of pBAE nanoparticles was fabricated and the uptake studied in 4 different cell lines, on which various ML models were successfully trained. The best performing models were found to be gradient-boosted trees and neural networks. The gradient-boosted trees model was then analysed using SHapley Additive exPlanations, to interpret the model and gain an understanding into the important features and their impact on the predicted outcome.
	MeSH Headings: Polymers; Transfection; DNA; Gene Transfer Techniques; Cell Line; Nanoparticles
7	Beware of Simple Methods for Structure-Based Virtual Screening: The Critical Importance of Broader Comparisons. J Chem Inf Model 2023;63:1401-1405. [PMID: 36848585 PMCID: PMC10015451 DOI: 10.1021/acs.jcim.3c00218] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Indexed: 03/01/2023]
	Abstract: We discuss how data unbiasing and simple methods such as protein-ligand Interaction FingerPrint (IFP) can overestimate virtual screening performance. We also show that IFP is strongly outperformed by target-specific machine-learning scoring functions, which were not considered in a recent report concluding that simple methods were better than machine-learning scoring functions at virtual screening.
8	On the Best Way to Cluster NCI-60 Molecules. Biomolecules 2023;13:biom13030498. [PMID: 36979433 PMCID: PMC10046274 DOI: 10.3390/biom13030498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Revised: 03/02/2023] [Accepted: 03/06/2023] [Indexed: 03/30/2023]
	Abstract: Machine learning-based models have been widely used in the early drug-design pipeline. To validate these models, cross-validation strategies have been employed, including those using clustering of molecules in terms of their chemical structures. However, the poor clustering of compounds will compromise such validation, especially on test molecules dissimilar to those in the training set. This study aims at finding the best way to cluster the molecules screened by the National Cancer Institute (NCI)-60 project by comparing hierarchical, Taylor-Butina, and uniform manifold approximation and projection (UMAP) clustering methods. The best-performing algorithm can then be used to generate clusters for model validation strategies. This study also aims at measuring the impact of removing outlier molecules prior to the clustering step. Clustering results are evaluated using three well-known clustering quality metrics. In addition, we compute an average similarity matrix to assess the quality of each cluster. The results show variation in clustering quality from method to method. The clusters obtained by the hierarchical and Taylor-Butina methods are more computationally expensive to use in cross-validation strategies, and both cluster the molecules poorly. In contrast, the UMAP method provides the best quality, and therefore we recommend it to analyze this highly valuable dataset.
	Key Words: NCI-60 panel; clustering; model validation; small molecules
	Grants: 775584 Consejo Nacional de Ciencia y Tecnología
9	Interpretable Machine Learning Models to Predict the Resistance of Breast Cancer Patients to Doxorubicin from Their microRNA Profiles. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2022;9:e2201501. [PMID: 35785523 PMCID: PMC9403644 DOI: 10.1002/advs.202201501] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 03/15/2022] [Revised: 06/02/2022] [Indexed: 05/05/2023]
	Abstract: Doxorubicin is a common treatment for breast cancer. However, not all patients respond to this drug, which sometimes causes life-threatening side effects. Accurately anticipating doxorubicin-resistant patients would therefore make it possible to spare them this risk while considering alternative treatments without delay. Stratifying patients based on molecular markers in their pretreatment tumors is a promising approach to advance toward this ambitious goal, but single-gene markers such as HER2 expression have not been shown to be sufficiently predictive. The recent availability of matched doxorubicin-response and diverse molecular profiles across breast cancer patients now permits analysis at a much larger scale. 16 machine learning algorithms and 8 molecular profiles are systematically evaluated on the same cohort of patients. Only 2 of the 128 resulting models are substantially predictive, showing that they can be easily missed by a standard-scale analysis. The best model is classification and regression tree (CART) nonlinearly combining 4 selected miRNA isoforms to predict doxorubicin response (median Matthews correlation coefficient (MCC) and area under the curve (AUC) of 0.56 and 0.80, respectively). By contrast, HER2 expression is significantly less predictive (median MCC and AUC of 0.14 and 0.57, respectively). As the predictive accuracy of this CART model increases with larger training sets, its update with future data should result in even better accuracy.
	Key Words: artificial intelligence; machine learning; multiomics; precision oncology; tumor profiling
	MeSH Headings: Algorithms; Breast Neoplasms/drug therapy; Breast Neoplasms/genetics; Doxorubicin/therapeutic use; Female; Humans; Machine Learning; MicroRNAs/genetics
	Grants: Indo-French Centre for the Promotion of Advanced Research - CEFIPRA; Petroleum Technology Development Fund (PTDF), Nigeria
10	Structure-based virtual screening for PDL1 dimerizers: Evaluating generic scoring functions. Curr Res Struct Biol 2022;4:206-210. [PMID: 35769111 PMCID: PMC9234010 DOI: 10.1016/j.crstbi.2022.06.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/14/2022] [Revised: 05/14/2022] [Accepted: 06/02/2022] [Indexed: 10/31/2022]
	Abstract: The interaction between PD1 and its ligand PDL1 has been shown to render tumor cells resistant to apoptosis and promote tumor progression. An innovative mechanism to inhibit the PD1/PDL1 interaction is PDL1 dimerization induced by small-molecule PDL1 binders. Structure-based virtual screening is a promising approach to discovering such small-molecule PD1/PDL1 inhibitors. Here we investigate which type of generic scoring functions is most suitable to tackle this problem. We consider CNN-Score, an ensemble of convolutional neural networks, as the representative of machine-learning scoring functions. We also evaluate Smina, a commonly used classical scoring function, and IFP, a top structural fingerprint similarity scoring function. These three types of scoring functions were evaluated on two test sets sharing the same set of small-molecule PD1/PDL1 inhibitors, but using different types of inactives: either true inactives (molecules with no in vitro PD1/PDL1 inhibition activity) or assumed inactives (property-matched decoy molecules generated from each active). On both test sets, CNN-Score performed much better than Smina, which in turn strongly outperformed IFP. The fact that the latter was the case, despite precluding any possibility of exploiting decoy bias, demonstrates the predictive value of CNN-Score for PDL1. These results suggest that re-scoring Smina-docked molecules with CNN-Score is a promising structure-based virtual screening method to discover new small-molecule inhibitors of this therapeutic target.
11	Artificial intelligence for drug response prediction in disease models. Brief Bioinform 2021;23:6398131. [PMID: 34655289 DOI: 10.1093/bib/bbab450] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/14/2022]
	MeSH Headings: Artificial Intelligence
12	Predicting Cancer Drug Response In Vivo by Learning an Optimal Feature Selection of Tumour Molecular Profiles. Biomedicines 2021;9:biomedicines9101319. [PMID: 34680436 PMCID: PMC8533095 DOI: 10.3390/biomedicines9101319] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/01/2021] [Revised: 09/22/2021] [Accepted: 09/23/2021] [Indexed: 12/17/2022]
	Abstract: (1) Background: Inter-tumour heterogeneity is one of cancer’s most fundamental features. Patient stratification based on drug response prediction is hence needed for effective anti-cancer therapy. However, single-gene markers of response are rare and/or may fail to achieve a significant impact in the clinic. Machine Learning (ML) is emerging as a particularly promising complementary approach to precision oncology. (2) Methods: Here we leverage comprehensive Patient-Derived Xenograft (PDX) pharmacogenomic data sets with dimensionality-reducing ML algorithms for this purpose. (3) Results: Combining multiple gene alterations via ML leads to better discrimination between sensitive and resistant PDXs in 19 of the 26 analysed cases. Highly predictive ML models employing concise gene lists were found for three cases: paclitaxel (breast cancer), binimetinib (breast cancer) and cetuximab (colorectal cancer). Interestingly, each of these multi-gene ML models identifies some treatment-responsive PDXs not harbouring the best actionable mutation for that case. Thus, ML multi-gene predictors generally have far fewer false negatives than the corresponding single-gene marker. (4) Conclusions: As PDXs often recapitulate clinical outcomes, these results suggest that many more patients could benefit from precision oncology if ML algorithms were also applied to existing clinical pharmacogenomics data, especially those algorithms generating classifiers combining data-selected gene alterations.
	Key Words: biomarker discovery; machine learning; patient-derived xenograft; precision oncology; tumour profiling
13	A gentle introduction to understanding preclinical data for cancer pharmaco-omic modeling. Brief Bioinform 2021;22:6343527. [PMID: 34368843 DOI: 10.1093/bib/bbab312] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/23/2021] [Revised: 06/25/2021] [Accepted: 07/20/2021] [Indexed: 12/16/2022]
	Abstract: A central goal of precision oncology is to administer an optimal drug treatment to each cancer patient. A common preclinical approach to tackle this problem has been to characterize the tumors of patients at the molecular and drug response levels, and employ the resulting datasets for predictive in silico modeling (mostly using machine learning). Understanding how and why the different variants of these datasets are generated is an important component of this process. This review focuses on providing such an introduction, aimed at scientists with little previous exposure to this research area.
	Key Words: machine learning; drug response; molecular profiling; pharmacogenomic modeling; phenotypic screening; precision oncology
14	NF-κB-dependent IRF1 activation programs cDC1 dendritic cells to drive antitumor immunity. Sci Immunol 2021;6:6/61/eabg3570. [PMID: 34244313 DOI: 10.1126/sciimmunol.abg3570] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Received: 12/29/2020] [Accepted: 06/02/2021] [Indexed: 11/02/2022]
	Abstract: Conventional type 1 dendritic cells (cDC1s) are critical for antitumor immunity. They acquire antigens from dying tumor cells and cross-present them to CD8⁺ T cells, promoting the expansion of tumor-specific cytotoxic T cells. However, the signaling pathways that govern the antitumor functions of cDC1s in immunogenic tumors are poorly understood. Using single-cell transcriptomics to examine the molecular pathways regulating intratumoral cDC1 maturation, we found nuclear factor κB (NF-κB) and interferon (IFN) pathways to be highly enriched in a subset of functionally mature cDC1s. We identified an NF-κB-dependent and IFN-γ-regulated gene network in cDC1s, including cytokines and chemokines specialized in the recruitment and activation of cytotoxic T cells. By mapping the trajectory of intratumoral cDC1 maturation, we demonstrated the dynamic reprogramming of tumor-infiltrating cDC1s by NF-κB and IFN signaling pathways. This maturation process was perturbed by specific inactivation of either NF-κB or IFN regulatory factor 1 (IRF1) in cDC1s, resulting in impaired expression of IFN-γ-responsive genes and consequently a failure to efficiently recruit and activate antitumoral CD8⁺ T cells. Last, we demonstrate the relevance of these findings to patients with melanoma, showing that activation of the NF-κB/IRF1 axis in association with cDC1s is linked with improved clinical outcome. The NF-κB/IRF1 axis in cDC1s may therefore represent an important focal point for the development of new diagnostic and therapeutic approaches to improve cancer immunotherapy.
15	Recent progress on the prospective application of machine learning to structure-based virtual screening. Curr Opin Chem Biol 2021;65:28-34. [PMID: 34052776 DOI: 10.1016/j.cbpa.2021.04.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 03/03/2021] [Revised: 04/13/2021] [Accepted: 04/23/2021] [Indexed: 12/30/2022]
	Abstract: As more bioactivity and protein structure data become available, scoring functions (SFs) using machine learning (ML) to leverage these data sets continue to gain further accuracy and broader applicability. Advances in our understanding of the optimal ways to train and evaluate these ML-based SFs have introduced further improvements. One of these advances is how to select the most suitable decoys (molecules assumed inactive) to train or test an ML-based SF on a given target. We also review the latest applications of ML-based SFs for prospective structure-based virtual screening (SBVS), with a focus on the observed improvement over those using classical SFs. Finally, we provide recommendations for future prospective SBVS studies based on the findings of recent methodological studies.
	Key Words: Artificial intelligence; Machine learning; Molecular docking; Scoring functions; Virtual screening
16	Identification and Validation of Carbonic Anhydrase II as the First Target of the Anti-Inflammatory Drug Actarit. Biomolecules 2020;10:biom10111570. [PMID: 33227945 PMCID: PMC7699199 DOI: 10.3390/biom10111570] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/26/2020] [Revised: 11/13/2020] [Accepted: 11/16/2020] [Indexed: 12/31/2022]
	Abstract: Background and purpose: Identifying the macromolecular targets of drug molecules is a fundamental aspect of drug discovery and pharmacology. Several drugs remain without known targets (orphan) despite large-scale in silico and in vitro target prediction efforts. Ligand-centric chemical-similarity-based methods for in silico target prediction have been found to be particularly powerful, but the question remains of whether they are able to discover targets for target-orphan drugs. Experimental Approach: We used one of these in silico methods to carry out a target prediction analysis for two orphan drugs: actarit and malotilate. The top target predicted for each drug was carbonic anhydrase II (CAII). Each drug was therefore quantitatively evaluated for CAII inhibition to validate these two prospective predictions. Key Results: Actarit showed in vitro concentration-dependent inhibition of CAII activity with submicromolar potency (IC₅₀ = 422 nM) whilst no consistent inhibition was observed for malotilate. Among the other 25 targets predicted for actarit, RORγ (RAR-related orphan receptor-gamma) is promising in that it is strongly related to actarit’s indication, rheumatoid arthritis (RA). Conclusion and Implications: This study is a proof-of-concept of the utility of MolTarPred for the fast and cost-effective identification of targets of orphan drugs. Furthermore, the mechanism of action of actarit as an anti-RA agent can now be re-examined from a CAII-inhibitor perspective, given existing relationships between this target and RA. Moreover, the confirmed CAII-actarit association supports investigating the repositioning of actarit on other CAII-linked indications (e.g., hypertension, epilepsy, migraine, anemia and bone, eye and cardiac disorders).
	Key Words: MolTarPred; actarit; carbonic anhydrase II; malotilate; target prediction
	MeSH Headings: Anti-Inflammatory Agents/administration & dosage; Antirheumatic Agents/administration & dosage; Arthritis, Rheumatoid/drug therapy; Arthritis, Rheumatoid/enzymology; Carbonic Anhydrase II/antagonists & inhibitors; Carbonic Anhydrase II/metabolism; Dose-Response Relationship, Drug; Drug Delivery Systems/methods; Humans; Phenylacetates/administration & dosage; Proof of Concept Study; Reproducibility of Results
17	Editorial: Intelligent Systems for Genome Functional Annotations. Front Genet 2020;11:915. [PMID: 33061935 PMCID: PMC7477101 DOI: 10.3389/fgene.2020.00915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2020] [Accepted: 07/23/2020] [Indexed: 11/27/2022]
	Key Words: functional annotation; gene annotation; intelligent system applications; machine learning; protein-protein interaction (PPI)
18	Concise Polygenic Models for Cancer-Specific Identification of Drug-Sensitive Tumors from Their Multi-Omics Profiles. Biomolecules 2020;10:E963. [PMID: 32604779 PMCID: PMC7356608 DOI: 10.3390/biom10060963] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/31/2020] [Revised: 06/20/2020] [Accepted: 06/22/2020] [Indexed: 12/15/2022]
	Abstract: In silico models to predict which tumors will respond to a given drug are necessary for Precision Oncology. However, predictive models are only available for a handful of cases (each case being a given drug acting on tumors of a specific cancer type). A way to generate predictive models for the remaining cases is with suitable machine learning algorithms that are yet to be applied to existing in vitro pharmacogenomics datasets. Here, we apply XGBoost integrated with a stringent feature selection approach, which is an algorithm that is advantageous for these high-dimensional problems. Thus, we identified and validated 118 predictive models for 62 drugs across five cancer types by exploiting four molecular profiles (sequence mutations, copy-number alterations, gene expression, and DNA methylation). Predictive models were found in each cancer type and with every molecular profile. On average, no omics profile or cancer type obtained models with higher predictive accuracy than the rest. However, within a given cancer type, some molecular profiles were overrepresented among predictive models. For instance, CNA profiles were predictive in breast invasive carcinoma (BRCA) cell lines, but not in small cell lung cancer (SCLC) cell lines where gene expression (GEX) and DNA methylation profiles were the most predictive. Lastly, we identified the best XGBoost model per cancer type and analyzed their selected features. For each model, some of the genes in the selected list had already been found to be individually linked to the response to that drug, providing additional evidence of the usefulness of these models and the merits of the feature selection scheme.
	Key Words: cancer pharmacogenomics; drug resistance; feature selection; machine learning; model interpretability
	MeSH Headings: Antineoplastic Agents/therapeutic use; Computational Biology; Humans; Machine Learning; Models, Statistical; Neoplasms/drug therapy
19	The impact of compound library size on the performance of scoring functions for structure-based virtual screening. Brief Bioinform 2020;22:5855396. [PMID: 32568385 DOI: 10.1093/bib/bbaa095] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 03/20/2020] [Revised: 04/20/2020] [Accepted: 04/28/2020] [Indexed: 12/20/2022] Open Abstract Larger training datasets have been shown to improve the accuracy of machine learning (ML)-based scoring functions (SFs) for structure-based virtual screening (SBVS). In addition, massive test sets for SBVS, known as ultra-large compound libraries, have been demonstrated to enable the fast discovery of selective drug leads with low-nanomolar potency. This proof-of-concept was carried out on two targets using a single docking tool along with its SF. It is thus unclear whether this high level of performance would generalise to other targets, docking tools and SFs. We found that screening a larger compound library results in more potent actives being identified in all six additional targets using a different docking tool along with its classical SF. Furthermore, we established that a way to improve the potency of the retrieved molecules further is to rank them with more accurate ML-based SFs (we found this to be true in four of the six targets; the difference was not significant in the remaining two targets). A 3-fold increase in average hit rate across targets was also achieved by the ML-based SFs. Lastly, we observed that classical and ML-based SFs often find different actives, which supports using both types of SFs on those targets. Key Words: big data docking drug design machine learning virtual screening. MESH Headings: Databases, Protein Machine Learning Molecular Docking Simulation Proteins/chemistry Proteins/genetics.
20	Machine‐learning scoring functions for structure‐based virtual screening. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2020. [DOI: 10.1002/wcms.1478] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 12/14/2022]
21	Machine‐learning scoring functions for structure‐based drug lead optimization. WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL MOLECULAR SCIENCE 2020. [DOI: 10.1002/wcms.1465] [Citation(s) in RCA: 53] [Impact Index Per Article: 13.3] [Indexed: 12/17/2022]
22	Paclitaxel Response Can Be Predicted With Interpretable Multi-Variate Classifiers Exploiting DNA-Methylation and miRNA Data. Front Genet 2019;10:1041. [PMID: 31708973 PMCID: PMC6823251 DOI: 10.3389/fgene.2019.01041] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 06/18/2019] [Accepted: 09/30/2019] [Indexed: 12/27/2022] Open Abstract To address the problem of resistance to paclitaxel treatment, we have investigated to what extent it is possible to predict Breast Cancer (BC) patient response to this drug. We carried out a large-scale tumor-based prediction analysis using data from the US National Cancer Institute’s Genomic Data Commons. These data sets comprise the responses of BC patients to paclitaxel along with six molecular profiles of their tumors. We assessed 10 Machine Learning (ML) algorithms on each of these profiles and evaluated the resulting 60 classifiers on the same BC patients. DNA methylation and miRNA profiles were the most informative overall. In combination with these two profiles, ML algorithms selecting the smallest subset of molecular features generated the most predictive classifiers: a complexity-optimized XGBoost classifier based on CpG island methylation extracted a subset of molecular factors relevant to predict paclitaxel response (AUC = 0.74). A CpG site methylation-based Decision Tree (DT) combining only 2 of the 22,941 considered CpG sites (AUC = 0.89) and a miRNA expression-based DT employing just 4 of the 337 analyzed mature miRNAs (AUC = 0.72) reveal the molecular types associated with paclitaxel-sensitive and resistant BC tumors. A literature review shows that features selected by these three classifiers have been individually linked to the cytotoxic-drug sensitivities and prognosis of BC patients. 
Our work leads to several molecular signatures, unearthed from methylome and miRNome, able to anticipate to some extent which BC tumors respond or not to paclitaxel. These results may provide insights to optimize paclitaxel therapies in clinical practice. Key Words: artificial intelligence biomarker discovery machine learning precision oncology tumor profiling.
23	Predicting Synergism of Cancer Drug Combinations Using NCI-ALMANAC Data. Front Chem 2019;7:509. [PMID: 31380352 PMCID: PMC6646421 DOI: 10.3389/fchem.2019.00509] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 05/17/2019] [Accepted: 07/02/2019] [Indexed: 12/15/2022] Open Abstract Drug combinations are of great interest for cancer treatment. Unfortunately, the discovery of synergistic combinations by purely experimental means is only feasible on small sets of drugs. In silico modeling methods can substantially widen this search by providing tools able to predict which of all possible combinations in a large compound library are synergistic. Here we investigate to what extent drug combination synergy can be predicted by exploiting the largest available dataset to date (NCI-ALMANAC, with over 290,000 synergy determinations). Each cell line is modeled using primarily two machine learning techniques, Random Forest (RF) and Extreme Gradient Boosting (XGBoost), on the datasets provided by NCI-ALMANAC. This large-scale predictive modeling study comprises more than 5,000 pair-wise drug combinations, 60 cell lines, 4 types of models, and 5 types of chemical features. The application of a powerful, yet uncommonly used, RF-specific technique for reliability prediction is also investigated. The evaluation of these models shows that it is possible to predict the synergy of unseen drug combinations with high accuracy (Pearson correlations between 0.43 and 0.86 depending on the considered cell line, with XGBoost providing slightly better predictions than RF). We have also found that restricting to the most reliable synergy predictions results in at least 2-fold error decrease with respect to employing the best learning algorithm without any reliability estimation. 
Alkylating agents, tyrosine kinase inhibitors and topoisomerase inhibitors are the drugs whose synergy with other partner drugs is better predicted by the models. Despite its leading size, NCI-ALMANAC comprises an extremely small part of all conceivable combinations. Given their accuracy and reliability estimation, the developed models should drastically reduce the number of required in vitro tests by predicting in silico which of the considered combinations are likely to be synergistic. Key Words: QSAR (qualitative structure-activity relationships) chemoinformatics drug synergy machine learning predictive (QSPR) models.
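The abstract of entry 23 reports accuracy as a Pearson correlation between measured and predicted synergy per cell line. As a minimal sketch of that metric only (not the authors' pipeline), the following computes it in plain Python; the function name `pearson` and the toy score lists are assumptions made for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical measured and predicted synergy scores for one cell line.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]
r = pearson(measured, predicted)
```

A value near 1 means predictions rank and scale with the measurements; the abstract's range of 0.43 to 0.86 would correspond to one such value per cell line.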
24	Machine Learning for Molecular Modelling in Drug Design. Biomolecules 2019;9:biom9060216. [PMID: 31167503 PMCID: PMC6627644 DOI: 10.3390/biom9060216] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 06/03/2019] [Accepted: 06/03/2019] [Indexed: 01/28/2023] Open Abstract Machine learning (ML) has become a crucial component of early drug discovery. This research area has been fueled by two main factors [...]. MESH Headings: Drug Design Machine Learning Models, Molecular.
25	MolTarPred: A web tool for comprehensive target prediction with reliability estimation. Chem Biol Drug Des 2019;94:1390-1401. [PMID: 30916462 DOI: 10.1111/cbdd.13516] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 11/05/2018] [Revised: 02/07/2019] [Accepted: 03/03/2019] [Indexed: 12/17/2022] Abstract Molecular target prediction can provide a starting point to understand the efficacy and side effects of phenotypic screening hits. Unfortunately, the vast majority of in silico target prediction methods are not available as web tools. Furthermore, these are limited in the number of targets that can be predicted, do not estimate which target predictions are more reliable and/or lack comprehensive retrospective validations. We present MolTarPred (http://moltarpred.marseille.inserm.fr/), a user-friendly web tool for predicting protein targets of small organic compounds. It is powered by a large knowledge base comprising 607,659 compounds and 4,553 macromolecular targets collected from the ChEMBL database. In about 1 min, the predicted targets for the supplied molecule will be listed in a table. The chemical structures of the query molecule and the most similar compounds annotated with the predicted target will also be shown to permit visual inspection and comparison. Practical examples of the use of MolTarPred are showcased. MolTarPred is a new resource for scientists that require a more complete knowledge of the polypharmacology of a molecule. The introduction of a reliability score constitutes an attractive functionality of MolTarPred, as it permits focusing experimental confirmatory tests on the most reliable predictions, which leads to higher prospective hit rates. Key Words: polypharmacology prediction target deconvolution target fishing target prediction webserver.
26	Classical scoring functions for docking are unable to exploit large volumes of structural and interaction data. Bioinformatics 2019;35:3989-3995. [DOI: 10.1093/bioinformatics/btz183] [Citation(s) in RCA: 41] [Impact Index Per Article: 8.2] [Received: 11/07/2018] [Revised: 02/04/2019] [Accepted: 03/13/2019] [Indexed: 12/15/2022] Open Abstract Motivation: Studies have shown that the accuracy of random forest (RF)-based scoring functions (SFs), such as RF-Score-v3, increases with more training samples, whereas that of classical SFs, such as X-Score, does not. Nevertheless, the impact of the similarity between training and test samples on this matter has not been studied in a systematic manner. It is therefore unclear how these SFs would perform when only trained on protein-ligand complexes that are highly dissimilar or highly similar to the test set. It is also unclear whether SFs based on machine learning algorithms other than RF can also improve accuracy with increasing training set size and to what extent they learn from dissimilar or similar training complexes. Results: We present a systematic study to investigate how the accuracy of classical and machine-learning SFs varies with protein-ligand complex similarities between training and test sets. We considered three types of similarity metrics, based on the comparison of either protein structures, protein sequences or ligand structures. Regardless of the similarity metric, we found that incorporating a larger proportion of similar complexes to the training set did not make classical SFs more accurate. In contrast, RF-Score-v3 was able to outperform X-Score even when trained on just 32% of the most dissimilar complexes, showing that its superior performance owes considerably to learning from dissimilar training complexes to those in the test set. 
In addition, we generated the first SF employing Extreme Gradient Boosting (XGBoost), XGB-Score, and observed that it also improves with training set size while outperforming the rest of SFs. Given the continuous growth of training datasets, the development of machine-learning SFs has become very appealing. Availability and implementation: https://github.com/HongjianLi/MLSF. Supplementary information: Supplementary data are available at Bioinformatics online.
27	Building Machine-Learning Scoring Functions for Structure-Based Prediction of Intermolecular Binding Affinity. Methods Mol Biol 2019;2053:1-12. [PMID: 31452095 DOI: 10.1007/978-1-4939-9752-7_1] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/10/2023] Abstract Molecular docking enables large-scale prediction of whether and how small molecules bind to a macromolecular target. Machine-learning scoring functions are particularly well suited to predict the strength of this interaction. Here we describe how to build RF-Score, a scoring function utilizing the machine-learning technique known as Random Forest (RF). We also point out how to use different data, features, and regression models using either R or Python programming languages. Key Words: Binding affinity Docking Machine learning Scoring function. MESH Headings: Databases, Genetic Ligands Machine Learning Models, Molecular Protein Binding Proteins/chemistry Quantitative Structure-Activity Relationship Software Web Browser Workflow.
28	A Stochastic Spiking Neural Network for Virtual Screening. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2018;29:1371-1375. [PMID: 28186913 DOI: 10.1109/tnnls.2017.2657601] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 06/06/2023] Abstract Virtual screening (VS) has become a key computational tool in early drug design and screening performance is of high relevance due to the large volume of data that must be processed to identify molecules with the sought activity-related pattern. At the same time, the hardware implementations of spiking neural networks (SNNs) arise as an emerging computing technique that can be applied to parallelize processes that normally present a high cost in terms of computing time and power. Consequently, SNN represents an attractive alternative to perform time-consuming processing tasks, such as VS. In this brief, we present a smart stochastic spiking neural architecture that implements the ultrafast shape recognition (USR) algorithm, achieving a speed improvement of two orders of magnitude with respect to USR software implementations. The neural system is implemented in hardware using field-programmable gate arrays allowing a highly parallelized USR implementation. The results show that, due to the high parallelization of the system, millions of compounds can be checked in reasonable times. From these results, we can state that the proposed architecture arises as a feasible methodology to efficiently enhance time-consuming data-mining processes such as 3-D molecular similarity search.
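Entry 28 names the ultrafast shape recognition (USR) algorithm without describing it. The sketch below follows the commonly described software formulation: 12 descriptors built from the distributions of atomic distances to four reference points, compared through an inverse Manhattan-distance score. It is an illustration under stated assumptions, not the hardware design of the paper; in particular, the choice of mean, variance and skewness as the three moments is an assumption (implementations differ here), and all function names and coordinates are hypothetical.

```python
import math


def _moments(values):
    """First three moments (mean, variance, skewness) of a distance list.
    Assumption: other USR implementations use differently scaled moments."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    third = sum((v - mean) ** 3 for v in values) / n
    skew = third / var ** 1.5 if var > 0 else 0.0
    return [mean, var, skew]


def usr_descriptors(coords):
    """12 shape descriptors from a list of (x, y, z) atom coordinates."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))  # centroid
    d_ctd = [math.dist(c, ctd) for c in coords]
    cst = coords[d_ctd.index(min(d_ctd))]  # atom closest to the centroid
    fct = coords[d_ctd.index(max(d_ctd))]  # atom farthest from the centroid
    d_fct = [math.dist(c, fct) for c in coords]
    ftf = coords[d_fct.index(max(d_fct))]  # atom farthest from fct
    d_cst = [math.dist(c, cst) for c in coords]
    d_ftf = [math.dist(c, ftf) for c in coords]
    descriptors = []
    for dists in (d_ctd, d_cst, d_fct, d_ftf):
        descriptors.extend(_moments(dists))
    return descriptors


def usr_similarity(desc_a, desc_b):
    """Score in (0, 1]; 1 means identical descriptor vectors."""
    mean_abs_diff = sum(abs(a - b) for a, b in zip(desc_a, desc_b)) / len(desc_a)
    return 1.0 / (1.0 + mean_abs_diff)


# Hypothetical five-atom molecule and a translated copy of it.
mol = [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (2.1, 1.2, 0.0),
       (3.5, 1.3, 0.4), (0.2, -1.1, 0.9)]
shifted = [(x + 3.0, y - 2.0, z + 5.0) for x, y, z in mol]
sim = usr_similarity(usr_descriptors(mol), usr_descriptors(shifted))
```

Because the descriptors depend only on interatomic distances, the translated copy scores essentially 1. Comparing two molecules costs only 12 subtractions, which is what makes the method a natural candidate for the parallel hardware the abstract describes.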
29	The Impact of Protein Structure and Sequence Similarity on the Accuracy of Machine-Learning Scoring Functions for Binding Affinity Prediction. Biomolecules 2018. [PMID: 29538331 PMCID: PMC5871981 DOI: 10.3390/biom8010012] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Indexed: 12/22/2022] Open Abstract It has recently been claimed that the outstanding performance of machine-learning scoring functions (SFs) is exclusively due to the presence of training complexes with highly similar proteins to those in the test set. Here, we revisit this question using 24 similarity-based training sets, a widely used test set, and four SFs. Three of these SFs employ machine learning instead of the classical linear regression approach of the fourth SF (X-Score, which has the best test set performance out of 16 classical SFs). We have found that random forest (RF)-based RF-Score-v3 outperforms X-Score even when 68% of the most similar proteins are removed from the training set. In addition, unlike X-Score, RF-Score-v3 is able to keep learning with an increasing training set size, becoming substantially more predictive than X-Score when the full 1105 complexes are used for training. These results show that machine-learning SFs owe a substantial part of their performance to training on complexes with dissimilar proteins to those in the test set, against what has been previously concluded using the same data. Given that a growing amount of structural and interaction data will be available from academic and industrial sources, this performance gap between machine-learning SFs and classical SFs is expected to widen in the future. 
Key Words: binding affinity prediction machine learning molecular docking scoring function. MESH Headings: Machine Learning Molecular Docking Simulation/standards Protein Interaction Mapping/methods Protein Interaction Mapping/standards Sequence Analysis, Protein/standards.
30	Unearthing new genomic markers of drug response by improved measurement of discriminative power. BMC Med Genomics 2018;11:10. [PMID: 29409485 PMCID: PMC5801688 DOI: 10.1186/s12920-018-0336-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/14/2016] [Accepted: 01/29/2018] [Indexed: 12/29/2022] Open Abstract Background: Oncology drugs are only effective in a small proportion of cancer patients. Our current ability to identify these responsive patients before treatment is still poor in most cases. Thus, there is a pressing need to discover response markers for marketed and research oncology drugs. Screening these drugs against a large panel of cancer cell lines has led to the discovery of new genomic markers of in vitro drug response. However, while the identification of such markers among thousands of candidate drug-gene associations in the data is error-prone, an appraisal of the effectiveness of such a detection task is currently lacking. Methods: Here we present a new non-parametric method for measuring the discriminative power of a drug-gene association. Unlike parametric statistical tests, the adopted non-parametric test has the advantage of not making strong assumptions about the data distorting the identification of genomic markers. Furthermore, we introduce a new benchmark to further validate these markers in vitro using more recent data not used to identify the markers. Results: The application of this new methodology has led to the identification of 128 new genomic markers distributed across 61% of the analysed drugs, including 5 drugs without previously known markers, which were missed by the MANOVA test initially applied to analyse data from the Genomics of Drug Sensitivity in Cancer consortium. 
Conclusions: Discovering markers using more than one statistical test and testing them on independent data is unusual. We found this helpful to discard statistically significant drug-gene associations that were actually spurious correlations. This approach also revealed new, independently validated, in vitro markers of drug response such as Temsirolimus-CDKN2A (resistance) and Gemcitabine-EWS_FLI1 (sensitivity).
31	Drug repurposing for aging research using model organisms. Aging Cell 2017. [PMID: 28620943 PMCID: PMC5595691 DOI: 10.1111/acel.12626] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 02/03/2023] Open Abstract Many increasingly prevalent diseases share a common risk factor: age. However, little is known about pharmaceutical interventions against aging, despite many genes and pathways shown to be important in the aging process and numerous studies demonstrating that genetic interventions can lead to a healthier aging phenotype. An important challenge is to assess the potential to repurpose existing drugs for initial testing on model organisms, where such experiments are possible. To this end, we present a new approach to rank drug-like compounds with known mammalian targets according to their likelihood to modulate aging in the invertebrates Caenorhabditis elegans and Drosophila. Our approach combines information on genetic effects on aging, orthology relationships and sequence conservation, 3D protein structures, drug binding and bioavailability. Overall, we rank 743 different drug-like compounds for their likelihood to modulate aging. We provide various lines of evidence for the successful enrichment of our ranking for compounds modulating aging, despite sparse public data suitable for validation. The top ranked compounds are thus prime candidates for in vivo testing of their effects on lifespan in C. elegans or Drosophila. As such, these compounds are promising as research tools and ultimately a step towards identifying drugs for a healthier human aging. Key Words: C. elegans Drosophila aging computational predictions drug repurposing lifespan.
32	Precision and recall oncology: combining multiple gene mutations for improved identification of drug-sensitive tumours. Oncotarget 2017;8:97025-97040. [PMID: 29228590 PMCID: PMC5722542 DOI: 10.18632/oncotarget.20923] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Received: 03/16/2017] [Accepted: 08/14/2017] [Indexed: 02/07/2023] Open Abstract Cancer drug therapies are only effective in a small proportion of patients. To make things worse, our ability to identify these responsive patients before administering a treatment is generally very limited. The recent arrival of large-scale pharmacogenomic data sets, which measure the sensitivity of molecularly profiled cancer cell lines to a panel of drugs, has boosted research on the discovery of drug sensitivity markers. However, no systematic comparison of widely-used single-gene markers with multi-gene machine-learning markers exploiting genomic data has so far been conducted. We therefore assessed the performance offered by these two types of models in discriminating between sensitive and resistant cell lines to a given drug. This was carried out for each of 127 considered drugs using genomic data characterising the cell lines. We found that the proportion of cell lines predicted to be sensitive that are actually sensitive (precision) varies strongly with the drug and type of model used. Furthermore, the proportion of sensitive cell lines that are correctly predicted as sensitive (recall) of the best single-gene marker was lower than that of the multi-gene marker in 118 of the 127 tested drugs. We conclude that single-gene markers are only able to identify those drug-sensitive cell lines with the considered actionable mutation, unlike multi-gene markers that can in principle combine multiple gene mutations to identify additional sensitive cell lines. 
We also found that cell line sensitivities to some drugs (e.g. Temsirolimus, 17-AAG or Methotrexate) are better predicted by these machine-learning models. Key Words: biomarker discovery cancer drug sensitivity genomics machine learning.
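The abstract of entry 32 defines precision as the proportion of cell lines predicted to be sensitive that are actually sensitive, and recall as the proportion of sensitive cell lines that are correctly predicted as sensitive. A minimal sketch of those two definitions follows; the function name and the cell line labels are hypothetical and not taken from the paper.

```python
def precision_recall(predicted_sensitive, actually_sensitive):
    """Return (precision, recall) for one drug from two collections of cell line IDs."""
    predicted = set(predicted_sensitive)
    actual = set(actually_sensitive)
    true_positives = len(predicted & actual)
    # Convention chosen here: report 0.0 when a denominator is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical example: a marker flags 4 cell lines, 2 of which are truly
# sensitive, while 5 cell lines are sensitive in total.
flagged = ["A", "B", "C", "D"]
sensitive = ["A", "B", "E", "F", "G"]
precision, recall = precision_recall(flagged, sensitive)  # 0.5 and 0.4
```

A single-gene marker that flags only cell lines carrying one mutation can score well on precision while missing most sensitive cell lines, which is the low-recall pattern the abstract reports for 118 of the 127 drugs.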
33	Predicting the Reliability of Drug-target Interaction Predictions with Maximum Coverage of Target Space. Sci Rep 2017. [PMID: 28630414 PMCID: PMC5476590 DOI: 10.1038/s41598-017-04264-w] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Indexed: 02/05/2023] Open Abstract Many computational methods to predict the macromolecular targets of small organic molecules have been presented to date. Despite progress, target prediction methods still have important limitations. For example, the most accurate methods implicitly restrict their predictions to a relatively small number of targets, are not systematically validated on drugs (whose targets are harder to predict than those of non-drug molecules) and often lack a reliability score associated with each predicted target. Here we present a systematic validation of ligand-centric target prediction methods on a set of clinical drugs. These methods exploit a knowledge-base covering 887,435 known ligand-target associations between 504,755 molecules and 4,167 targets. Based on this dataset, we provide a new estimate of the polypharmacology of drugs, which on average have 11.5 targets below IC₅₀ 10 µM. The average performance achieved across clinical drugs is remarkable (0.348 precision and 0.423 recall, with large drug-dependent variability), especially given the unusually large coverage of the target space. Furthermore, we show how a sparse ligand-target bioactivity matrix to retrospectively validate target prediction methods could underestimate prospective performance. Lastly, we present and validate a first-in-kind score capable of accurately predicting the reliability of target predictions.
34	Performance of machine-learning scoring functions in structure-based virtual screening. Sci Rep 2017;7:46710. [PMID: 28440302 PMCID: PMC5404222 DOI: 10.1038/srep46710] [Citation(s) in RCA: 188] [Impact Index Per Article: 26.9] [Received: 12/01/2016] [Accepted: 03/23/2017] [Indexed: 12/23/2022] Open Abstract Classical scoring functions have reached a plateau in their performance in virtual screening and binding affinity prediction. Recently, machine-learning scoring functions trained on protein-ligand complexes have shown great promise in small tailored studies. They have also raised controversy, specifically concerning model overfitting and applicability to novel targets. Here we provide a new ready-to-use scoring function (RF-Score-VS) trained on 15 426 active and 893 897 inactive molecules docked to a set of 102 targets. We use the full DUD-E data sets along with three docking tools, five classical and three machine-learning scoring functions for model building and performance assessment. Our results show RF-Score-VS can substantially improve virtual screening performance: RF-Score-VS top 1% provides 55.6% hit rate, whereas that of Vina is only 16.2% (for smaller percentages the difference is even more encouraging: RF-Score-VS top 0.1% achieves 88.6% hit rate versus 27.5% using Vina). In addition, RF-Score-VS provides much better prediction of measured binding affinity than Vina (Pearson correlation of 0.56 and −0.18, respectively). Lastly, we test RF-Score-VS on an independent test set from the DEKOIS benchmark and observe comparable results. We provide full data sets to facilitate further research in this area (http://github.com/oddt/rfscorevs) as well as ready-to-use RF-Score-VS (http://github.com/oddt/rfscorevs_binary).
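Entry 34 compares scoring functions by their hit rate in the top 1% and top 0.1% of the ranked library. As a rough sketch of how such a figure can be computed (the function name, the rounding rule for the cutoff, and the toy scores are assumptions, not the paper's evaluation code):

```python
def hit_rate_at_top(scores, is_active, fraction):
    """Proportion of actives among the top-scoring `fraction` of molecules.

    Assumes a higher score means a better predicted binder; scores where
    lower is better would need their sign flipped first.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if len(scores) != len(is_active) or not scores:
        raise ValueError("scores and labels must be non-empty and aligned")
    # Assumption: round to the nearest whole molecule, keeping at least one.
    n_top = max(1, round(fraction * len(scores)))
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits = sum(1 for _, active in ranked[:n_top] if active)
    return hits / n_top


# Hypothetical library of 10 molecules, 3 of them active.
scores = [9.1, 8.7, 8.2, 7.9, 7.5, 6.8, 6.1, 5.5, 4.9, 4.2]
actives = [True, False, True, False, False, False, True, False, False, False]
top20 = hit_rate_at_top(scores, actives, 0.2)  # top 2 molecules: 1 active
```

In this toy case the top 20% contains one active out of two molecules, a hit rate of 0.5, against a base rate of 0.3 for the whole library.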
35	Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016;5. [PMID: 28299173 PMCID: PMC5310525 DOI: 10.12688/f1000research.10529.2] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Accepted: 03/10/2017] [Indexed: 12/19/2022] Open Abstract Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by the Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC₅₀ measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation. 
Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: Thanks to this unbiased validation, we now know that this type of model can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz. Key Words: benchmarking bioinformatics biomarkers drug response machine learning pharmacogenomics pharmacotranscriptomics precision oncology.
36	Systematic assessment of multi-gene predictors of pan-cancer cell line sensitivity to drugs exploiting gene expression data. F1000Res 2016;5. [PMID: 28299173 DOI: 10.12688/f1000research.10529.1] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Accepted: 12/28/2016] [Indexed: 12/30/2022] Open Abstract Background: Selected gene mutations are routinely used to guide the selection of cancer drugs for a given patient tumour. Large pharmacogenomic data sets, such as those by the Genomics of Drug Sensitivity in Cancer (GDSC) consortium, were introduced to discover more of these single-gene markers of drug sensitivity. Very recently, machine learning regression has been used to investigate how well cancer cell line sensitivity to drugs is predicted depending on the type of molecular profile. The latter has revealed that gene expression data is the most predictive profile in the pan-cancer setting. However, no study to date has exploited GDSC data to systematically compare the performance of machine learning models based on multi-gene expression data against that of widely-used single-gene markers based on genomics data. Methods: Here we present this systematic comparison using Random Forest (RF) classifiers exploiting the expression levels of 13,321 genes and an average of 501 tested cell lines per drug. To account for time-dependent batch effects in IC₅₀ measurements, we employ independent test sets generated with more recent GDSC data than that used to train the predictors and show that this is a more realistic validation than standard k-fold cross-validation. Results and Discussion: Across 127 GDSC drugs, our results show that the single-gene markers unveiled by the MANOVA analysis tend to achieve higher precision than these RF-based multi-gene models, at the cost of generally having a poor recall (i.e. 
correctly detecting only a small part of the cell lines sensitive to the drug). Regarding overall classification performance, about two thirds of the drugs are better predicted by the multi-gene RF classifiers. Among the drugs with the most predictive of these models, we found pyrimethamine, sunitinib and 17-AAG. Conclusions: Thanks to this unbiased validation, we now know that this type of models can predict in vitro tumour response to some of these drugs. These models can thus be further investigated on in vivo tumour models. R code to facilitate the construction of alternative machine learning models and their validation in the presented benchmark is available at http://ballester.marseille.inserm.fr/gdsc.transcriptomicDatav2.tar.gz. Collapse Key Words benchmarking bioinformatics biomarkers drug response machine learning pharmacogenomics pharmacotranscriptomics precision oncology Collapse MESH Headings Collapse Grants Collapse
37	Correcting the impact of docking pose generation error on binding affinity prediction. BMC Bioinformatics 2016;17:308. [PMID: 28185549 PMCID: PMC5046193 DOI: 10.1186/s12859-016-1169-4] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Indexed: 01/11/2023] Open Abstract Background: Pose generation error is usually quantified as the difference between the geometry of the pose generated by the docking software and that of the same molecule co-crystallised with the considered protein. Surprisingly, the impact of this error on binding affinity prediction is yet to be systematically analysed across diverse protein-ligand complexes. Results: Against commonly-held views, we have found that pose generation error has generally a small impact on the accuracy of binding affinity prediction. This is also true for large pose generation errors and it is not only observed with machine-learning scoring functions, but also with classical scoring functions such as AutoDock Vina. Furthermore, we propose a procedure to correct a substantial part of this error which consists of calibrating the scoring functions with re-docked, rather than co-crystallised, poses. In this way, the relationship between Vina-generated protein-ligand poses and their binding affinities is directly learned. As a result, test set performance after this error-correcting procedure is much closer to that of predicting the binding affinity in the absence of pose generation error (i.e. on crystal structures). We evaluated several strategies, obtaining better results for those using a single docked pose per ligand than those using multiple docked poses per ligand. Conclusions: Binding affinity prediction is often carried out on the docked pose of a known binder rather than its co-crystallised pose. Our results suggest that pose generation error is in general far less damaging for binding affinity prediction than it is currently believed. Another contribution of our study is the proposal of a procedure that largely corrects for this error. The resulting machine-learning scoring function is freely available at http://istar.cse.cuhk.edu.hk/rf-score-4.tgz and http://ballester.marseille.inserm.fr/rf-score-4.tgz. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1169-4) contains supplementary material, which is available to authorized users. Key Words: Binding affinity Drug discovery Machine learning Molecular docking
38	USR-VS: a web server for large-scale prospective virtual screening using ultrafast shape recognition techniques. Nucleic Acids Res 2016;44:W436-41. [PMID: 27106057 PMCID: PMC4987897 DOI: 10.1093/nar/gkw320] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 02/05/2016] [Accepted: 04/06/2016] [Indexed: 12/12/2022] Open Abstract Ligand-based Virtual Screening (VS) methods aim at identifying molecules with a similar activity profile across phenotypic and macromolecular targets to that of a query molecule used as search template. VS using 3D similarity methods have the advantage of biasing this search toward active molecules with innovative chemical scaffolds, which are highly sought after in drug design to provide novel leads with improved properties over the query molecule (e.g. patentable, of lower toxicity or increased potency). Ultrafast Shape Recognition (USR) has demonstrated excellent performance in the discovery of molecules with previously-unknown phenotypic or target activity, with retrospective studies suggesting that its pharmacophoric extension (USRCAT) should obtain even better hit rates once it is used prospectively. Here we present USR-VS (http://usr.marseille.inserm.fr/), the first web server using these two validated ligand-based 3D methods for large-scale prospective VS. In about 2 s, 93.9 million 3D conformers, expanded from 23.1 million purchasable molecules, are screened and the 100 most similar molecules among them in terms of 3D shape and pharmacophoric properties are shown. USR-VS functionality also provides interactive visualization of the similarity of the query molecule against the hit molecules as well as vendor information to purchase selected hits in order to be experimentally tested.
39	How Reliable Are Ligand-Centric Methods for Target Fishing? Front Chem 2016;4:15. [PMID: 27148522 PMCID: PMC4830838 DOI: 10.3389/fchem.2016.00015] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Received: 02/24/2016] [Accepted: 03/24/2016] [Indexed: 12/18/2022] Open Abstract Computational methods for Target Fishing (TF), also known as Target Prediction or Polypharmacology Prediction, can be used to discover new targets for small-molecule drugs. This may result in repositioning the drug in a new indication or improving our current understanding of its efficacy and side effects. While there is a substantial body of research on TF methods, there is still a need to improve their validation, which is often limited to a small part of the available targets and not easily interpretable by the user. Here we discuss how target-centric TF methods are inherently limited by the number of targets that they can possibly predict (this number is by construction much larger in ligand-centric techniques). We also propose a new benchmark to validate TF methods, which is particularly suited to analyse how predictive performance varies with the query molecule. On average over approved drugs, we estimate that only five predicted targets will have to be tested to find two true targets with submicromolar potency (a strong variability in performance is however observed). In addition, we find that an approved drug has currently an average of eight known targets, which reinforces the notion that polypharmacology is a common and strong event. Furthermore, with the assistance of a control group of randomly-selected molecules, we show that the targets of approved drugs are generally harder to predict. The benchmark and a simple target prediction method to use as a performance baseline are available at http://ballester.marseille.inserm.fr/TF-benchmark.tar.gz. Key Words: drug repositioning polypharmacology prediction target prediction virtual screening
40	Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening. WILEY INTERDISCIPLINARY REVIEWS. COMPUTATIONAL MOLECULAR SCIENCE 2015;5:405-424. [PMID: 27110292 PMCID: PMC4832270 DOI: 10.1002/wcms.1225] [Citation(s) in RCA: 187] [Impact Index Per Article: 20.8] [Received: 06/03/2015] [Revised: 07/17/2015] [Accepted: 07/18/2015] [Indexed: 12/29/2022] Abstract Docking tools to predict whether and how a small molecule binds to a target can be applied if a structural model of such target is available. The reliability of docking depends, however, on the accuracy of the adopted scoring function (SF). Despite intense research over the years, improving the accuracy of SFs for structure-based binding affinity prediction or virtual screening has proven to be a challenging task for any class of method. New SFs based on modern machine-learning regression models, which do not impose a predetermined functional form and thus are able to exploit effectively much larger amounts of experimental data, have recently been introduced. These machine-learning SFs have been shown to outperform a wide range of classical SFs at both binding affinity prediction and virtual screening. The emerging picture from these studies is that the classical approach of using linear regression with a small number of expert-selected structural features can be strongly improved by a machine-learning approach based on nonlinear regression allied with comprehensive data-driven feature selection. Furthermore, the performance of classical SFs does not grow with larger training datasets and hence this performance gap is expected to widen as more training data becomes available in the future. Other topics covered in this review include predicting the reliability of a SF on a particular target class, generating synthetic data to improve predictive performance and modeling guidelines for SF development. WIREs Comput Mol Sci 2015, 5:405-424. doi: 10.1002/wcms.1225 Grants: G0902106 Medical Research Council
41	Low-Quality Structural and Interaction Data Improves Binding Affinity Prediction via Random Forest. Molecules 2015;20:10947-62. [PMID: 26076113 PMCID: PMC6272292 DOI: 10.3390/molecules200610947] [Citation(s) in RCA: 58] [Impact Index Per Article: 6.4] [Received: 03/13/2015] [Revised: 06/04/2015] [Accepted: 06/09/2015] [Indexed: 12/17/2022] Open Abstract Docking scoring functions can be used to predict the strength of protein-ligand binding. It is widely believed that training a scoring function with low-quality data is detrimental for its predictive performance. Nevertheless, there is a surprising lack of systematic validation experiments in support of this hypothesis. In this study, we investigated to which extent training a scoring function with data containing low-quality structural and binding data is detrimental for predictive performance. We actually found that low-quality data is not only non-detrimental, but beneficial for the predictive performance of machine-learning scoring functions, though the improvement is less important than that coming from high-quality data. Furthermore, we observed that classical scoring functions are not able to effectively exploit data beyond an early threshold, regardless of its quality. This demonstrates that exploiting a larger data volume is more important for the performance of machine-learning scoring functions than restricting to a smaller set of higher data quality. Key Words: binding affinity prediction docking machine-learning scoring functions MESH Headings: Models, Theoretical Structure-Activity Relationship
42	Improving AutoDock Vina Using Random Forest: The Growing Accuracy of Binding Affinity Prediction by the Effective Exploitation of Larger Data Sets. Mol Inform 2015;34:115-26. [PMID: 27490034 DOI: 10.1002/minf.201400132] [Citation(s) in RCA: 150] [Impact Index Per Article: 16.7] [Received: 09/23/2014] [Accepted: 12/06/2014] [Indexed: 12/28/2022] Abstract There is a growing body of evidence showing that machine learning regression results in more accurate structure-based prediction of protein-ligand binding affinity. Docking methods that aim at optimizing the affinity of ligands for a target rely on how accurate their predicted ranking is. However, despite their proven advantages, machine-learning scoring functions are still not widely applied. This seems to be due to insufficient understanding of their properties and the lack of user-friendly software implementing them. Here we present a study where the accuracy of AutoDock Vina, arguably the most commonly-used docking software, is strongly improved by following a machine learning approach. We also analyse the factors that are responsible for this improvement and their generality. Most importantly, with the help of a proposed benchmark, we demonstrate that this improvement will be larger as more data becomes available for training Random Forest models, as regression models implying additive functional forms do not improve with more training data. We discuss how the latter opens the door to new opportunities in scoring function development. In order to facilitate the translation of this advance to enhance structure-based molecular design, we provide software to directly re-score Vina-generated poses and thus strongly improve their predicted binding affinity. The software is available at http://istar.cse.cuhk.edu.hk/rf-score-3.tgz and http://crcm.marseille.inserm.fr/fileadmin/rf-score-3.tgz. Key Words: Docking Drug lead optimization Machine learning
43	Biochemical evaluation of virtual screening methods reveals a cell-active inhibitor of the cancer-promoting phosphatases of regenerating liver. Eur J Med Chem 2014;88:89-100. [PMID: 25159123 PMCID: PMC4255093 DOI: 10.1016/j.ejmech.2014.08.060] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 06/11/2014] [Revised: 08/17/2014] [Accepted: 08/20/2014] [Indexed: 11/30/2022] Abstract Computationally supported development of small molecule inhibitors has successfully been applied to protein tyrosine phosphatases in the past, revealing a number of cell-active compounds. Similar approaches have also been used to screen for small molecule inhibitors for the cancer-related phosphatases of regenerating liver (PRL) family. Still, selective and cell-active compounds are of limited availability. Since especially PRL-3 remains an attractive drug target due to its clear role in cancer metastasis, such compounds are highly demanded. In this study, we investigated various virtual screening approaches for their applicability to identify novel small molecule entities for PRL-3 as target. Biochemical evaluation of purchasable compounds revealed ligand-based approaches as well suited for this target, compared to docking-based techniques that did not perform well in this context. The best hit of this study, a 2-cyano-2-ene-ester and hence a novel chemotype targeting the PRLs, was further optimized by a structure–activity-relationship (SAR) study, leading to a low micromolar PRL inhibitor with acceptable selectivity over other protein tyrosine phosphatases. The compound is active in cells, as shown by its ability to specifically revert PRL-3 induced cell migration, and exhibits similar effects on PRL-1 and PRL-2. It is furthermore suitable for fluorescence microscopy applications, and it is commercially available. These features make it the only purchasable, cell-active and acceptably selective PRL inhibitor to date that can be used in various cellular applications. • Computational ligand- and docking-based approaches were tested for PRL-3 as a target. • Ligand-based screening was proven a feasible approach for PRL-3 inhibitor discovery. • A low micromolar, non-competitive inhibitor with novel chemotype for PRLs was discovered. • The inhibitor efficiently blocks PRL induced cell migration. • The inhibitor is non-cytotoxic, commercially available and suitable for fluorescence microscopy applications. Key Words: 2-Cyano-2-ene-esters Dual specificity phosphatases Enzyme inhibitors Phosphatases of regenerating liver Thienopyridone Virtual screening methods
44	Does a more precise chemical description of protein-ligand complexes lead to more accurate prediction of binding affinity? J Chem Inf Model 2014;54:944-55. [PMID: 24528282 PMCID: PMC3966527 DOI: 10.1021/ci500091r] [Citation(s) in RCA: 127] [Impact Index Per Article: 12.7] [Indexed: 01/15/2023] Abstract Predicting the binding affinities of large sets of diverse molecules against a range of macromolecular targets is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for exploiting and analyzing the outputs of docking, which is in turn an important tool in problems such as structure-based drug design. Classical scoring functions assume a predetermined theory-inspired functional form for the relationship between the variables that describe an experimentally determined or modeled structure of a protein–ligand complex and its binding affinity. The inherent problem of this approach is in the difficulty of explicitly modeling the various contributions of intermolecular interactions to binding affinity. New scoring functions based on machine-learning regression models, which are able to exploit effectively much larger amounts of experimental data and circumvent the need for a predetermined functional form, have already been shown to outperform a broad range of state-of-the-art scoring functions in a widely used benchmark. Here, we investigate the impact of the chemical description of the complex on the predictive power of the resulting scoring function using a systematic battery of numerical experiments. The latter resulted in the most accurate scoring function to date on the benchmark. Strikingly, we also found that a more precise chemical description of the protein–ligand complex does not generally lead to a more accurate prediction of binding affinity. We discuss four factors that may contribute to this result: modeling assumptions, codependence of representation and regression, data restricted to the bound state, and conformational heterogeneity in data.
45	Prospective virtual screening for novel p53–MDM2 inhibitors using ultrafast shape recognition. J Comput Aided Mol Des 2014;28:89-97. [DOI: 10.1007/s10822-014-9732-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Received: 11/06/2013] [Accepted: 02/11/2014] [Indexed: 01/21/2023]
46	istar: a web platform for large-scale protein-ligand docking. PLoS One 2014;9:e85678. [PMID: 24475049 PMCID: PMC3901662 DOI: 10.1371/journal.pone.0085678] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.8] [Received: 05/15/2013] [Accepted: 12/05/2013] [Indexed: 11/18/2022] Open Abstract Protein-ligand docking is a key computational method in the design of starting points for the drug discovery process. We are motivated by the desire to automate large-scale docking using our popular docking engine idock and thus have developed a publicly-accessible web platform called istar. Without tedious software installation, users can submit jobs using our website. Our istar website supports 1) filtering ligands by desired molecular properties and previewing the number of ligands to dock, 2) monitoring job progress in real time, and 3) visualizing ligand conformations and outputting free energy and ligand efficiency predicted by idock, binding affinity predicted by RF-Score, putative hydrogen bonds, and supplier information for easy purchase, three useful features commonly lacking in other online docking platforms like DOCK Blaster or iScreen. We have collected 17,224,424 ligands from the All Clean subset of the ZINC database, and revamped our docking engine idock to version 2.0, further improving docking speed and accuracy, and integrating RF-Score as an alternative rescoring function. To compare idock 2.0 with the state-of-the-art AutoDock Vina 1.1.2, we have carried out a rescoring benchmark and a redocking benchmark on the 2,897 and 343 protein-ligand complexes of PDBbind v2012 refined set and CSAR NRC HiQ Set 24Sept2010 respectively, and an execution time benchmark on 12 diverse proteins and 3,000 ligands of different molecular weight. Results show that, under various scenarios, idock achieves comparable success rates while outperforming AutoDock Vina in terms of docking speed by at least 8.69 times and at most 37.51 times. When evaluated on the PDBbind v2012 core set, our istar platform combined with RF-Score manages to reproduce Pearson's correlation coefficient and Spearman's correlation coefficient of as high as 0.855 and 0.859 respectively between the experimental binding affinity and the predicted binding affinity of the docked conformation. istar is freely available at http://istar.cse.cuhk.edu.hk/idock. MESH Headings: Algorithms Databases, Protein Ligands Molecular Conformation Molecular Docking Simulation Protein Binding Protein Conformation Proteins/chemistry Software Web Browser
47	Machine learning prediction of cancer cell sensitivity to drugs based on genomic and chemical properties. PLoS One 2013;8:e61318. [PMID: 23646105 PMCID: PMC3640019 DOI: 10.1371/journal.pone.0061318] [Citation(s) in RCA: 271] [Impact Index Per Article: 24.6] [Received: 10/26/2012] [Accepted: 03/07/2013] [Indexed: 12/24/2022] Open Abstract Predicting the response of a specific cancer to a therapy is a major goal in modern oncology that should ultimately lead to a personalised treatment. High-throughput screenings of potentially active compounds against a panel of genomically heterogeneous cancer cell lines have unveiled multiple relationships between genomic alterations and drug responses. Various computational approaches have been proposed to predict sensitivity based on genomic features, while others have used the chemical properties of the drugs to ascertain their effect. In an effort to integrate these complementary approaches, we developed machine learning models to predict the response of cancer cell lines to drug treatment, quantified through IC₅₀ values, based on both the genomic features of the cell lines and the chemical properties of the considered drugs. Models predicted IC₅₀ values in an 8-fold cross-validation and an independent blind test with coefficient of determination R² of 0.72 and 0.64 respectively. Furthermore, models were able to predict with comparable accuracy (R² of 0.61) IC₅₀ values of cell lines from a tissue not used in the training stage. Our in silico models can be used to optimise the experimental design of drug-cell screenings by estimating a large proportion of missing IC₅₀ values rather than experimentally measuring them. The implications of our results go beyond virtual drug screening design: potentially thousands of drugs could be probed in silico to systematically test their potential efficacy as anti-tumour agents based on their structure, thus providing a computational framework to identify new drug repositioning opportunities as well as ultimately be useful for personalized medicine by linking the genomic traits of patients to drug sensitivity. MESH Headings: Analysis of Variance Antineoplastic Agents/pharmacology Antineoplastic Agents/therapeutic use Artificial Intelligence Computer Simulation Drug Resistance, Neoplasm/genetics Genomics/methods Humans Inhibitory Concentration 50 Neoplasms/drug therapy Neoplasms/genetics Pharmacogenetics/methods Workflow Grants: G0902106 Medical Research Council Wellcome Trust Cancer Research UK
48	Hierarchical virtual screening for the discovery of new molecular scaffolds in antibacterial hit identification. J R Soc Interface 2012;9:3196-207. [PMID: 22933186 PMCID: PMC3481598 DOI: 10.1098/rsif.2012.0569] [Citation(s) in RCA: 55] [Impact Index Per Article: 4.6] [Indexed: 01/03/2023] Open Abstract One of the initial steps of modern drug discovery is the identification of small organic molecules able to inhibit a target macromolecule of therapeutic interest. A small proportion of these hits are further developed into lead compounds, which in turn may ultimately lead to a marketed drug. A commonly used screening protocol used for this task is high-throughput screening (HTS). However, the performance of HTS against antibacterial targets has generally been unsatisfactory, with high costs and low rates of hit identification. Here, we present a novel computational methodology that is able to identify a high proportion of structurally diverse inhibitors by searching unusually large molecular databases in a time-, cost- and resource-efficient manner. This virtual screening methodology was tested prospectively on two versions of an antibacterial target (type II dehydroquinase from Mycobacterium tuberculosis and Streptomyces coelicolor), for which HTS has not provided satisfactory results and consequently practically all known inhibitors are derivatives of the same core scaffold. Overall, our protocols identified 100 new inhibitors, with calculated Kᵢ ranging from 4 to 250 μM (confirmed hit rates are 60% and 62% against each version of the target). Most importantly, over 50 new active molecular scaffolds were discovered that underscore the benefits that a wide application of prospectively validated in silico screening tools is likely to bring to antibacterial hit identification.
49	Comments on “Leave-Cluster-Out Cross-Validation Is Appropriate for Scoring Functions Derived from Diverse Protein Data Sets”: Significance for the Validation of Scoring Functions. J Chem Inf Model 2011;51:1739-41. [DOI: 10.1021/ci200057e] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Indexed: 01/10/2023]
50	A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking. Bioinformatics 2010;26:1169-75. [PMID: 20236947 DOI: 10.1093/bioinformatics/btq112] [Citation(s) in RCA: 451] [Impact Index Per Article: 32.2] [Indexed: 11/13/2022] Abstract MOTIVATION: Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is an extremely challenging task. The scoring functions that attempt such computational prediction are essential for analysing the outputs of molecular docking, which in turn is an important technique for drug discovery, chemical biology and structural biology. Each scoring function assumes a predetermined theory-inspired functional form for the relationship between the variables that characterize the complex, which also include parameters fitted to experimental or simulation data and its predicted binding affinity. The inherent problem of this rigid approach is that it leads to poor predictivity for those complexes that do not conform to the modelling assumptions. Moreover, resampling strategies, such as cross-validation or bootstrapping, are still not systematically used to guard against the overfitting of calibration data in parameter estimation for scoring functions. RESULTS: We propose a novel scoring function (RF-Score) that circumvents the need for problematic modelling assumptions via non-parametric machine learning. In particular, Random Forest was used to implicitly capture binding effects that are hard to model explicitly. RF-Score is compared with the state of the art on the demanding PDBbind benchmark. Results show that RF-Score is a very competitive scoring function. Importantly, RF-Score's performance was shown to improve dramatically with training set size and hence the future availability of more high-quality structural and interaction data is expected to lead to improved versions of RF-Score. CONTACT: pedro.ballester@ebi.ac.uk; jbom@st-andrews.ac.uk SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.